PermafrostDiscoveryGateway / viz-staging

PDG Visualization staging pipeline
Apache License 2.0

Explore GeoParquet file format as an input to staging #40

Open julietcohen opened 3 months ago

julietcohen commented 3 months ago

Apache Parquet is described as a modern alternative to CSV files, and GeoParquet adds interoperable geospatial types (Point, Line, Polygon) to Parquet (source). Initial exploration is needed to determine if and how we can stage vector data in GeoParquet format. Because Parquet is a columnar format, it should be well suited to processing large quantities of data, particularly in analytical use cases.

Suggested by Ingmar Nitze. A good initial step would be to either find a small GeoParquet file or receive one from Ingmar. This should be uploaded to /var/data/submission/pdg/...

julietcohen commented 2 months ago

Ingmar provided 2 parquet files for data in adjacent UTM zones (EPSG:32617 and EPSG:32618). These have been uploaded to a new directory: /var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/. This directory also contains 2 geopackage files of the same data.

julietcohen commented 3 weeks ago

In order to import a parquet file:

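# geopandas.read_parquet() uses pyarrow under the hood, so pyarrow must be
# installed, but neither pyarrow nor geoparquet needs to be imported here.
import geopandas as gpd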

data = gpd.read_parquet("/var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/32617_river.parquet")

In the config, we ask the user to specify the extension of the input files with the ext_input option. ext_input is also used when we pair footprints to their vector files (if we are visualizing data that has footprints).

When we read in the vectors to be staged, we use geopandas.read_file(). Just before this, we can insert a check for the value of ext_input in the config like this:

ext_input = config.get('ext_input')

We would then use geopandas.read_parquet() instead of read_file() if ext_input is ".parquet", and keep using read_file() for any other extension.
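A minimal sketch of that check, not the actual viz-staging code: the function names is_parquet_ext and read_vector_file are hypothetical, and config is assumed to be any object with a dict-like get() method. Accepting "parquet" without the leading dot, or in upper case, is also an assumption about what users might put in the config.

```python
def is_parquet_ext(ext_input):
    """Return True if the configured input extension denotes a Parquet file.

    Hypothetical helper: normalizes case and a missing leading dot, so
    ".parquet", "parquet", and ".PARQUET" are all treated the same way.
    """
    if not ext_input:
        return False
    return "." + ext_input.lower().lstrip(".") == ".parquet"


def read_vector_file(path, config):
    """Read a vector file into a GeoDataFrame, choosing the reader by ext_input.

    Hypothetical wrapper around the two geopandas readers named above.
    """
    # Imported lazily so is_parquet_ext() can be used without geopandas installed.
    import geopandas as gpd

    if is_parquet_ext(config.get("ext_input")):
        # read_parquet() requires pyarrow to be installed.
        return gpd.read_parquet(path)
    # Any other extension (e.g. ".gpkg", ".shp") goes through read_file() as before.
    return gpd.read_file(path)
```

With this in place, the existing read_file() call would become something like read_vector_file(path, config), and non-parquet inputs would behave exactly as they do now.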