Explore GeoParquet file format as an input to staging

julietcohen commented 3 months ago

Apache Parquet is described as a modern alternative to CSV files, and GeoParquet adds interoperable geospatial types (Point, Line, Polygon) to Parquet (source). Initial exploration is needed to determine whether and how we can stage vector data in GeoParquet format. This format should be well suited to processing large quantities of data, since Parquet's columnar layout makes analytical workloads more efficient.

Suggested by Ingmar Nitze. A good initial step would be to either find a small GeoParquet file or receive one from Ingmar. This should be uploaded to /var/data/submission/pdg/...

julietcohen commented 2 months ago

Ingmar provided 2 parquet files for data in adjacent UTM zones 32617 & 32618. These have been uploaded to a new directory: /var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/ This directory also contains 2 GeoPackage files of the same data.

julietcohen commented 3 weeks ago

In order to import a parquet file:

we can use geopandas.read_parquet(), which imports the file as a GeoDataFrame (it identifies the geometry column without the user needing to specify it)
the pyarrow package needs to be installed
I also had the geoparquet package installed in my Python env, but we can test whether that is necessary

import geopandas as gpd
import geoparquet
import pyarrow

data = gpd.read_parquet("/var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/32617_river.parquet")
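One way to start on the "is geoparquet necessary" question is to check which of these packages are actually present in the environment before and after trying the import. This is only a rough sketch: the helper name `installed` is made up for illustration, and it only reports whether a package can be found, not whether geopandas.read_parquet() needs it.

```python
import importlib.util


def installed(package_name):
    """Return True if the named top-level package can be found in this environment.

    `installed` is a hypothetical helper name, not part of any library.
    """
    return importlib.util.find_spec(package_name) is not None


# Report on the three packages mentioned in this comment.
for name in ("geopandas", "pyarrow", "geoparquet"):
    status = "installed" if installed(name) else "missing"
    print(f"{name}: {status}")
```

If read_parquet() still works in an environment where geoparquet shows as missing, that would suggest the extra package is not required.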

In the config, we ask the user to specify the extension of the input file here. ext_input is used here when we pair footprints to their vector files (if we are visualizing data that has footprints).

When we read in the vectors to be staged, we use geopandas.read_file(). Just before this, we can insert a check for the value of ext_input in the config like this:

ext_input = config.get('ext_input')

Then we can use geopandas.read_parquet() instead of read_file() if ext_input is ".parquet", and keep using read_file() if the extension is anything else.
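The check could look roughly like the sketch below. The function names `pick_reader_name` and `read_vector` are hypothetical (they are not existing viz-staging functions), and `config` is assumed to be any object with a dict-style `.get()`, as in the line above. The comparison is an exact match on ".parquet", so an extension written differently in the config (e.g. without the leading dot) would fall through to read_file().

```python
def pick_reader_name(ext_input):
    """Return the name of the geopandas reader to use for this extension.

    Hypothetical helper: ".parquet" maps to read_parquet, anything else
    (including a missing value) maps to read_file.
    """
    if ext_input == ".parquet":
        return "read_parquet"
    return "read_file"


def read_vector(path, config):
    """Read a vector file into a GeoDataFrame using the reader chosen above.

    Hypothetical wrapper; `config` is assumed to support config.get('ext_input').
    """
    import geopandas as gpd  # imported here so the selection logic is usable on its own

    ext_input = config.get("ext_input")
    reader = getattr(gpd, pick_reader_name(ext_input))
    return reader(path)
```

Usage would then be something like `read_vector(path, config)` in place of the current direct call to geopandas.read_file(), with everything downstream unchanged since both readers return a GeoDataFrame.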

PermafrostDiscoveryGateway / viz-staging

Explore GeoParquet file format as an input to staging #40