Open julietcohen opened 9 months ago
Ingmar provided 2 parquet files for data in adjacent UTM zones (EPSG:32617 and EPSG:32618). These have been uploaded to a new directory: /var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/
This directory also contains 2 geopackage files of the same data.
In order to import a parquet file, use `geopandas.read_parquet()`, which imports the file as a GeoDataFrame (it identifies the geometry column without the user needing to specify it). Note that `pyarrow` must be installed for parquet support:
```python
import geopandas as gpd  # requires pyarrow for parquet support

data = gpd.read_parquet(
    "/var/data/submission/pdg/nitze_lake_change/data_sample_parquet_20240219/32617_river.parquet"
)
```
In the config, we ask the user to specify the extension of the input file here as `ext_input`. `ext_input` is used here when we pair footprints to their vector files (if we are visualizing data that has footprints). When we read in vectors during staging, we use `geopandas.read_file()`. Just before this, we can insert a check for the value of `ext_input` in the config, like `ext_input = config.get('ext_input')`, and use `geopandas.read_parquet()` instead of `read_file()` if `ext_input` is `".parquet"`, and use `read_file()` if the extension is anything else.
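The check described above could be sketched roughly like this. The function names (`select_reader`, `read_vector`) are hypothetical, not the staging code's actual API, and the config is assumed to be a plain dict with an `ext_input` key as described in the issue:

```python
# Hypothetical sketch of the proposed ext_input dispatch; not the
# actual staging implementation.

def select_reader(ext_input):
    """Pick the geopandas reader name based on the configured extension."""
    if ext_input == ".parquet":
        return "read_parquet"
    # any other extension (or a missing ext_input) falls back to read_file
    return "read_file"

def read_vector(path, config):
    """Read a vector file with the reader appropriate for ext_input."""
    import geopandas as gpd  # pyarrow must be installed for parquet support
    reader = getattr(gpd, select_reader(config.get("ext_input")))
    return reader(path)
```

With this in place, `read_vector(path, {"ext_input": ".parquet"})` would call `gpd.read_parquet()`, while any other extension (e.g. `.gpkg`, `.shp`) would go through `gpd.read_file()` as before.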
Apache Parquet is described as a modern alternative to CSV files, and GeoParquet adds interoperable geospatial types (Point, Line, Polygon) to Parquet (source). Initial exploration is needed to determine if and how we can stage vector data in GeoParquet format. This format should be great for processing large quantities of data, as it increases efficiency in analytical use cases.
Suggested by Ingmar Nitze. A good initial step would be to either find a small GeoParquet file or receive one from Ingmar. This should be uploaded to
/var/data/submission/pdg/...