Data library: convert raw to parquet v2.0

sf-dcp commented 6 months ago

This is a continuation of PR #658, which converts raw data to parquet. Related to issue #631.

It's probably easiest to review the PR commit by commit. A large share of the changed files is revised test data.

Major changes:

1) Originally, the to_parquet function was responsible for both reading a local dataset into a dataframe (geopandas if the data is geospatial, pandas otherwise) and writing it out to a parquet/geoparquet file.

In this PR, we are refactoring it into 2 functions: one reads the data into a dataframe, and the other outputs the dataframe to parquet. The first reason is that the function needs to support more and more input data formats, which means more code. The second reason is that the snippet reading local data into a dataframe was duplicated in tests.

2) Implement a zipped input data format in to_parquet in 2 steps: unzip the file, then use the existing code to process the unzipped file. I added tests for zipped csv, zipped shapefile, and zipped geodatabase.

3) Implement a csv format that has longitude and latitude columns instead of a single geometry column. A test is included.

4) Name the geometry column geom in the output parquet file.

5) Because of the added data formats, expand the test code that generates fake data. Note: the test code is becoming messy, and we will refactor it in a separate PR, since that will touch tests outside of the to_parquet function.
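For reviewers, changes 1 and 2 can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the names `read_data_to_df`, `write_df_to_parquet`, and the `unzipped_filename` parameter are hypothetical, and it assumes the caller knows which file inside the archive to read.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Optional

import pandas as pd


def read_data_to_df(path: Path, unzipped_filename: Optional[str] = None) -> pd.DataFrame:
    """Read a local dataset into a dataframe (hypothetical helper name)."""
    if path.suffix == ".zip":
        if unzipped_filename is None:
            raise ValueError("unzipped_filename is required for zipped input")
        # Step 1: unzip everything, since shapefiles and geodatabases span several files.
        with tempfile.TemporaryDirectory() as tmp_dir:
            with zipfile.ZipFile(path) as archive:
                archive.extractall(tmp_dir)
            # Step 2: reuse the non-zipped code path. The data is fully loaded
            # into memory before the temporary directory is removed.
            return read_data_to_df(Path(tmp_dir) / unzipped_filename)

    if path.suffix == ".csv":
        return pd.read_csv(path)

    # Assumption: anything else (.shp, .gdb, ...) is geospatial. geopandas is
    # imported lazily so the csv path works without it installed.
    import geopandas as gpd

    return gpd.read_file(path)


def write_df_to_parquet(df: pd.DataFrame, output_path: Path) -> None:
    """Write a pandas or geopandas dataframe to parquet (needs a parquet engine such as pyarrow)."""
    df.to_parquet(output_path, index=False)
```

With the split, tests can call the reader on its own to build the expected dataframe instead of repeating the reading snippet.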

Side note

The to_parquet fn converts geospatial input data into the geoparquet format, so there is no need to explicitly give the output file a .geoparquet extension. This is because a geopandas df is automatically written as geoparquet.
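To make the side note and changes 3 and 4 concrete, here is a rough sketch using public geopandas calls (`points_from_xy`, `rename_geometry`, `GeoDataFrame.to_parquet`). The function names, the default column names, and the EPSG:4326 CRS are assumptions for illustration, not taken from this PR.

```python
from pathlib import Path

import geopandas as gpd
import pandas as pd


def lon_lat_df_to_gdf(
    df: pd.DataFrame,
    lon_col: str = "longitude",
    lat_col: str = "latitude",
    crs: str = "EPSG:4326",  # assumption: coordinates are WGS 84
) -> gpd.GeoDataFrame:
    """Build point geometries from longitude/latitude columns (hypothetical helper)."""
    # x is longitude and y is latitude.
    points = gpd.points_from_xy(df[lon_col], df[lat_col])
    # Copy so the caller's dataframe is left untouched.
    gdf = gpd.GeoDataFrame(df.copy(), geometry=points, crs=crs)
    # Name the geometry column geom, as in change 4.
    return gdf.rename_geometry("geom")


def write_gdf(gdf: gpd.GeoDataFrame, output_path: Path) -> None:
    # GeoDataFrame.to_parquet embeds geo metadata in the file, so a plain
    # .parquet extension still yields geoparquet (requires pyarrow).
    gdf.to_parquet(output_path)
```

Reading the result back with `geopandas.read_parquet` should restore the geom column as the active geometry.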

`TODO` for next PR:

[ ] json
[ ] geojson
[ ] excel
[ ] simplify tests (perhaps by using complete, not partial, fake templates). From a convo with Finn: we can simplify tests by passing a subset of the config object as function input instead of the entire thing.

sf-dcp commented 5 months ago

What the test data used in the test_to_parquet fn looks like:

In .csv format only:
In .csv, .shp, .gdb, zipped .csv/.shp/.gdb formats:

fvankrieken commented 5 months ago

A couple small notes, but other than that this looks great

sf-dcp commented 5 months ago

@fvankrieken, I added 3 commits based on your recs

NYCPlanning / data-engineering