NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Data library: convert raw to parquet v2.0 #729

Closed sf-dcp closed 5 months ago

sf-dcp commented 6 months ago

This PR continues #658, which converts raw data to parquet. Related to issue #631.

It's probably easiest to review this PR commit by commit. A large share of the changed files is revised test data.

Major changes:

1) Originally, the to_parquet function read a local dataset into a pandas or geopandas dataframe (pandas if the data isn't geospatial) and then wrote it to a parquet/geoparquet file.

This PR splits that function in two: one function reads data into a dataframe, and the other writes the dataframe to parquet. There are two reasons. First, the function has to support a growing number of input data formats, which means more code. Second, the snippet that reads local data into a dataframe was repeated in the tests.

2) Support zipped input data in to_parquet in 2 steps: unzip the file, then process the unzipped file with the existing code. I added tests for zipped csv, zipped shapefile, and zipped geodatabase.

3) Support csv files that have longitude and latitude columns instead of a single geometry column. A test is included.

4) Name the geometry column geom in the output parquet file.

5) Because of the added data formats, expand the test code that generates fake data. Note: the test code is becoming messy, and we will refactor it in a separate PR because that refactor will touch tests outside of the to_parquet function.
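
For reviewers who want a picture of changes 2) through 4), here is a rough sketch of the control flow, not the code in this PR. To stay dependency-free it uses the standard library's csv module in place of pandas/geopandas and stores the point as WKT text rather than as a real geometry. The names read_rows, add_geom, and demo are hypothetical, and the column names longitude and latitude are assumed.

```python
import csv
import tempfile
import zipfile
from pathlib import Path


def add_geom(row: dict) -> dict:
    """Add a single `geom` value built from longitude/latitude, if present."""
    if "longitude" in row and "latitude" in row:
        row = dict(row)  # copy so the caller's row is not mutated
        # Assumption: the point is stored as WKT text; the real code
        # would build a geopandas geometry column instead.
        row["geom"] = f"POINT ({row['longitude']} {row['latitude']})"
    return row


def read_rows(path: Path) -> list[dict]:
    """Read a local .csv, or a .zip containing one .csv, into row dicts."""
    if path.suffix == ".zip":
        with tempfile.TemporaryDirectory() as tmp:
            # Step 1: unzip into a scratch directory.
            with zipfile.ZipFile(path) as archive:
                archive.extractall(tmp)
            # Step 2: reuse the existing reader on the unzipped file.
            inner = next(Path(tmp).rglob("*.csv"))
            return read_rows(inner)
    with path.open(newline="") as f:
        return [add_geom(row) for row in csv.DictReader(f)]


def demo() -> list[dict]:
    """Build a tiny zipped csv with lon/lat columns and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "points.csv"
        csv_path.write_text("id,longitude,latitude\n1,-74.0,40.7\n")
        zip_path = Path(tmp) / "points.zip"
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.write(csv_path, arcname="points.csv")
        return read_rows(zip_path)


if __name__ == "__main__":
    print(demo())
```

The zipped branch only unpacks and then delegates, so every format the plain reader already handles is covered once it is zipped, without a second code path per format.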

Side note

The to_parquet fn converts geospatial input data to the geoparquet format, so there is no need to explicitly give the output file a .geoparquet extension. This is because a geopandas dataframe is written as geoparquet automatically.

TODO for next PR:

sf-dcp commented 5 months ago

What the test data used in the test_to_parquet fn looks like:

fvankrieken commented 5 months ago

A couple small notes, but other than that this looks great

sf-dcp commented 5 months ago

@fvankrieken, I added 3 commits based on your recs