NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Reproject ingested datasets #857

Closed · fvankrieken closed this 3 months ago

fvankrieken commented 4 months ago

https://github.com/NYCPlanning/data-engineering/issues/652

Ended up doing a handful of things, so don't be put off by the thousand lines; a bunch of it is generated code in one commit. Definitely one to review commit by commit. This will be squashed.

Commits

Minor stuff

  1. Actually archive datasets at the end of ingest.run! I added a helper since this is very similar to archiving a raw dataset. Both are tested.
  2. add pyarrow-stubs for mypy, recompile requirements
  3. remove an outdated cli target
  4. put extracted zip files during ingest in the same folder as the file being unzipped (quick sketch after this list)
  5. use zip file logic for dcp_pad
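A minimal sketch of item 4, assuming a small pathlib/zipfile helper along these lines (the function name is hypothetical, not the actual dcpy code):

```python
# Hypothetical helper: extract an archive next to itself instead of into the cwd.
from pathlib import Path
from zipfile import ZipFile


def unzip_in_place(zip_path: Path) -> Path:
    """Extract zip_path into the folder that contains it and return that folder."""
    target_dir = zip_path.parent
    with ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    return target_dir
```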

geo stuff

  1. BIG models. I used datamodel-codegen on the JSON schemas for geoparquet and projjson (used within geoparquet) to generate pydantic models, then edited them a bunch. Seemed nice to be able to read the CRS from a geoparquet file's metadata without loading the whole thing, but it's also overkill. I'm happy to not fully check this in, but it seems like a nice thing to base our own geospatial stuff around. Open to thoughts
    • I think a good question here is - do we see value in having this? Should I just go ahead and pare this down? It is nice to have the geoparquet metadata read in as structured data, but it might be way overkill
    • One good-sized tweak I made from the codegen output was going from RootModel to TypeAlias. RootModel seems to be Pydantic's preferred route for handling these sorts of union types, but it gets clunky when nested (in some cases you need obj.id, in others obj.root.id; you could maybe get around it with methods on the class, but it was annoying). There's a small sketch of the difference after this list.
  2. Check whether a parquet file is geo or not, and load it as a df or geodf accordingly! (sketched after this list)
  3. actually reproject dataframe
  4. alter the geom definition for bpl_libraries to work - theirs isn't actually wkt (which would be "Point(-74.0 40.0)") but instead just x, y ("-74.0, 40.0"). I've added this as a possible geom format. This also moves around a bunch of our geospatial code - moves a few things into models, consolidates to one "df to gdf" function, and makes sure the new functionality is tested.
  5. Code outside of dcpy affected by commit 9
  6. Clean up duplicate postgres utils for reading a table in as a df, and add a read_gdf method - templatedb was reading the table in as a df and then converting, and the old method was covering up errors.
  7. remove a bunch of print statements
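To illustrate the RootModel vs. TypeAlias tweak in item 1, here's a small sketch of the difference; the CRS models and field names are made up for illustration and are not the actual generated projjson models:

```python
# Illustrative only: two made-up CRS models standing in for the projjson union types.
from typing import TypeAlias, Union

from pydantic import BaseModel, RootModel, TypeAdapter


class GeographicCRS(BaseModel):
    type: str
    name: str


class ProjectedCRS(BaseModel):
    type: str
    name: str
    base_crs: GeographicCRS


# RootModel wraps the union, so access goes through .root (clunky when nested):
class CRSRoot(RootModel[Union[GeographicCRS, ProjectedCRS]]):
    pass


wrapped = CRSRoot.model_validate({"type": "GeographicCRS", "name": "WGS 84"})
print(wrapped.root.name)  # note the extra .root hop

# A plain TypeAlias validated via TypeAdapter hands back the member model directly:
CRS: TypeAlias = Union[GeographicCRS, ProjectedCRS]
crs = TypeAdapter(CRS).validate_python({"type": "GeographicCRS", "name": "WGS 84"})
print(crs.name)  # no .root indirection
```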
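And for items 2 and 3, a rough sketch of what geo-detection plus reprojection can look like with pyarrow and geopandas; the function names here are hypothetical, not the actual dcpy helpers:

```python
# Sketch: decide between DataFrame and GeoDataFrame by checking for the
# geoparquet "geo" key in the parquet file metadata, then reproject if geo.
import geopandas as gpd
import pandas as pd
import pyarrow.parquet as pq


def read_parquet(path: str) -> pd.DataFrame | gpd.GeoDataFrame:
    """Return a GeoDataFrame if the file carries geoparquet metadata, else a DataFrame."""
    file_metadata = pq.read_metadata(path).metadata or {}
    if b"geo" in file_metadata:  # geoparquet stores its metadata under the "geo" key
        return gpd.read_parquet(path)
    return pd.read_parquet(path)


def reproject(df: pd.DataFrame | gpd.GeoDataFrame, target_crs: str) -> pd.DataFrame | gpd.GeoDataFrame:
    """Reproject only when the input is geospatial; plain DataFrames pass through."""
    if isinstance(df, gpd.GeoDataFrame):
        return df.to_crs(target_crs)
    return df


df = reproject(read_parquet("dataset.parquet"), "EPSG:2263")
```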
sf-dcp commented 4 months ago

holding off my review until the tests are passing

fvankrieken commented 4 months ago

> holding off my review until the tests are passing

The only failing test is the one missing credentials that's also failing on main, so this is good to go

sf-dcp commented 4 months ago

have to do gft PR and will get back to the review later today

sf-dcp commented 4 months ago

I have some thoughts on the #6 commit. Geoparquet is a comparatively new data format, and I think it's fair to say it may undergo future iterations, including to its metadata schema. If true, it adds overhead for us to maintain the pydantic classes to keep them aligned with the schema. Like Alex mentioned, perhaps we could do it on the fly using something like this?

fvankrieken commented 4 months ago

> I have some thoughts on the #6 commit. Geoparquet is a comparatively new data format, and I think it's fair to say it may undergo future iterations, including to its metadata schema. If true, it adds overhead for us to maintain the pydantic classes to keep them aligned with the schema. Like Alex mentioned, perhaps we could do it on the fly using something like this?

I think that makes some sense. Most of this generated code isn't from the geoparquet specification though; it's from projjson (which is a subfield of the geoparquet metadata), which is a bit more mature. The geoparquet part is pretty minimal - just the Columns class and GeoParquet class in commit 6.

And it's bizarre seeing you link that FileMetaData class, because I was about to say "well, that's what I used"... but I don't see it in my code anywhere? For some reason I have that field as "Any" in models/.../parquet.py. It's supposed to be this exact object. See this line, which returns a FileMetaData that gets fed into the parquet.MetaData object I've defined. It doesn't contain geospatial info though - that is just part of an unstructured dict (from python's perspective). We could do a lighter-weight pydantic model that ignores extra args and just gets the "crs" of Columns rather than trying to model the whole column.
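For reference, a sketch of what that lighter-weight model could look like, assuming pydantic's extra="ignore" config and the "geo"/"columns"/"crs" layout from the geoparquet metadata spec; illustrative, not the code in this PR:

```python
# Sketch: ignore everything in the geoparquet "geo" metadata except each column's crs.
import json

import pyarrow.parquet as pq
from pydantic import BaseModel, ConfigDict


class GeoColumn(BaseModel):
    model_config = ConfigDict(extra="ignore")
    crs: dict | None = None  # projjson object; None means OGC:CRS84 per the spec


class GeoMetadata(BaseModel):
    model_config = ConfigDict(extra="ignore")
    primary_column: str
    columns: dict[str, GeoColumn]


def read_geo_metadata(path: str) -> GeoMetadata | None:
    """Return structured geo metadata if the parquet file is geoparquet, else None."""
    file_metadata = pq.read_metadata(path).metadata or {}
    raw = file_metadata.get(b"geo")
    return GeoMetadata.model_validate(json.loads(raw)) if raw else None
```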

fvankrieken commented 4 months ago

Though to my point that projjson is more "mature" - it is, but its json schema itself is only at v0.7, meaning they might still be refining it. Might still be best to do something a little more lightweight/dynamic.

sf-dcp commented 4 months ago

> Though to my point that projjson is more "mature" - it is, but its json schema itself is only at v0.7, meaning they might still be refining it. Might still be best to do something a little more lightweight/dynamic.

Could you please explain why you're leaning towards using a class when reading in parquet metadata? Maybe I'm missing the point.

Let me explain my thought process. We only want a relatively small subset of the information in the parquet metadata, so we don't need to know about the other objects. And when the parquet metadata schema changes in the future, we will get an error querying the dict keys we need regardless of whether we use a class or the raw data representation. I think using a pydantic class (or any other validation class) is more applicable for data that we control, like our recipe templates. In the parquet situation, it seems to only add overhead for us.

fvankrieken commented 4 months ago

> Could you please explain why you're leaning towards using a class when reading in parquet metadata? Maybe I'm missing the point.
>
> Let me explain my thought process. We only want a relatively small subset of the information in the parquet metadata, so we don't need to know about the other objects. And when the parquet metadata schema changes in the future, we will get an error querying the dict keys we need regardless of whether we use a class or the raw data representation. I think using a pydantic class (or any other validation class) is more applicable for data that we control, like our recipe templates. In the parquet situation, it seems to only add overhead for us.

It doesn't need to be a class, I guess; mainly I just like type safety. If we read in GeoParquet metadata, we should be able to read crs or epsg_id or something like that, know it's present and a string (or a more complex object), and throw an error if we can't find it, since having this information is part of the GeoParquet spec and we rely on it for reading geoparquet files into geopandas. Pydantic offers nice utilities for reading in dictionaries and erroring when they don't match the structure you expect, and we use it all over the code base, hence why I used it here. The codegen tool I used also specifically creates pydantic models from jsonschema.

If we start working more with different types of CRSs, it's nice to have them as type-safe, proper python objects, and even better if they align with accepted standards like projjson (which is already the standard used by geoparquet). I agree that this might not be useful or practical right now given the overhead it adds (the huge projjson file, the complete Columns object within the geoparquet metadata).

fvankrieken commented 4 months ago


One cool use case - I think CRSs, datums, etc. typically have associated bounding boxes in these objects. That would make it really easy to add a check that all geoms in a dataset are within its (supposed) CRS's limits, maybe giving us a really simple (if not exhaustive) way to catch incorrect projections, or at least entirely invalid ones. It won't catch things like wrongly assuming a wgs84 dataset is state plane (since state plane's bounding box is much bigger), but at least it would catch things in the other direction. Maybe a little less useful than I thought when I started writing this comment.
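A rough sketch of that check, assuming pyproj's area_of_use (whose bounds are reported in WGS 84 degrees) and a GeoDataFrame with a declared CRS; illustrative only, not part of this PR:

```python
# Sketch: flag datasets whose geometries fall outside the declared CRS's area of use.
import geopandas as gpd
from pyproj import CRS


def geoms_within_crs_area_of_use(gdf: gpd.GeoDataFrame) -> bool:
    """Return True if the dataset's bounds fit inside its declared CRS's area of use."""
    crs = CRS.from_user_input(gdf.crs)  # assumes gdf.crs is set
    aou = crs.area_of_use  # bounds in WGS 84 degrees, or None if unknown
    if aou is None:
        return True  # nothing to check against
    west, south, east, north = gdf.to_crs("EPSG:4326").total_bounds
    return west >= aou.west and south >= aou.south and east <= aou.east and north <= aou.north
```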

sf-dcp commented 4 months ago

Regarding the geospatial check you described: we will be doing a validation step on the entire dataset as part of ingest, and it's probably best to keep all checks in one "place" using one utility/framework rather than referring to metadata info for validation.

Thanks for sharing your perspective on using pydantic for metadata. I still think it's more than what we need, at least as of today. But I'm also okay with it if you strongly believe it's a better approach.

A side rec: break PRs into smaller ones to make review faster. Otherwise, good stuff

fvankrieken commented 4 months ago

Added a couple commits to address comments!

sf-dcp commented 3 months ago

> Added a couple commits to address comments!

LGTM!