fvankrieken closed this pull request 3 months ago
holding off my review until the tests are passing
The only failing test is the one missing credentials, which is also failing on main; this is good to go
have to do gft PR and will get back to the review later today
I have some thoughts on commit #6. Geoparquet is a comparatively new data format, and I think it’s fair to say it may undergo future iterations, including to its metadata schema. If true, it adds overhead for us to maintain the pydantic classes to align with the schema. Like Alex mentioned, perhaps we could do it on the fly using something like this?
I think that makes some sense. Most of this generated code is not from the geoparquet specification though, it's from projjson (which is a subfield of the geoparquet metadata), which is a bit more mature. Geoparquet is pretty minimal - just the `Columns` class and `GeoParquet` class in commit 6.
And bizarre seeing you link that `FileMetaData` class, because I was about to say "well that's what I used"... but I don't see it in my code anywhere? For some reason I have that field as "Any" in models/.../parquet.py. That's supposed to be this exact object. See this line, which returns a `FileMetaData` that gets fed into the `parquet.MetaData` object that I've defined. It doesn't contain geospatial info though - that's just part of an unstructured dict (from python's perspective). We could do a lighter-weight pydantic model that ignores extra args, and just gets the "crs" of `Columns` rather than trying to get the whole column
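A sketch of what that lighter-weight model could look like (class and field names here are hypothetical, not the PR's actual code, assuming pydantic v2):

```python
from typing import Any, Optional

from pydantic import BaseModel, ConfigDict


class SlimColumn(BaseModel):
    """Ignore everything in a geoparquet column entry except what we need."""
    model_config = ConfigDict(extra="ignore")

    crs: Optional[dict[str, Any]] = None  # projjson kept as a plain dict


class SlimGeoParquet(BaseModel):
    """Minimal view of the "geo" metadata blob: required keys only."""
    model_config = ConfigDict(extra="ignore")

    version: str
    primary_column: str
    columns: dict[str, SlimColumn]


meta = SlimGeoParquet.model_validate({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "crs": {"type": "GeographicCRS"}}},
})
```

Unknown keys (like `encoding` above, or anything a future spec version adds) get silently dropped, while a missing `version` or `primary_column` still raises a `ValidationError`.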
Though to my point that projjson is more "mature" - it is, though its json schema itself is only v0.7, meaning they might be refining that still. Might still be best to do something a little more lightweight/dynamic
Could you please explain why you're leaning towards using a class when reading in parquet metadata? Maybe I'm missing the point.
Let me explain my thought process. We only want a relatively small subset of information from parquet metadata, so we don't need to know about the other objects. And when the parquet metadata schema changes in the future, we will get an error querying the dict keys we need regardless of whether we use a class or the raw data representation. I think using a pydantic class (or any other validation class) is more applicable for data that we control, like our recipe templates. And in a situation like parquet, it seems to only add overhead for us.
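The raw-data alternative being described might look something like this (the metadata dict here is illustrative): pull only the keys we need straight out of the parsed dict, and let a `KeyError` surface if the layout shifts.

```python
import json

# Illustrative geoparquet "geo" metadata blob, as parsed from the file.
geo_meta = json.loads(
    '{"version": "1.0.0", "primary_column": "geometry",'
    ' "columns": {"geometry": {"encoding": "WKB", "crs": null}}}'
)

# Query only the keys we actually need; any schema change that removes
# them raises KeyError right here, class or no class.
primary = geo_meta["primary_column"]
crs = geo_meta["columns"][primary]["crs"]
```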
Doesn't need to be a class I guess, mainly I just like type safety. If we read in GeoParquet metadata, we should be able to read `crs` or `epsg_id` or something like that and know it's present and a string (or a more complex object), and throw an error if we can't find it, since having this information is part of the GeoParquet spec, and we rely on it for reading geoparquet files into geopandas. Pydantic offers nice utilities for reading in dictionaries and erroring when they don't match the structure you expect, and we use it all over the code base, hence why I used it here. The codegen tool I used also specifically creates pydantic objects from jsonschema.

If we start working more with different types of CRSs, it's nice to have these as type-safe proper python objects, and even better if they align with accepted standards like projjson (which is already the standard used by geoparquet). I agree that this might not be useful or practical right now in terms of the overhead it adds (the huge projjson file, the complete `Columns` object within the geoparquet metadata)
One cool use case - I think crs's, datums etc. typically have associated bounding boxes in these objects. That would make it really easy to add a check that all geoms in the dataset are within its (supposed) crs's limits, maybe giving us a really simple (if not exhaustive) way to catch incorrect projections, or at least entirely invalid ones. Won't catch things like wrongly assuming a wgs84 dataset is state plane (since state plane's bounding box is much bigger), but at least it would catch things in the other direction. Maybe a little less useful than I thought when I started writing this comment
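A sketch of that check using pyproj's published area-of-use for a CRS (the function name and the corner-point shortcut are my own; a rigorous check would densify the edges rather than transform just two corners):

```python
from pyproj import CRS, Transformer


def bounds_within_crs_area(bounds: tuple[float, float, float, float],
                           crs_like) -> bool:
    """Check dataset bounds (minx, miny, maxx, maxy, in the dataset's own
    CRS) against the CRS's published area of use (lon/lat degrees)."""
    crs = CRS.from_user_input(crs_like)
    aou = crs.area_of_use
    if aou is None:
        return True  # nothing to compare against
    # Transform just the two corner points to lon/lat for comparison;
    # fine for a rough sanity check, not exact for every projection.
    to_lonlat = Transformer.from_crs(crs, CRS.from_epsg(4326), always_xy=True)
    minx, miny = to_lonlat.transform(bounds[0], bounds[1])
    maxx, maxy = to_lonlat.transform(bounds[2], bounds[3])
    return bool(aou.west <= minx <= maxx <= aou.east
                and aou.south <= miny <= maxy <= aou.north)
```

So coordinates wildly outside the CRS's stated extent (entirely invalid projections) would fail, even though a plausible-but-wrong CRS with a bigger bounding box would still slip through, as noted above.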
Regarding the geospatial check you described: we will be doing a validation step on the entire dataset as part of `ingest`, and it's probably best to keep all checks in one "place" using one utility/framework rather than referring to metadata info for validation.
Thanks for sharing your perspective on using pydantic for metadata. I still think it's more than what we need, at least as of today. I'm also okay with it if you strongly believe it's a better approach.
A side recommendation: break PRs into smaller ones to make review faster. Otherwise, good stuff
Added a couple commits to address comments!
LGTM!
https://github.com/NYCPlanning/data-engineering/issues/652
Ended up doing a handful of things, so don't be put off by the thousand lines - there's a bunch of generated stuff in one commit. Definitely one to review commit by commit. This will be squashed
Commits

Minor stuff

`ingest.run`! I added a helper since this is very similar to archiving a raw dataset. Both are tested.

geo stuff

`obj.id`, others you need `obj.root.id`. Could maybe get around it with methods for the class but it was annoying)