NASA-IMPACT / veda-data


Add collection validation #53

Closed gadomski closed 1 year ago

gadomski commented 1 year ago

Includes:

N.B. there were a strange amount of numbers-as-strings in the bboxes, so I'd guess there's something funny going on in the data generation step where the inputs are being stringified? I'll try to track down where this stuff comes from but I'm sure someone already knows :-P.
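As an illustration of the kind of cleanup described above, here is a minimal sketch that coerces stringified numbers in a collection's spatial bboxes back to floats. It assumes the collection is a plain STAC-style dict with bboxes under `extent.spatial.bbox`; the function name `coerce_bbox_numbers` and the sample collection are hypothetical and not taken from this PR.

```python
import json


def coerce_bbox_numbers(collection):
    """Return a copy of a collection dict whose spatial bbox values are floats.

    Hypothetical helper: assumes bboxes live under extent.spatial.bbox.
    """
    # Round-trip through JSON as a cheap deep copy; the input is left untouched.
    fixed = json.loads(json.dumps(collection))
    spatial = fixed.get("extent", {}).get("spatial", {})
    if "bbox" in spatial:
        # float() accepts both real numbers and numeric strings such as "-90".
        spatial["bbox"] = [[float(v) for v in bbox] for bbox in spatial["bbox"]]
    return fixed


# Made-up example input with a mix of strings and numbers.
collection = {
    "id": "example-collection",
    "extent": {"spatial": {"bbox": [["-180.0", "-90", 180, "90.0"]]}},
}
fixed = coerce_bbox_numbers(collection)
```

A non-numeric string raises `ValueError` here, which seems preferable to passing bad values through silently.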

I also don't know if fixing the collections here in this repo is the first/only step -- I'm assuming these changes need to be sync-d with a DB somewhere?

j08lue commented 1 year ago

> there were a strange amount of numbers-as-strings in the bboxes

That sounds familiar. I think we had that with EPSG codes, too, and it had to do with serialization/deserialization, but @anayeaye would know better...

gadomski commented 1 year ago

> I think we had that with EPSG codes, too

The EPSG issue also appeared here: https://github.com/NASA-IMPACT/veda-data/issues/57.

anayeaye commented 1 year ago

@gadomski we have had an assortment of invalid number formats, and I think we determined that the root cause was a DDB JSON serialization step in ingest queueing, which was fixed downstream here. In some cases, updating the stac-extension version in items led to more graceful handling of numeric formats (need to find the ref where @jsignell worked on a pystac fix).
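For readers unfamiliar with this failure mode, the sketch below shows one common way numbers turn into strings during JSON serialization. It is an assumption, not a description of the actual ingestor code: it supposes that values read back from the queue store arrive as `decimal.Decimal` and that they are then dumped with a catch-all `default=str`. The `record` data and the `decimal_to_float` helper are hypothetical.

```python
import json
from decimal import Decimal

# Assumption: numbers read back from the store arrive as Decimal objects.
record = {"bbox": [Decimal("-180.0"), Decimal("-90.0"), Decimal("180.0"), Decimal("90.0")]}

# The json module cannot encode Decimal, so a catch-all default=str
# quietly turns every number into a string.
stringified = json.loads(json.dumps(record, default=str))


def decimal_to_float(value):
    """Encode Decimal as float and reject any other unknown type."""
    if isinstance(value, Decimal):
        return float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# A narrower default keeps the values numeric.
numeric = json.loads(json.dumps(record, default=decimal_to_float))
```

Converting to `float` can lose precision for very large or very precise decimals, which is usually acceptable for bounding boxes but may not be for other fields.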

We've gone back and forth between updating and using the fixed construct from eoapi-cdk (which is behind our pgstac version) and adding the fix to our veda-stac-ingestor for now. @ividito did we end up making a change in veda-stac-ingestor?

ividito commented 1 year ago

> @ividito did we end up making a change in veda-stac-ingestor?

Nope, it made it into our different ingest-api experiments but never got applied to our live ingestor.

> I also don't know if fixing the collections here in this repo is the first/only step -- I'm assuming these changes need to be sync-d with a DB somewhere?

I think these fixes reflect the various changes made in the ingestor at different points in time. I'll double check a few of them, but the version in the DB should be accurate already. Changing the files here will only make sure that future ingests of the same data use the correct input format.

gadomski commented 1 year ago

> need to find the ref where @jsignell worked on a pystac fix

The issue is https://github.com/stac-utils/pystac/issues/1044, but there isn't a fix, and we never opened the tracking issue mentioned in https://github.com/stac-utils/pystac/issues/1044#issuecomment-1506022804. There are also some larger issues in pystac around extensions (https://github.com/stac-utils/pystac/issues/1051, https://github.com/stac-utils/pystac/issues/448) and serialization (https://github.com/stac-utils/pystac/issues/1092), so a correction probably won't come from there anytime soon.