NASA-IMPACT / veda-data

Fix STAC `proj:` values and project extension schema for existing items #57

Open jsignell opened 1 year ago

jsignell commented 1 year ago

1) projection extension schema version should be v1.1.0
2) proj:epsg should be type int and proj:shape should be type List[int]

Context

I was debugging an issue on VEDA where epsg was being read as a float and causing some issues.

import pystac
from pystac.extensions.projection import ProjectionExtension

item = pystac.Item.from_file("https://staging-stac.delta-backend.com/collections/nceo_africa_2017/items/AGB_map_2017v0m_COG")
print(ProjectionExtension.has_extension(item))  # False
print(item.properties["proj:epsg"])  # 4326.0

I decided that maybe there was a mismatch between the STAC version and the extension version, so I pulled the file locally (using wget) and changed proj ext schema version from v1.0.0 to v1.1.0
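
For reference, that edit could also be scripted with pystac rather than editing the JSON by hand; a minimal sketch (overwriting the local copy) might look like:

import pystac

item = pystac.Item.from_file("AGB_map_2017v0m_COG")
# Swap the projection extension schema URL from v1.0.0 to v1.1.0
item.stac_extensions = [
    ext.replace("projection/v1.0.0", "projection/v1.1.0") for ext in item.stac_extensions
]
item.save_object(include_self_link=False, dest_href="AGB_map_2017v0m_COG")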

import pystac
from pystac.extensions.projection import ProjectionExtension

item = pystac.Item.from_file("AGB_map_2017v0m_COG")
print(ProjectionExtension.has_extension(item))  # True
print(item.properties["proj:epsg"])  # 4326.0

This time the proj ext is understood, but since "proj:epsg" is stored as a float in the STAC item json it is in the properties dict as a float.

I opened a ticket on pystac to try to get type coercion based on extension support (https://github.com/stac-utils/pystac/issues/1044), but this is still something that VEDA should correct in its own STAC.

vincentsarago commented 1 year ago

@kathrynberger ☝️ we saw that this morning too (it made stacstack fail)

emileten commented 1 year ago

Hi! I am encountering the same problem in the MAAP STAC catalog, which uses a fork of this repo for data publication. Here is an example (if you check the raw data, you can see epsg is a float).

I have been looking at where this could be happening in the pipeline, without success so far.

jsignell commented 1 year ago

Just a note that I am not at all sure that the source of this is within this repo. It seems possible that this is coming from data-pipelines or from rio-stac or even from rasterio itself.

emileten commented 1 year ago

We found out with @wildintellect that this bug is also in the AWS ASDI STAC catalog, which means the bug doesn't have to do with VEDA specific code. It might be happening at the ingestion step, I am going to dig into that today.

emileten commented 1 year ago

For context again: we read metadata from a COG and turn it into a STAC item using rio-stac, then give this as input to an ingestor that loads it into the STAC catalog database. The problem is that when we query that database afterwards, one of the properties (proj:epsg) is a float whereas it should be an int.

So, there are a couple of places this could come from. I have put a check mark where I verified the problem does not come from there:

I confirmed it's not the first by testing this (which relies on rio-stac), and that it's not the last by verifying that a direct SQL query to the database does return a float.
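
For reference, a minimal sketch of that first check (building an item straight from a local COG with rio-stac and inspecting the types; the filename is a placeholder):

from rio_stac.stac import create_stac_item

item = create_stac_item("example.tif", with_proj=True)
# If rio-stac is not the culprit, these come back as native Python ints
print(type(item.properties["proj:epsg"]))
print(type(item.properties["proj:shape"][0]))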

Maybe it's in the ingestor? AWS ASDI pipelines and MAAP use the same ingestor code. Although it's wrapped in different ways (the former calls it from cdk-pgstac), the ingestion code is the same. However, as a first attempt I ran the ingestor unit tests locally (swapping the example item with mine) and could not reproduce the problem (the proj:epsg type was preserved).

Although I do see some lines of code that could cause this, where type conversions are happening, I tried to run these locally and again could not reproduce the problem.

emileten commented 1 year ago

Actually, the ingestor unit tests aren't even using a pgstac database.

I turned to looking at the database code itself (pgstac) and queried some details about the structure of the data: the item itself has the type jsonb (binary JSON) in the database, and within it, proj:epsg has the type number. The two numeric types in JSON Schema are number and integer, but I find it a little confusing that, per the documentation, the number type also accepts 'integers' ❓ It looks like what could be happening is that when the data enters or exits the database we force all numbers to be of type number, which accepts e.g. 34506.0 as an integer.
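
For what it's worth, the int/float split only materializes when the JSON text is parsed into a language type; a quick Python illustration:

import json

# JSON itself has a single "number" type; the int/float distinction appears at parse time
print(json.loads("4326"), type(json.loads("4326")))      # 4326 <class 'int'>
print(json.loads("4326.0"), type(json.loads("4326.0")))  # 4326.0 <class 'float'>
print(json.loads("4326.0") == json.loads("4326"))        # True, the values compare equal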

I dug into the pypgstac unit tests, which have an item example with this exact proj:epsg field. I added some print statements in the test_hydrate and test_dehydrate tests, and it seems that the integer type is preserved. Looks like I am blocked for now πŸ€”

bitner commented 1 year ago

Pgstac should be storing those properties exactly as they were in the data that was ingested. The pgstac search function does try to be forgiving of mismatched types and should still match searches for filters like proj:epsg=4326, proj:epsg='4326', and proj:epsg=4326.0, but the data returned will be exactly what had been ingested.

Pgstac does not do any validation either on the way in or the way out.
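
For illustration, a search with such a filter might look like this with pystac-client (assuming the target API supports the CQL2 filter extension; the endpoint and collection are taken from the example earlier in this thread):

from pystac_client import Client

catalog = Client.open("https://staging-stac.delta-backend.com")
search = catalog.search(
    collections=["nceo_africa_2017"],
    # pgstac should match this whether the stored value is 4326, '4326', or 4326.0
    filter={"op": "=", "args": [{"property": "proj:epsg"}, 4326]},
)
for item in search.items():
    print(item.id, item.properties["proj:epsg"])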

sharkinsspatial commented 1 year ago

This issue is being caused by the use of DynamoDB as an intermediate storage layer in the ingestion process. When a Python dict is loaded into DynamoDB via boto3, the native Python types are converted to DynamoDB's more limited types (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/dynamodb.html#valid-dynamodb-types). When the Ingestion is later deserialized (https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/ingestor.py#L44-L46) for insertion into pgstac, the Ingestion's item property should be deserialized using the Item class's typing.

However, this does not account for STAC extensions (like the projection extension, https://github.com/stac-extensions/projection, whose epsg property is an int) included in the item's structure. To handle this, it appears that a utility function is used to convert DynamoDB Decimal values to native Python floats, but this type conversion logic is incomplete, so Decimal values that should be converted to native Python ints are not.

I believe some more detailed logic to handle a wider range of type conversion is necessary, something like

import decimal

def replace_decimals(obj):
    # Recursively convert DynamoDB Decimal values back to native Python numbers
    if isinstance(obj, list):
        for i in range(len(obj)):
            obj[i] = replace_decimals(obj[i])
        return obj
    elif isinstance(obj, dict):
        for k in obj.keys():
            obj[k] = replace_decimals(obj[k])
        return obj
    elif isinstance(obj, decimal.Decimal):
        # Whole-valued Decimals (e.g. 4326) become ints, everything else floats
        if obj % 1 == 0:
            return int(obj)
        else:
            return float(obj)
    else:
        return obj
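
For context, a quick sketch of where the Decimals come from and how the helper above would repair them (values are illustrative):

import decimal
from boto3.dynamodb.types import TypeDeserializer

# DynamoDB hands every number back as a Decimal, regardless of the original Python type
raw = TypeDeserializer().deserialize({"N": "4326"})
print(repr(raw))                    # Decimal('4326')
print(repr(replace_decimals(raw)))  # 4326, an int again
print(replace_decimals({"proj:epsg": decimal.Decimal("4326"), "gsd": decimal.Decimal("30.5")}))
# {'proj:epsg': 4326, 'gsd': 30.5}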

cc @alukach

bitner commented 1 year ago

Would it make sense / be more reliable to just save the item in dynamo just as text rather than adding the lossy serialize/deserialize step?
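
A minimal sketch of what that could look like (the table name and key schema here are hypothetical):

import json
import boto3

table = boto3.resource("dynamodb").Table("ingestions")  # hypothetical table name

def put_ingestion(ingestion_id: str, item: dict) -> None:
    # Store the STAC item as an opaque JSON string so DynamoDB never touches the number types
    table.put_item(Item={"id": ingestion_id, "item": json.dumps(item)})

def get_ingestion(ingestion_id: str) -> dict:
    # json.loads restores ints as ints and floats as floats
    return json.loads(table.get_item(Key={"id": ingestion_id})["Item"]["item"])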

alukach commented 1 year ago

I think @sharkinsspatial accurately captured the issue. Looking at the Boto3 codebase, we can see that it deserializes any number into a decimal, which is done within our codebase when we retrieve the item from the DynamoDB stream (just before writing it to pgSTAC): https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/4032ea5ebfebb32ddadea06010685e81b74611f4/api/src/ingestor.py#L43-L56

The VEDA-STAC-Ingestor codebase is littered with patches for working around DynamoDB's Decimal datatype:

I agree with @bitner that we should just be storing the STAC item in Dynamo as a string (i.e. json.dumps(item)). We lose some abilities in DynamoDB that we could theoretically get by storing it as a Map (e.g. filtering by item attributes), but that's not something we ever do.

The simplest way to serialize/deserialize the Ingestion.item attribute may be to add a custom Pydantic model validator or helper methods to the model. Alternatively, we could shoehorn that in when we read/write to DynamoDB.
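
One possible shape for the validator route, sketched with Pydantic v1 (the field names and the DynamoDB helper are illustrative, not the actual ingestor model):

import json
from pydantic import BaseModel, validator

class Ingestion(BaseModel):
    id: str
    item: dict

    @validator("item", pre=True)
    def parse_item(cls, v):
        # Accept either the JSON string stored in DynamoDB or an already-parsed dict
        return json.loads(v) if isinstance(v, str) else v

    def dynamodb_dict(self) -> dict:
        # Serialize the item back to a string before handing it to DynamoDB
        d = self.dict()
        d["item"] = json.dumps(self.item)
        return d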

Either way, this would allow us to remove a lot of the DynamoDB type handling cruft that we currently have throughout the service.

sharkinsspatial commented 1 year ago

I think the most straightforward serialization method here would be to pass a custom encoder to the .json call that dumps the item property.
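
For example, a small encoder along these lines (shown with plain json.dumps for illustration; the intent is to turn whole-valued Decimals back into ints when the item is dumped):

import decimal
import json

class DecimalEncoder(json.JSONEncoder):
    def default(self, obj):
        # Emit whole-valued Decimals as ints and everything else as floats
        if isinstance(obj, decimal.Decimal):
            return int(obj) if obj % 1 == 0 else float(obj)
        return super().default(obj)

print(json.dumps({"proj:epsg": decimal.Decimal("4326"), "gsd": decimal.Decimal("30.5")}, cls=DecimalEncoder))
# {"proj:epsg": 4326, "gsd": 30.5}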

The deeper question for @bitner will be how we should retroactively update the jsonb values in the existing pgstac instance where values defined as int in the relevant jsonschema have been stored with a trailing .0.

At a high semantic level, the issue @jsignell is reporting is not actually a problem with how the epsg value is stored in or served from pgstac, as according to jsonschema 4326.0 is a valid int representation. I think the main issue is that pystac is not correctly coercing this representation to a Python int, so beyond changing our storage here, we should address this upstream issue.

sharkinsspatial commented 1 year ago

πŸ˜„ Looks like @jsignell is already on top of this ☝️ ref https://github.com/stac-utils/pystac/issues/1044

emileten commented 1 year ago

@sharkinsspatial, looks like you are working on a fix on the ingestor side, am I right? https://github.com/developmentseed/cdk-pgstac/pull/26

sharkinsspatial commented 1 year ago

@emileten I was putting together a fix but got sidetracked a bit yesterday. I need to add some more comprehensive tests (there don't seem to be any for the ingestor code paths) and I'll finalize the PR πŸ‘

sharkinsspatial commented 1 year ago

@emileten We put together a fix for this in cdk-pgstac that you can likely back port here. Please note that I did experience issues with the collection summary update https://github.com/NASA-IMPACT/veda-backend/blob/develop/database/runtime/handler.py#L341 failing in integration tests, but as discussed in Slack we are going to remove this from cdk-pgstac in a subsequent PR in favor of the summary update mechanism in pgstac==0.7.0.

bitner commented 1 year ago

A very heavy-handed approach for back-fixing data already in pgstac, which should fix any instances of extraneous decimal places and would likely be faster than other options, would be to just use a regular expression.

UPDATE items SET content = 
    jsonb_set(
        content,
        '{properties}',
       regexp_replace(content->>'properties', '[^\d](\d+)\.0+(?=[^\d])', '\1', 'g')::jsonb
    )
WHERE 
    content->>'properties' ~ '[^\d]\d+\.0+[^\d]'
;

I would DEFINITELY try to test this first before running it across everything!

abarciauskas-bgse commented 1 year ago

Just capturing my understanding of the code before I run it: it finds anything in properties that includes a . followed by one or more trailing 0's, captures only the integer part of the number, and replaces any instance of the pattern with that integer. I'm going to try running this with a limit of 10 in the test database and see how long it takes as more items are included (I would expect run time to scale linearly)...

abarciauskas-bgse commented 1 year ago

Update: I made a slight modification to the code to time it and roll back, which looks like this:

BEGIN;

EXPLAIN ANALYZE
WITH limited_items AS (
    SELECT id
    FROM items
    WHERE content->>'properties' ~ '[^\d]\d+\.0+[^\d]'
    LIMIT 10 -- Set the desired limit here
)
UPDATE items
SET content = 
    jsonb_set(
        content,
        '{properties}',
        regexp_replace(content->>'properties', '[^\d](\d+)\.0+(?=[^\d])', '\1', 'g')::jsonb
    )
FROM limited_items
WHERE items.id = limited_items.id;

ROLLBACK;

However when I run it I get this error:

ERROR:  invalid input syntax for type json
DETAIL:  Expected string, but found "57448".
CONTEXT:  JSON data, line 1: ... "2016-03-31T23:5959Z", "proj:bbox": 533724,57448...

I'm not sure exactly what this means, but I could imagine that 533724,57448 is not valid JSON since it is not in a list. Looking at the jsonb content itself, the data is stored as a list in the jsonb, so why is this query returning it not in a list?
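
A hedged guess at what is going on, illustrated with Python's re (which should behave like regexp_replace for this pattern): the leading [^\d] boundary is consumed as part of the match, so the opening bracket of the array gets stripped out along with the .0:

import re

s = '"proj:bbox": [533724.0, 57448.0]'
print(re.sub(r'[^\d](\d+)\.0+(?=[^\d])', r'\1', s))
# "proj:bbox": 533724,57448]  <- the "[" and the space were swallowed by the match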

abarciauskas-bgse commented 1 year ago

More testing may be needed, but this code does appear to work to update the proj:epsg field specifically:


EXPLAIN ANALYZE
UPDATE items
SET content = jsonb_set(
    content,
    '{properties, proj:epsg}',
    regexp_replace(
        (content->'properties'->>'proj:epsg')::text,
        '(\d+)\.0',
        '\1',
        'g'
    )::jsonb
)
WHERE (content->'properties'->>'proj:epsg')::text ~ '(\d+)\.0';

Questions:

jsignell commented 1 year ago

  • Do we need to account for more than one trailing zero?

I think no.

  • Not a question, but we probably want to add that plus some sort of boundary (like a comma)

If we are only looking for one zero and only on specific fields that we know should be ints, then maybe we do not need a boundary

  • What collections and fields do we need to check (acknowledging that David Bitner's solution was supposed to address all of them)? I realized after the fact that I don't think all collections have these proj fields. I ran in test and saw the items' proj:epsg for https://stac.test.maap-project.org/collections/AfriSAR_AGB_Maps_1681/items get updated, but I'm not sure if there are other collections in test I can check. (I should have checked this first: which collections actually have proj fields.)

Any item that has "https://stac-extensions.github.io/projection/v1.0.0/schema.json" in its stac_extensions will have at least the proj:epsg field. Maybe you could use the search API to filter for items where that extension is enabled (a rough sketch of that kind of sweep follows at the end of this comment)? At first glance though, it looks like 's1-rtc-seasonal-composite' is another collection that you could check.

  • I noticed that another field that probably needs to be updated is proj:shape - it seems clear these should be ints. But what about proj:geometry?

I took a look at https://github.com/stac-extensions/projection/blob/main/json-schema/schema.json and it looks like proj:shape should be an integer, but proj:geometry contains numbers.
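
A rough sketch of that kind of sweep with pystac-client, using the MAAP test endpoint mentioned above (this walks collections client-side rather than relying on a server-side extension filter, which not every API supports):

from pystac_client import Client

catalog = Client.open("https://stac.test.maap-project.org")
for collection in catalog.get_collections():
    for item in collection.get_items():
        if any("projection/v1.0.0" in ext for ext in item.stac_extensions):
            print(collection.id, item.id, item.properties.get("proj:epsg"))
            break  # one example per collection is enough for a survey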

bitner commented 1 year ago

It looks like I need to change that regex_replace to use a look ahead assertion to deal with the list.

This version of the regex should just delete the .0 (or .00, etc.) rather than replacing the match with the first part of the number. It also treats the boundaries a bit more specifically.

Just to test the regex_replace --

SELECT regexp_replace('{"proj:epsg": 4326.0, "proj:bbox":[12345.1234,1234.0,1234.000]}'::jsonb::text, '(?<=[ ,\[:"]\d+)\.0+(?=[ ,\]\}"])', '', 'g')::jsonb ;

We are using a look-behind assertion (?<=[ ,\[:"]\d+) that says only find a match where there are any number of digits \d+ preceded by a space, comma, colon, quote, or opening square bracket, and a look-ahead assertion (?=[ ,\]\}"]) that says that after the match there must be a space, comma, quote, closing square bracket, or closing curly bracket. The match \.0+ will match a period followed by any number of zeros. The match, which is just the decimal part of the number (".0" or ".000"), is then replaced with an empty string. (The earlier version tried to do this by extracting the integer part rather than by removing the decimal part.) You could replace the \.0+ with \.0 if you only cared about a single trailing zero. This version will remove any number of trailing zeros, but due to the look-ahead, it will not replace something like ".0001".

We only want to run the update on rows that actually have the issue, so this tests the regex to see whether we need to make the fix. We don't need the look-ahead or look-behind assertions here because we aren't trying to isolate just the matching part.

SELECT '{"proj:epsg": 4326.0, "proj:bbox":[12345.1234,1234.0,1234.000]}'::jsonb::text ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]';
SELECT '{"proj:epsg": 4326, "proj:bbox":[12345.1234,1234,1234]}'::jsonb::text ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]';

Putting these together, just to double check that this works on items records, we can run a select:

SELECT regexp_replace(content->>'properties', '(?<=[ ,\[:"]\d+)\.0+(?=[ ,\]\}"])', '', 'g')::jsonb FROM items WHERE content->>'properties' ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]' LIMIT 100;

Note that we are using the "->>" operator for the json which returns text rather than json, so we do not need to additionally cast to text.

This will try to fix ANYTHING in the entire json. If we want to isolate to a single property, things get more complicated as we then need to use jsonb_set as you figured out already.

SELECT jsonb_set(content, '{properties, proj:epsg}', regexp_replace(content->'properties'->>'proj:epsg', '(?<=[ ,\[:"]\d+)\.0+(?=[ ,\]\}"])', '', 'g')::jsonb ) FROM items WHERE content->'properties'->>'proj:epsg' ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]' LIMIT 100;

Note how we are now using content->'properties' rather than content->>'properties' because we want the properties object and not the text representation.

In json, there is no distinction between ints and floats, there is only number, so this should be safe for all numeric fields as 4326.0 == 4326.

Putting this together to update:

UPDATE items SET content=jsonb_set(content, '{properties}', regexp_replace(content->>'properties', '(?<=[ ,\[:"]\d+)\.0+(?=[ ,\]\}"])', '', 'g')::jsonb)
WHERE content->>'properties' ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]';

or

UPDATE items SET content=jsonb_set(content, '{properties, proj:epsg}', regexp_replace(content->'properties'->>'proj:epsg', '(?<=[ ,\[:"]\d+)\.0+(?=[ ,\]\}"])', '', 'g')::jsonb)
WHERE content->'properties'->>'proj:epsg' ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]';

This would then have to be run for each property that you wanted to specifically fix.

abarciauskas-bgse commented 1 year ago

Thanks so much @bitner and @jsignell for your attention to this.

So there are generally 2 options:

  1. Run a regex_replace which will look for all instances of trailing zeroes and remove them
  2. Run a regex_replace on specific fields (proj:epsg and proj:shape) which should only be integers.

I'm leaning toward (2) since I am more certain about what changes might be made by targeting specific fields. As we have noted, the only 2 fields identified so far which specifically should be integers and do not make sense as floats are proj:epsg and proj:shape.

Other fields may have floats with only trailing zeros, like bbox and geometry in the item spec and proj:geometry in the projection extension spec. However, floats in those fields are still valid.

@emileten @jjfrench @jsignell @bitner are there any other things I should consider when choosing between (1) and (2)?

sharkinsspatial commented 1 year ago

@abarciauskas-bgse As I mentioned in the internal Slack thread the other day, I would suggest option 2. Because you have ingested a limited range of items (with a known set of extensions) the most reliable approach would be

abarciauskas-bgse commented 1 year ago

I have written some scripts. This turned out to be a bit more involved than I expected because of how some of the fields are nested, so I ended up writing functions for each specific field we need to update given the extensions we are using - which are projection and raster. I haven't cleaned it up and I've only run the queries on our MAAP test database so far but here is a link to the notebook: https://github.com/MAAP-Project/veda-data-pipelines/blob/ab/add-update-floats-script/scripts/floats-to-ints-with-sql.ipynb

bitner commented 1 year ago

@abarciauskas-bgse Here are some functions that actually traverse the json tree and will fix any path that has a key in the array set at the bottom of the jsonb_paths_tofix function. The key in that array can be any key in the json path leading to the value. By creating the functions in the pg_temp schema, they are temporary functions that only exist for the current session.

To test: SELECT content, pg_temp.fix_jsonb_ints(content) FROM items LIMIT 10;

To use in an update: UPDATE items SET content=pg_temp.fix_jsonb_ints(content) WHERE content != pg_temp.fix_jsonb_ints(content);

Note that we want to limit the actual updates to where there actually are any changes. UPDATES that do nothing can actually be pretty bad in Postgres as they can lead to fragmentation in the underlying storage reducing performance.

CREATE OR REPLACE FUNCTION pg_temp.jsonb_paths_tofix (IN jdata jsonb, OUT path text[], OUT value jsonb) RETURNS
SETOF RECORD AS $$
with recursive extract_all as
(
    select
        ARRAY[key]::text[] as path,
        value
    FROM jsonb_each(jdata)
union all
    select
        path || coalesce(obj_key, (arr_key- 1)::text),
        coalesce(obj_value, arr_value)
    from extract_all
    left join lateral
        jsonb_each(case jsonb_typeof(value) when 'object' then value end)
        as o(obj_key, obj_value)
        on jsonb_typeof(value) = 'object'
    left join lateral
        jsonb_array_elements(case jsonb_typeof(value) when 'array' then value end)
        with ordinality as a(arr_value, arr_key)
        on jsonb_typeof(value) = 'array'
    where obj_key is not null or arr_key is not null
)
SELECT 
    path,
    to_jsonb((value->>0)::float)
FROM extract_all
WHERE 
    jsonb_typeof(value) = 'number' 
    AND (value->>0)::float::text != value->>0
    AND path && '{proj:epsg,proj:shape,bits_per_sample,buckets}'::text[] -- Change this array to change what fields get fixed
;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

CREATE OR REPLACE FUNCTION pg_temp.fix_jsonb_ints(IN jdata jsonb) RETURNS jsonb AS $$
DECLARE
    rec record;
    out jsonb := jdata;
BEGIN
    FOR rec IN SELECT * FROM pg_temp.jsonb_paths_tofix(jdata)
    LOOP
        RAISE NOTICE 'SETTING % TO %', rec.path, rec.value;
        out := jsonb_set(out, rec.path, rec.value);
    END LOOP;
    RETURN out;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE PARALLEL SAFE;

abarciauskas-bgse commented 1 year ago

Thank you @bitner ! Adding @omshinde as he is working with me on this task now and may want to test the approach in your comment above.

abarciauskas-bgse commented 1 year ago

Just noting we may not implement the fix in MAAP, as we are currently planning to re-run our data publication using cdk-pgstac, which has a fix for this issue (so newly ingested items would have these integer fields properly stored as ints).

omshinde commented 1 year ago

Update: The notebook https://github.com/MAAP-Project/veda-data-pipelines/blob/ab/add-update-floats-script/scripts/floats-to-ints-with-sql.ipynb fixes the issue of shape stored as floats instead of ints.

Based on testing on the ABLVIS1B dataset.

shape as floats (screenshot)

shape as ints after executing the notebook (screenshot)

I am yet to test the functions provided by @bitner above.

abarciauskas-bgse commented 1 year ago

@omshinde thanks so much for testing this. For MAAP, we have updated our ingestor codebase, so I don't think we need this fix anymore, right @emileten @jjfrench? I would check in with @wildintellect and @jjfrench to see if there is something else to help with!

omshinde commented 1 year ago

Thanks @abarciauskas-bgse! Yeah, we kind of discussed this during the last scrum meeting (yesterday). Apologies for not updating it here.

j08lue commented 1 year ago

For VEDA, fixing the data will follow fixing the ingest mechanism https://github.com/NASA-IMPACT/veda-backend/issues/172.

jsignell commented 1 year ago

Based on the conversation yesterday, we are considering this issue blocked until https://github.com/NASA-IMPACT/veda-backend/issues/172 is in.

alexgleith commented 1 year ago

I was pointed this way via the new blog post from DevSeed, here: https://developmentseed.org/blog/2023-10-20-asdi/

The proj:epsg codes as floats made their way into the new AWS STAC Catalog, here: https://dev.asdi-catalog.org/, which means I can't load those items. Example here: https://gist.github.com/alexgleith/0d705d526b596b7a00fbfa589b5d3fd0

Just bumping as this is a tiny detail that has significant impact :-)

wildintellect commented 1 year ago

@jjfrench :point_up: did we have a ticket to fix the AWS DEM collection? We must have worked around it for https://github.com/developmentseed/asdi-examples/blob/main/cop-dem/cop-dem-glo-30-stac-access.ipynb