jsignell opened this issue 1 year ago
@kathrynberger we saw that this morning too (it made stackstac fail)
Hi! I am encountering the same problem in the MAAP STAC catalog, which uses a fork of this repo for data publication. Here is an example (if you check the raw data, you can see epsg is a float).
I have been looking at where this could be happening in the pipeline, without success so far.
Just a note that I am not at all sure that the source of this is within this repo. It seems possible that this is coming from data-pipelines or from rio-stac or even from rasterio itself.
We found out with @wildintellect that this bug is also in the AWS ASDI STAC catalog, which means the bug isn't specific to VEDA code. It might be happening at the ingestion step; I am going to dig into that today.
For context again: we read metadata from a COG and turn that into a STAC item using rio-stac, then we give this as input to an ingestor that loads it into the STAC catalog database. The problem is that when we query that database afterwards, one of the properties (proj:epsg) is a float whereas it should be an int.
So, there are a couple of places this could come from. I have put a check mark where I verified the problem does not come from there:

- [x] rio-stac
- [ ] veda-ingestor
- [x] pgstac

I confirmed it's not the first by testing this (which relies on rio-stac), and that it's not the last by verifying that a direct SQL query to the database does return a float.
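For reference, that rio-stac check can be reproduced with a minimal sketch like the one below (the local COG path is hypothetical):

from rio_stac import create_stac_item

# Hypothetical local COG; with_proj=True adds the projection extension fields.
item = create_stac_item("./example.tif", with_proj=True)
props = item.to_dict()["properties"]
print(props["proj:epsg"], type(props["proj:epsg"]))  # expect an int for a COG with an EPSG-coded CRS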
Maybe it's in the ingestor? AWS ASDI pipelines and MAAP use the same ingestor code. Although it's wrapped in different ways (the former calls it from cdk-pgstac), the ingestion code is the same. However, as a first attempt I ran the ingestor unit tests locally (swapping the example item with mine) and I could not reproduce the problem (the proj:epsg type was preserved).
Although I do see some lines of code where type conversions happen that could cause this, I tried running them locally and again could not reproduce the problem.
Actually, the ingestor unit tests aren't even using a pgstac database.
I turned to the database code itself (pgstac) and queried some details about the structure of the data: the item itself has type jsonb (binary JSON) in the database, and within it, proj:epsg is of type number. The two numeric types in JSON Schema are number and integer, and I find it a little confusing that in the documentation the number type also accepts integers. It looks like what could be happening is that when the data enters or exits the database we force all the numbers to the number type, which accepts e.g. 34506.0 as an integer.
I dug into the pypgstac unit tests, which have an item example with this exact proj:epsg field. I added some print statements in the test_hydrate and test_dehydrate tests, and it seems that the integer type is preserved there. Looks like I am blocked for now.
Pgstac should be storing those properties exactly as they were in the data that was ingested. The pgstac search function does try to be forgiving of mismatched types and should still match up searches for filters like proj:epsg=4326, proj:epsg='4326' and proj:epsg=4326.0, but the data returned will be exactly what had been ingested.
Pgstac does not do any validation either on the way in or the way out.
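One way to sanity-check that behavior from the client side is a sketch like this, assuming a hypothetical pgstac-backed API with the query extension enabled:

from pystac_client import Client

catalog = Client.open("https://stac.example.com")  # hypothetical endpoint
# Per the comment above, all three representations should match the same items,
# even if the stored value happens to be 4326.0.
for value in (4326, "4326", 4326.0):
    results = catalog.search(query={"proj:epsg": {"eq": value}})
    print(value, results.matched())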
This issue is being caused by the use of DynamoDB as an intermediate storage layer in the ingestion process. When a Python dict is loaded into DynamoDB via Boto3, the native Python types are converted to DynamoDB's more limited types (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/dynamodb.html#valid-dynamodb-types). When the Ingestion is later deserialized (https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/ingestor.py#L44-L46) for insertion into pgstac, the Ingestion's item property should be deserialized using the Item class's typing.
However, this does not account for STAC extensions (like the projection extension, https://github.com/stac-extensions/projection, whose epsg property is an int) included in the item's structure. To handle this, it appears that a utility function is used to convert DynamoDB decimal values to native Python floats, but this type conversion logic is incomplete, so decimal values that should be converted to native Python ints are not.
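For illustration, this is the Boto3 behavior in question (a small sketch; TypeDeserializer is what the DynamoDB resource layer uses under the hood):

from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
# Every DynamoDB number comes back as a Decimal, whether or not it was originally an int.
value = deserializer.deserialize({"N": "4326"})
print(value, type(value))  # Decimal('4326') <class 'decimal.Decimal'>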
I believe some more detailed logic to handle a wider range of type conversion is necessary, something like
import decimal

def replace_decimals(obj):
    # Recursively convert DynamoDB Decimal values back to native Python numbers.
    if isinstance(obj, list):
        return [replace_decimals(v) for v in obj]
    elif isinstance(obj, dict):
        return {k: replace_decimals(v) for k, v in obj.items()}
    elif isinstance(obj, decimal.Decimal):
        # Whole-number Decimals become ints, everything else becomes floats.
        return int(obj) if obj % 1 == 0 else float(obj)
    else:
        return obj
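For example, applied to a deserialized fragment (values are illustrative):

from decimal import Decimal

properties = {
    "proj:epsg": Decimal("4326"),
    "gsd": Decimal("0.5"),
    "proj:shape": [Decimal("3600"), Decimal("3600")],
}
print(replace_decimals(properties))
# {'proj:epsg': 4326, 'gsd': 0.5, 'proj:shape': [3600, 3600]}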
cc @alukach
Would it make sense / be more reliable to just save the item in dynamo as text rather than adding the lossy serialize/deserialize step?
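A sketch of the idea, which sidesteps the DynamoDB number types entirely (illustrative values):

import json

item = {"properties": {"proj:epsg": 4326, "proj:shape": [3600, 3600]}}
serialized = json.dumps(item)      # store this string attribute in DynamoDB
restored = json.loads(serialized)  # parse it back on the way out
assert isinstance(restored["properties"]["proj:epsg"], int)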
I think @sharkinsspatial accurately captured the issue. Looking at the Boto3 codebase, we can see that it deserializes any number into a decimal, which is done within our codebase when we retrieve the item from the DynamoDB stream (just before writing it to pgSTAC): https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/4032ea5ebfebb32ddadea06010685e81b74611f4/api/src/ingestor.py#L43-L56
The VEDA-STAC-Ingestor codebase is littered with patches for working around DynamoDB's Decimal datatype.
I agree with @bitner that we should just be storing the STAC item in Dynamo as a string (i.e. json.dumps(item)). We lose some abilities in DynamoDB that we could theoretically get by storing it as a Map (e.g. filtering by item attributes), but that's not something we ever do.
The simplest way to serialize/deserialize the Ingestion.item attribute may be to add a custom Pydantic model validator or helper methods to the model. Alternatively, we could shoehorn that in when we read/write to DynamoDB.
Either way, this would allow us to remove a lot of the DynamoDB type handling cruft that we currently have throughout the service.
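A rough sketch of the Pydantic approach (pydantic v1-style; the model and field names are illustrative, not the actual ingestor code):

import json
from pydantic import BaseModel, validator

class Ingestion(BaseModel):
    id: str
    item: dict

    @validator("item", pre=True)
    def parse_item(cls, v):
        # Accept either a JSON string (as stored in DynamoDB) or an already-parsed dict.
        return json.loads(v) if isinstance(v, str) else v

    def dynamodb_dict(self) -> dict:
        # Serialize the item to a plain JSON string before writing to DynamoDB,
        # so boto3 never maps its numbers to Decimal.
        d = self.dict()
        d["item"] = json.dumps(d["item"])
        return d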
I think the most straightforward serialization method here would be to pass a custom encoder to the .json call that dumps the item property.
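For instance, a Decimal-aware default hook passed to the dump would cover it (a sketch, not the actual ingestor code):

import json
from decimal import Decimal

def decimal_default(obj):
    # Emit whole-number Decimals as ints and everything else as floats.
    if isinstance(obj, Decimal):
        return int(obj) if obj % 1 == 0 else float(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

print(json.dumps({"proj:epsg": Decimal("4326"), "gsd": Decimal("0.5")}, default=decimal_default))
# {"proj:epsg": 4326, "gsd": 0.5}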
The deeper question for @bitner will be how we should retroactively update the jsonb values in the existing pgstac instance where values defined as int in the relevant jsonschema have been stored with a trailing .0.
At a high semantic level, the issue @jsignell is reporting is not actually a problem with how the epsg value is stored or provided from pgstac, since according to jsonschema 4326.0 is a valid int representation. I think the main issue is that pystac is not correctly coercing this representation to a Python int, so beyond changing our storage here, we should address this upstream issue.
Looks like @jsignell is already on top of this: ref https://github.com/stac-utils/pystac/issues/1044
@sharkinsspatial, it looks like you are working on a fix on the ingestor side, am I right? https://github.com/developmentseed/cdk-pgstac/pull/26
@emileten I was putting together a fix but got sidetracked a bit yesterday. I need to add some more comprehensive tests (there don't seem to be any for the ingestor code paths) and I'll finalize the PR.
@emileten We put together a fix for this in cdk-pgstac that you can likely back-port here. Please note that I did experience issues with the collection summary update https://github.com/NASA-IMPACT/veda-backend/blob/develop/database/runtime/handler.py#L341 failing in integration tests, but as discussed in Slack we are going to remove this from cdk-pgstac in a subsequent PR in favor of the summary update mechanism in pgstac==0.7.0.
A very heavy-handed approach for back fixing data already in pgstac that should fix any instances of extraneous decimal places and would likely be faster than other options would be to just use a regular expression.
UPDATE items SET content =
jsonb_set(
content,
'{properties}',
regexp_replace(content->>'properties', '[^\d](\d+)\.0+(?=[^\d])', '\1', 'g')::jsonb
)
WHERE
content->>'properties' ~ '[^\d]\d+\.0+[^\d]'
;
I would DEFINITELY try to test this first before running it across everything!
Just capturing my understanding of the code before I run it: it finds anything in properties that includes a . followed by one or more trailing 0's, captures only the integer part of the number, and replaces each instance of the pattern with that integer. I'm going to try running this with a limit of 10 in the test database and see how long it takes as more items are included (I would expect run time to scale linearly)...
Update: I made a slight modification to the code to time it and rollback which looks like this:
BEGIN;
EXPLAIN ANALYZE
WITH limited_items AS (
SELECT id
FROM items
WHERE content->>'properties' ~ '[^\d]\d+\.0+[^\d]'
LIMIT 10 -- Set the desired limit here
)
UPDATE items
SET content =
jsonb_set(
content,
'{properties}',
regexp_replace(content->>'properties', '[^\d](\d+)\.0+(?=[^\d])', '\1', 'g')::jsonb
)
FROM limited_items
WHERE items.id = limited_items.id;
ROLLBACK;
However when I run it I get this error:
ERROR: invalid input syntax for type json
DETAIL: Expected string, but found "57448".
CONTEXT: JSON data, line 1: ... "2016-03-31T23:5959Z", "proj:bbox": 533724,57448...
I'm not sure exactly what this means, but I could imagine that 533724,57448 is not valid json since it is not in a list. Looking at the jsonb content itself, the data is stored as a list, so why is this query returning it without the list brackets?
More testing may be needed, but this code does appear to work to update the proj:epsg field specifically:
EXPLAIN ANALYZE
UPDATE items
SET content = jsonb_set(
content,
'{properties, proj:epsg}',
regexp_replace(
(content->'properties'->>'proj:epsg')::text,
'(\d+)\.0',
'\1',
'g'
)::jsonb
)
WHERE (content->'properties'->>'proj:epsg')::text ~ '(\d+)\.0';
Questions:

- What collections and fields do we need to check for (acknowledging that David Bitner's solution was supposed to address all of them)? I realized after the fact that I don't think all collections have these proj fields. I ran in test and saw the items' proj:epsg for https://stac.test.maap-project.org/collections/AfriSAR_AGB_Maps_1681/items get updated, but I'm not sure if there are other collections in test I can check. (I should have checked this first - what collections actually have proj fields?)
- I noticed that another field that probably needs to be updated is proj:shape - it seems clear these should be ints. But what about proj:geometry?
- Do we need to account for more than one trailing zero? I think no.
- Not a question, but we probably want to add that plus some sort of boundary (like a comma).

If we are only looking for one zero and only on specific fields that we know should be ints, then maybe we do not need a boundary.

> What collections and fields do we need to check for? ... what collections actually have proj fields?
Any item that has "https://stac-extensions.github.io/projection/v1.0.0/schema.json" in its stac_extensions will have at least the proj:epsg field. Maybe you could use the search API to filter for items where that extension is enabled? At first glance, it looks like 's1-rtc-seasonal-composite' is another collection that you could check.
> I noticed that another field that probably needs to be updated is proj:shape - it seems clear these should be ints. But what about proj:geometry?

I took a look at https://github.com/stac-extensions/projection/blob/main/json-schema/schema.json and it looks like proj:shape should be an integer, but proj:geometry contains numbers.
It looks like I need to change that regexp_replace to use a look ahead assertion to deal with the list.
This version of the regex just deletes the .0 (or .00, etc.) rather than replacing the matched subset with the first part of the number, and it treats the boundaries more precisely as well.
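For reference, the problem with the earlier pattern can be reproduced outside the database (a Python sketch with illustrative input): the leading [^\d] boundary character is part of the match, so the replacement drops it along with the decimals.

import re

snippet = '"end_datetime": "2016-03-31T23:59:59.000Z", "proj:bbox": [533724.0, 57448.0]'
print(re.sub(r'[^\d](\d+)\.0+(?=[^\d])', r'\1', snippet))
# '"end_datetime": "2016-03-31T23:5959Z", "proj:bbox": 533724,57448]'
# The ':', '[', and the space are consumed, which is what produced the invalid JSON above.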
Just to test the regex_replace --
SELECT regexp_replace('{"proj:epsg": 4326.0, "proj:bbox":[12345.1234,1234.0,1234.000]}'::jsonb::text, '(?<=[ ,\[:"]\d+)\.0+(?=[ ,\]\}"])', '', 'g')::jsonb ;
We are using a look-behind assertion (?<=[ ,\[:"]\d+) that says to only find a match where there are one or more digits \d+ preceded by a space, comma, colon, double quote, or opening square bracket, and a look-ahead assertion (?=[ ,\]\}"]) that says the match must be followed by a space, comma, double quote, closing square bracket, or closing curly bracket. The match \.0+ matches a period followed by one or more zeros. The replacement then removes the match, which is just the decimal part of the number (".0" or ".000"), by substituting an empty string. (The earlier version tried to do this by extracting the integer part rather than by removing the decimal part.) You could replace \.0+ with \.0 if you only cared about a single trailing zero. This version will remove any run of trailing zeros, but due to the look-ahead it will not replace something like ".0001".
We only want to run the update on rows that actually have the issue. The following statements test the regex we use to check whether a row needs the fix. We don't need the look-ahead or look-behind assertions here because we aren't trying to isolate just the matching part.
SELECT '{"proj:epsg": 4326.0, "proj:bbox":[12345.1234,1234.0,1234.000]}'::jsonb::text ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]';
SELECT '{"proj:epsg": 4326, "proj:bbox":[12345.1234,1234,1234]}'::jsonb::text ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]';
Putting these together to just double check that this works on items records, we can just run a select
SELECT regexp_replace(content->>'properties', '(?<=[ ,\[:"]\d+)\.0+(?=[ ,\]\}"])', '', 'g')::jsonb FROM items WHERE content->>'properties' ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]' LIMIT 100;
Note that we are using the "->>" operator for the json which returns text rather than json, so we do not need to additionally cast to text.
This will try to fix ANYTHING in the entire json. If we want to isolate to a single property, things get more complicated as we then need to use jsonb_set as you figured out already.
SELECT jsonb_set(content, '{properties, proj:epsg}', regexp_replace(content->'properties'->>'proj:epsg', '^(\d+)\.0+$', '\1')::jsonb) FROM items WHERE content->'properties'->>'proj:epsg' ~ '^\d+\.0+$' LIMIT 100;
Note how we are now using content->'properties' rather than content->>'properties' because we want the properties object (so we can pull out proj:epsg) and not its text representation.
In json, there is no distinction between ints and floats, there is only number, so this should be safe for all numeric fields as 4326.0 == 4326.
Putting this together to update:
UPDATE items SET content = jsonb_set(content, '{properties}', regexp_replace(content->>'properties', '(?<=[ ,\[:"]\d+)\.0+(?=[ ,\]\}"])', '', 'g')::jsonb)
WHERE content->>'properties' ~ '[ ,\[:"]\d+\.0+[ ,\]\}"]';
or
UPDATE items SET content = jsonb_set(content, '{properties, proj:epsg}', regexp_replace(content->'properties'->>'proj:epsg', '^(\d+)\.0+$', '\1')::jsonb)
WHERE content->'properties'->>'proj:epsg' ~ '^\d+\.0+$';
This would then have to be run for each property that you wanted to specifically fix.
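If you go the per-field route, a small driver can apply the same fix for each field (a sketch using psycopg2; the connection string and field list are assumptions, and the pattern is anchored to the bare value since content->'properties'->>field returns the value without any surrounding boundary characters):

import psycopg2

FIELDS = ["proj:epsg", "proj:shape"]   # fields known to be integers in their schemas
PATTERN = r"(\d+)\.0+(?!\d)"           # whole numbers with trailing zeros only
REPLACEMENT = r"\1"

conn = psycopg2.connect("service=pgstac")  # hypothetical connection settings
with conn, conn.cursor() as cur:
    for field in FIELDS:
        cur.execute(
            """
            UPDATE items
               SET content = jsonb_set(content, %s,
                   regexp_replace(content->'properties'->>%s, %s, %s, 'g')::jsonb)
             WHERE content->'properties'->>%s ~ %s
            """,
            (["properties", field], field, PATTERN, REPLACEMENT, field, PATTERN),
        )
        print(field, cur.rowcount, "rows updated")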
Thanks so much @bitner and @jsignell for your attention to this.
So there are generally 2 options:

1. Fix every numeric value that has extraneous trailing zeros, anywhere in the item json (the general approach above).
2. Fix only the specific fields (proj:epsg and proj:shape) which should only be integers.

I'm leaning toward (2) since I am more certain about what changes might be made by targeting specific fields. As we have noted, the only 2 fields identified so far which specifically should be integers and do not make sense as floats are proj:epsg and proj:shape.
Other fields may have floats with only trailing zeros, like bbox and geometry in the item spec and proj:geometry in the projection extension spec. However, floats in those fields are still valid.
@emileten @jjfrench @jsignell @bitner are there any other things I should consider when choosing between (1) and (2)?
@abarciauskas-bgse As I mentioned in the internal Slack thread the other day, I would suggest option 2. Because you have ingested a limited range of items (with a known set of extensions), the most reliable approach would be to:

1. Identify the properties (across the extensions you use) that have type integer specified in their jsonSchema document.
2. For each property with type integer, check for its existence in the jsonb and remove the . and trailing zeros from the value.

I have written some scripts. This turned out to be a bit more involved than I expected because of how some of the fields are nested, so I ended up writing functions for each specific field we need to update given the extensions we are using - which are projection and raster. I haven't cleaned it up and I've only run the queries on our MAAP test database so far, but here is a link to the notebook: https://github.com/MAAP-Project/veda-data-pipelines/blob/ab/add-update-floats-script/scripts/floats-to-ints-with-sql.ipynb
@abarciauskas-bgse Here are some functions that actually traverse the json tree and will fix any path that has a key in the array set at the bottom of the jsonb_paths_tofix function. The key in that array can be any key in the json path that leads to the value. By creating the functions in the pg_temp schema, they are temporary functions that only exist for the current session.
To test: SELECT content, pg_temp.fix_jsonb_ints(content) FROM items LIMIT 10;
To use in an update: UPDATE items SET content=pg_temp.fix_jsonb_ints(content) WHERE content != pg_temp.fix_jsonb_ints(content);
Note that we want to limit the actual updates to where there actually are any changes. UPDATES that do nothing can actually be pretty bad in Postgres as they can lead to fragmentation in the underlying storage reducing performance.
CREATE OR REPLACE FUNCTION pg_temp.jsonb_paths_tofix (IN jdata jsonb, OUT path text[], OUT value jsonb) RETURNS
SETOF RECORD AS $$
with recursive extract_all as
(
select
ARRAY[key]::text[] as path,
value
FROM jsonb_each(jdata)
union all
select
path || coalesce(obj_key, (arr_key- 1)::text),
coalesce(obj_value, arr_value)
from extract_all
left join lateral
jsonb_each(case jsonb_typeof(value) when 'object' then value end)
as o(obj_key, obj_value)
on jsonb_typeof(value) = 'object'
left join lateral
jsonb_array_elements(case jsonb_typeof(value) when 'array' then value end)
with ordinality as a(arr_value, arr_key)
on jsonb_typeof(value) = 'array'
where obj_key is not null or arr_key is not null
)
SELECT
path,
to_jsonb((value->>0)::float)
FROM extract_all
WHERE
jsonb_typeof(value) = 'number'
AND (value->>0)::float::text != value->>0
AND path && '{proj:epsg,proj:shape,bits_per_sample,buckets}'::text[] -- Change this array to change what fields get fixed
;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE OR REPLACE FUNCTION pg_temp.fix_jsonb_ints(IN jdata jsonb) RETURNS jsonb AS $$
DECLARE
rec record;
out jsonb := jdata;
BEGIN
FOR rec IN SELECT * FROM pg_temp.jsonb_paths_tofix(jdata)
LOOP
RAISE NOTICE 'SETTING % TO %', rec.path, rec.value;
out := jsonb_set(out, rec.path, rec.value);
END LOOP;
RETURN out;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE PARALLEL SAFE;
Thank you @bitner! Adding @omshinde, as he is working with me on this task now and may want to test the approach in your comment above.
Just noting that we may not implement the fix in MAAP, as we are currently planning to re-run our data publication using cdk-pgstac, which has a fix for this issue (so newly ingested items would have these integer fields properly stored as ints).
Update:
The notebook https://github.com/MAAP-Project/veda-data-pipelines/blob/ab/add-update-floats-script/scripts/floats-to-ints-with-sql.ipynb fixes the issue of shape being stored as floats instead of ints, based on testing on the ABLVIS1B dataset.

(Screenshots: shape as floats before, and shape as ints after executing the notebook.)

I am yet to test the functions provided by @bitner above.
@omshinde thanks so much for testing this. For MAAP, we have updated our ingestor codebase so I don't think we need this fix anymore, right @emileten @jjfrench? I would check in with @wildintellect and @jjfrench if there is something else to help with!
Thanks @abarciauskas-bgse! Yeah, we discussed this during the last scrum meeting (yesterday). Apologies for not updating it here.
For VEDA, fixing the data will follow fixing the ingest mechanism https://github.com/NASA-IMPACT/veda-backend/issues/172.
Based on the conversation yesterday, we are considering this issue blocked until https://github.com/NASA-IMPACT/veda-backend/issues/172 is in.
I was pointed this way via the new blog post from DevSeed, here: https://developmentseed.org/blog/2023-10-20-asdi/
The proj epsg codes as float made their way into the new AWS STAC Catalog, here: https://dev.asdi-catalog.org/, which means I can't load those items. Example here: https://gist.github.com/alexgleith/0d705d526b596b7a00fbfa589b5d3fd0
Just bumping as this is a tiny detail that has significant impact :-)
@jjfrench :point_up: did we have a ticket to fix the AWS DEM collection? We must have worked around it for https://github.com/developmentseed/asdi-examples/blob/main/cop-dem/cop-dem-glo-30-stac-access.ipynb
1) The projection extension schema version should be v1.1.0
2) proj:epsg should be type int and proj:shape should be type List[int]

Context

I was debugging an issue on VEDA where epsg was being read as a float and causing some issues.
I decided that maybe there was a mismatch between the STAC version and the extension version, so I pulled the file locally (using wget) and changed the proj extension schema version from v1.0.0 to v1.1.0. This time the proj extension is understood, but since "proj:epsg" is stored as a float in the STAC item json, it ends up in the properties dict as a float.
I opened a ticket on pystac to try to get type coercion based on extension support (https://github.com/stac-utils/pystac/issues/1044), but this is still something that VEDA should correct in its own STAC.
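In the meantime, a client-side workaround sketch for items already published (the local item.json path is hypothetical, and the field list mirrors the ones discussed above):

import json

with open("item.json") as f:  # hypothetical local copy of the item, e.g. fetched with wget
    item = json.load(f)

props = item["properties"]

# Coerce whole-number floats back to ints for the fields the projection extension types as integer.
epsg = props.get("proj:epsg")
if isinstance(epsg, float) and epsg.is_integer():
    props["proj:epsg"] = int(epsg)

shape = props.get("proj:shape")
if isinstance(shape, list):
    props["proj:shape"] = [int(v) if isinstance(v, float) and v.is_integer() else v for v in shape]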