NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

Values in registry that should be numbers/integers are indexed as strings in JSON API response #153

Closed tloubrieu-jpl closed 1 month ago

tloubrieu-jpl commented 9 months ago

💡 Description

⚔️ Parent Epic / Related Tickets

No response

alexdunnjpl commented 7 months ago

@tloubrieu-jpl could I bug you for more details?

alexdunnjpl commented 7 months ago

Example document/fields (node: psa-prod, id: urn:esa:psa:em16_tgo_cas:data_raw:cas_raw_sc_20230309t163649-20230309t163653-23606-60-blu-1122087748-29-3::3.0):

{
  "pds:Axis_Array/pds:sequence_number": [
    "1",
    "2"
  ],
  "em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": [
    "10.0"
  ]
}
alexdunnjpl commented 7 months ago

Of note is the fact that this document may not have had repairkit ever run on it - it lacks version metadata for any sweeper - though it's possible a pre-versioning repairkit may have been run on it.

Simplest way to confirm is probably just to write a test and run it against various versions (current, and all pre-versioning versions) of the repairkit sweeper.

If it isn't due to a bug in repairkit, then it seems like harvest may not be storing values correctly, assuming that the field metadata in the DD is present and correct.

jordanpadams commented 7 months ago

@alexdunnjpl 👍 . not sure where this bug is. that being said, there is definitely some funkiness happening across the schemas based upon this warning I have seen from our EN cross-cluster search:

(screenshot, 2024-02-02: Index Patterns page showing a field type conflict warning)

This is from the Index Patterns page where it is trying to bring all the indexes fields together, and it is colliding with their types. I imagine something went weird here with past versions of Harvest (hopefully not the current), and we just need to do a scrub of the types.

@tloubrieu-jpl ☝️

alexdunnjpl commented 7 months ago

@tloubrieu-jpl @jordanpadams I'm softly skeptical that this is new/erroneous behaviour and that OpenSearch was intended to use float/int types when storing numerically-defined fields.

@jordanpadams if you're seeing collisions, does that suggest that some field is differently-typed across different nodes? If so, that would suggest that at some point, harvest changed from (I assume) varied field types to keyword.

Thoughts?

tloubrieu-jpl commented 7 months ago

Hi @alexdunnjpl ,

Thanks for looking into that. My basis for thinking we had a regression (all numbers being returned as strings) is that in the past (approximately 2022, maybe later) we had integration tests that filtered on numbers, see for example {{baseUrl}}/collections?limit=10&q=_file_size ne 8042.

I then removed these tests when I transitioned the integration tests to the I&T team, so that we could restart from something less chaotic than what the test suite had turned into. So I am not sure when the numbers stopped being numbers.

So you are saying that none of the values are stored as numbers in OpenSearch. Does the schema confirm that? Is the current version of harvest pushing documents with numbers converted to strings?

Thanks

tloubrieu-jpl commented 7 months ago

Sorry I think @alexdunnjpl you already answered my question on harvest not converting values to string.

alexdunnjpl commented 7 months ago

This ElasticSearch issue seems to indicate that while numeric values are stored as-is in _source, they are indexed as strings if typed as keyword.

It seems fraught to intentionally write numerics to keyword-typed fields, on that basis alone.

@tloubrieu-jpl going the other direction, could you give an example of a node, document and numeric field which is in OpenSearch as a numeric type rather than a string?

Regarding the query in the tests, I'm not familiar with the internals of our query-language parsing, but I'd expect that OpenSearch will happily take a numeric type in a query and cast it when doing a comparison, if necessary. I'll test that assumption now.

alexdunnjpl commented 7 months ago

Confirmed: OS will accept a numeric to compare with a keyword field, convert it to a string, and perform a lexicographic comparison (resulting in erroneous results, for example finding that 9.0 > "10.0")
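The pitfall can be illustrated with a minimal sketch (the values here are chosen for illustration, not taken from the index):

```python
# When numerics are stored as keyword (string) values, comparisons
# become lexicographic rather than numeric.

numeric_a, numeric_b = 9.0, 10.0
string_a, string_b = "9.0", "10.0"

# Numeric comparison behaves as expected: 9.0 < 10.0.
assert numeric_a < numeric_b

# Lexicographic comparison of the same values is reversed,
# because the character "9" sorts after "1".
assert string_a > string_b
```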

So it looks like there are two potential issues:

Looking at the psa-prod index, I'm seeing plenty of numeric fields, e.g.

"cart:Equirectangular/cart:latitude_of_projection_origin": {
  "type": "double"
},
"cart:Equirectangular/cart:longitude_of_central_meridian": {
  "type": "double"
},
"cart:Equirectangular/cart:standard_parallel_1": {
  "type": "double"
},
"cart:Geo_Transformation/cart:upperleft_corner_x": {
  "type": "double"
},
"cart:Geo_Transformation/cart:upperleft_corner_y": {
  "type": "double"
},
"cart:Geodetic_Model/cart:a_axis_radius": {
  "type": "double"
},
"cart:Geodetic_Model/cart:b_axis_radius": {
  "type": "double"
},
"cart:Geodetic_Model/cart:c_axis_radius": {
  "type": "double"
},

but the field I'm using as a test example is keyword:

"em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": {
  "type": "keyword"
},

so either:

in either case, seems necessary to

alexdunnjpl commented 7 months ago

Suggest moving this ticket to harvest

tloubrieu-jpl commented 7 months ago

@alexdunnjpl @jordanpadams I moved the ticket to the harvest repository. From what was said before, I suggest 2 actions:

  1. check that the latest version of harvest is not breaking the schema
  2. develop a utility to straighten the schema; it could include a feature for the ticket https://github.com/NASA-PDS/registry-api/issues/351. This development could be a sub-script of registry-manager or registry-sweeper; my first thought was to make it part of registry-manager/registry-common.

tloubrieu-jpl commented 7 months ago

The properties which we expect to be number are:

      "pds:Record_Character.pds:record_length" : [ "110" ],
      "pds:File.pds:file_size" : [ "3850" ],
      "pds:Table_Character.pds:records" : [ "35" ],

As seen in request https://pds.nasa.gov/api/search/1/products

alexdunnjpl commented 7 months ago

@tloubrieu-jpl @jordanpadams I'm trying to inspect the relevant index, but I'm not having much luck tracking it down.

The first product returned by that request is `urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0`, which has `"ops:Harvest_Info.ops:node_name": "PDS_SBN"`. Directly inspecting the OpenSearch `registry` index mappings for `search-sbnpsi-prod` and `search-sbnumb-prod` yields none of those three properties. Is it just a bad assumption on my part that that's where they'd be?

Is there a way to quickly determine which OS node is serving a particular document?

jordanpadams commented 7 months ago

@alexdunnjpl I have no idea what is going on here... I will try all the registries. Not sure how this is possible.

jordanpadams commented 7 months ago

It looks like the collection is there, but the product is not...

jordanpadams commented 7 months ago

@alexdunnjpl actually I lied.

curl -u <user> 'https://search-sbnumd-prod-o5i2rnn265gnwmv2quk4n7uram.us-west-2.es.amazonaws.com/registry/_search?q={_id:"urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0"}&pretty=true'

alexdunnjpl commented 7 months ago

Sorry @jordanpadams - didn't mean to send you off to do the thing I was lazily avoiding!

So the issue with the "missing" property mappings was that I forgot to convert the separators from `.` to `/` :man_facepalming:

The good-ish news is that those are all mapped as long in the index.

alexdunnjpl commented 7 months ago

More good-ish news.

Searches appear to respect the mapping type, i.e. when performing comparisons the values aren't compared lexicographically and therefore aren't returning erroneous results.

e.g. this query:

{
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "pds:File/pds:records": {
                            "gte": 0,
                            "lte": 35
                        }
                    }
                }
            ]
        }
    },
    "_source": {
        "includes": ["pds:File/pds:records"]
    },
    "size": 3,
    "track_total_hits": true
}

returns the following (note presence of value "7")

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 150,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "registry",
                "_type": "_doc",
                "_id": "urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0",
                "_score": 1.0,
                "_source": {
                    "pds:File/pds:records": [
                        "35"
                    ]
                }
            },
            {
                "_index": "registry",
                "_type": "_doc",
                "_id": "urn:nasa:pds:gbo-kpno:mosaic-9p:reduced_2005_july0_conv_stand_july04_tab::1.0",
                "_score": 1.0,
                "_source": {
                    "pds:File/pds:records": [
                        "7"
                    ]
                }
            },
            {
                "_index": "registry",
                "_type": "_doc",
                "_id": "urn:nasa:pds:gbo-kpno:mosaic-9p:reduced_200_standards_centers_july04_tab::1.0",
                "_score": 1.0,
                "_source": {
                    "pds:File/pds:records": [
                        "31"
                    ]
                }
            }
        ]
    }
}

alexdunnjpl commented 7 months ago

Testing via the API appears to indicate that comparisons are being correctly performed:

~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "100")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
1194175
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "1000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
501838
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "10000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
150101
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "100000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
19194
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "1000000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
219
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products'   --data-urlencode 'limit=0'   --data-urlencode 'q=(pds:File.pds:records gt "10000000")'   --data-urlencode 'start=0'   -H 'accept: application/json' | jq .summary.hits
11

Note that the `records gt` threshold increases tenfold each time, and the hit count drops drastically in a manner inconsistent with lexicographic comparison, which would result in fairly consistent hit counts.
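As a sanity check on that reasoning, here is a minimal sketch using a made-up set of record counts (not real registry data) to contrast numeric and lexicographic threshold filtering:

```python
# Toy dataset standing in for pds:File/pds:records values (illustrative only).
records = ["7", "35", "150", "1200", "48000", "999999", "10000001"]

def numeric_hits(threshold):
    """Count values greater than the threshold, compared numerically."""
    return sum(1 for r in records if float(r) > threshold)

def lexicographic_hits(threshold):
    """Count values greater than the threshold, compared as strings."""
    return sum(1 for r in records if r > str(threshold))

for t in (100, 1000, 10000):
    # Numeric hit counts fall as the threshold rises;
    # lexicographic counts stay flat at 7 for this dataset.
    print(t, numeric_hits(t), lexicographic_hits(t))
```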

@tloubrieu-jpl @jordanpadams with this in mind, what disconnect, if any, still exists between current and desired behaviour, and is it a regression, or a new feature (if that's known for certain)?

jordanpadams commented 7 months ago

@alexdunnjpl, @tloubrieu-jpl may have a specific instance where this was breaking, but, in general, I think this is very tightly coupled with https://github.com/NASA-PDS/registry/issues/230. For this particular case, it may have been something we broke in the schema during re-indexing or repair kit or ???, but, in general, I think we may need a way to validate our OpenSearch schema types match the expectations of our released information model schemas, and repair those schema fields where there is a mismatch.

I will try to poke through Kibana to identify where OpenSearch sees a mismatch in schema types across all the clusters.

alexdunnjpl commented 7 months ago

Roger that - thanks Jordan!

jordanpadams commented 7 months ago

@alexdunnjpl per my comment on our Sprint Tag, it doesn't look like there is a way for me to look this up. I just see the screenshot from above that there is something out there.

Is there any way we can just develop a generic sweeper that goes through fields and checks their type matches what we have in the schema types?

alexdunnjpl commented 7 months ago

@jordanpadams sure can! You just want detection/report, or with automatic sync and re-index?

What should the source of truth be for mapping fields onto their correct opensearch index mapping type?

jordanpadams commented 7 months ago

@alexdunnjpl the source of truth is the JSON files online (e.g. this one). see Harvest for how it gets those files.

I'm thinking this ticket may actually be 2 parts:

  1. Functionality to output a report
  2. Functionality to force-update the field types using registry manager and the JSON schema needed to update those fields?

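A minimal sketch of what item 1 (the report) might look like, assuming the index mapping and the expected types have already been fetched into plain dicts; `mismatch_report` and both sample dicts below are hypothetical, not existing registry-sweepers code:

```python
# Hedged sketch of a type-mismatch report. `index_mapping` would come from
# GET /<index>/_mapping; `expected_types` from the information-model JSON
# files mentioned above. Both are shown here as hypothetical in-memory dicts.

def mismatch_report(index_mapping, expected_types):
    """Return {field: (actual, expected)} for every field whose OpenSearch
    mapping type differs from the type the schema says it should have."""
    report = {}
    for field, props in index_mapping.items():
        actual = props.get("type")
        expected = expected_types.get(field)
        if expected is not None and actual != expected:
            report[field] = (actual, expected)
    return report

mapping = {
    "em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": {"type": "keyword"},
    "cart:Equirectangular/cart:latitude_of_projection_origin": {"type": "double"},
}
expected = {
    "em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": "double",
    "cart:Equirectangular/cart:latitude_of_projection_origin": "double",
}
# Flags only pixel_width, which is keyword in the index but double per schema.
print(mismatch_report(mapping, expected))
```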
jordanpadams commented 6 months ago

@alexdunnjpl @tloubrieu-jpl have we figured out what is going on here? are we good to close this out or is there still work to be done?

alexdunnjpl commented 6 months ago

@jordanpadams implementation of your two items in the previous comment are still outstanding.

We can close this ticket iff tickets exist for those two items, otherwise this should stay open. Probably best to split into new tickets since 1 will take much less time than 2.

alexdunnjpl commented 6 months ago

@jordanpadams the registry-mgr dd family of commands are totally orthogonal to this functionality, yeah?

jordanpadams commented 6 months ago

@alexdunnjpl correct. I don't think this is really a registry-mgr or harvest thing to date. They both work with LDDs and trying to strongly type things in the schema, but nothing specifically dealing with manipulating or reviewing the schema after the fact

jordanpadams commented 1 month ago

created NASA-PDS/registry-sweepers#128 and NASA-PDS/registry-sweepers#129. closing this one.