@tloubrieu-jpl could I bug you for more details?
Example document/fields
node: psa-prod
id: urn:esa:psa:em16_tgo_cas:data_raw:cas_raw_sc_20230309t163649-20230309t163653-23606-60-blu-1122087748-29-3::3.0
{
  "pds:Axis_Array/pds:sequence_number": [
    "1",
    "2"
  ],
  "em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": [
    "10.0"
  ]
}
Of note is the fact that this document may not have had repairkit ever run on it - it lacks version metadata for any sweeper - though it's possible a pre-versioning repairkit may have been run on it.
Simplest way to confirm is probably just to write a test and run it against various versions (current, and all pre-versioning versions) of the repairkit sweeper.
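For illustration, such a test could be shaped roughly like this (a minimal sketch: run_repairkit is a hypothetical stand-in for whichever version of the sweeper is under test, not the actual registry-sweepers API):

# Sketch only: run_repairkit stands in for the repairkit version under test.
from typing import Callable, Dict

def check_no_type_coercion(run_repairkit: Callable[[Dict], Dict]) -> None:
    # Field taken from the example document above, but with a bare numeric value
    field = "em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width"
    repaired = run_repairkit({field: 10.0})
    value = repaired[field]
    # Acceptable outcomes: the bare float left alone, or list-wrapped as a float.
    # A failure here would mean the sweeper coerced the numeric to a string.
    assert value in (10.0, [10.0]), f"numeric was coerced: {value!r}"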
If it isn't due to a bug in repairkit, then it seems like harvest may not be storing values correctly, assuming that the field metadata in the DD is present and correct.
@alexdunnjpl 👍 Not sure where this bug is. That being said, there is definitely some funkiness happening across the schemas, based upon a field-type conflict warning I have seen from our EN cross-cluster search (screenshot not reproduced here).
The warning is from the Index Patterns page, where it tries to bring all the indexes' fields together and their types collide. I imagine something went weird here with past versions of Harvest (hopefully not the current one), and we just need to do a scrub of the types.
@tloubrieu-jpl ☝️
@tloubrieu-jpl @jordanpadams I'm softly skeptical that this is new/erroneous behaviour, i.e. that OpenSearch was ever intended to use float/int types when storing numerically-defined fields.
Some observations:

- repairkit does not appear to be responsible for this behaviour (based on a visual check of history - tests will likely confirm). repairkit is (and always has been) limited to converting "string" to "list with a single string element" rather than "non-list type T" to "list with a single T element", though this could have been an oversight in the initial implementation.
- keyword is inherently string-valued - can you even store numeric types in these fields? Need to confirm empirically if nobody can answer off the dome; one possible check is sketched below.
- @jordanpadams if you're seeing collisions, does that suggest that some field is differently-typed across different nodes? If so, that would suggest that at some point, harvest changed from (I assume) varied field types to keyword.
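One way to confirm the keyword question empirically (a rough sketch against a throwaway index; the host, index name, and credentials are placeholders, not any real PDS cluster):

# Sketch of an empirical check against a scratch index.
import requests

BASE = "https://localhost:9200"          # placeholder endpoint
INDEX = "keyword-numeric-scratch"        # throwaway index
AUTH = ("user", "password")              # placeholder credentials

# Map a single field as keyword
requests.put(
    f"{BASE}/{INDEX}",
    json={"mappings": {"properties": {"pixel_width": {"type": "keyword"}}}},
    auth=AUTH,
)

# Index a bare numeric into the keyword-typed field
requests.put(f"{BASE}/{INDEX}/_doc/1?refresh=true", json={"pixel_width": 10.0}, auth=AUTH)

# _source preserves the original JSON, so this should print 10.0 even though
# the indexed term is the string "10.0"
doc = requests.get(f"{BASE}/{INDEX}/_doc/1", auth=AUTH).json()
print(doc["_source"]["pixel_width"])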
Thoughts?
Hi @alexdunnjpl ,
Thanks for looking at that.
My basis for thinking we had a regression (numbers now all stored as strings) is that in the past (approximately 2022, maybe later) we had integration tests which filtered on numbers, see for example {{baseUrl}}/collections?limit=10&q=_file_size ne 8042.
Then I removed these tests when I transitioned the integration tests to the I&T team, so that we could re-start from something less chaotic than what the test suite had turned into. So I am not sure when the numbers stopped being numbers.
So you are saying that none of the values are stored as numbers in OpenSearch. Does the schema confirm that? Is the current version of harvest pushing the documents with numbers converted to strings?
Thanks
Sorry @alexdunnjpl, I think you already answered my question on harvest not converting values to strings.
This ElasticSearch issue seems to indicate that while numeric values are stored as-is in _source, they are indexed as strings if typed as keyword.
It seems fraught to intentionally write numerics to keyword-typed fields, on that basis alone.
@tloubrieu-jpl going the other direction, could you give an example of a node, document and numeric field which is in OpenSearch as a numeric type rather than a string?
Regarding the query in the tests, I'm not familiar with the internals of our query-language parsing, but I'd expect that OpenSearch will happily take a numeric type in a query and cast it when doing a comparison, if necessary. I'll test that assumption now.
Confirmed: OS will accept a numeric to compare with a keyword field, convert it to a string, and perform a lexicographic comparison (resulting in erroneous results, for example finding that 9.0 > "10.0").
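The failure mode is just ordinary string ordering; for example (plain Python for illustration, but the same ordering applies to keyword terms):

# Lexicographic (string) comparison vs the numeric comparison we actually want
print("9.0" > "10.0")   # True  - '9' sorts after '1', so the string comparison wins
print(9.0 > 10.0)       # False - the numeric comparison gives the expected result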
So it looks like there are two potential issues:

- fields which should be numeric are mapped as keyword, when this is inappropriate
- numeric values may be getting written as strings somewhere along the way (though if a field is mapped as keyword but written as int/float, the values should still be returned as int/float in _source, per that ES issue, if the info is still current)

Looking at the psa-prod index, I'm seeing plenty of numeric fields, e.g.
"cart:Equirectangular/cart:latitude_of_projection_origin": {
"type": "double"
},
"cart:Equirectangular/cart:longitude_of_central_meridian": {
"type": "double"
},
"cart:Equirectangular/cart:standard_parallel_1": {
"type": "double"
},
"cart:Geo_Transformation/cart:upperleft_corner_x": {
"type": "double"
},
"cart:Geo_Transformation/cart:upperleft_corner_y": {
"type": "double"
},
"cart:Geodetic_Model/cart:a_axis_radius": {
"type": "double"
},
"cart:Geodetic_Model/cart:b_axis_radius": {
"type": "double"
},
"cart:Geodetic_Model/cart:c_axis_radius": {
"type": "double"
},
but the field I'm using as a test example is keyword:
"em16_tgo_cas:Detector_Information/em16_tgo_cas:pixel_width": {
"type": "keyword"
},
so either the cart* properties were indexed with an older version of harvest and newer versions of harvest switched to using keyword for everything, or something else determines which fields get numeric mappings. In either case, it seems necessary to sort out which fields are mistyped and correct them.
Suggest moving this ticket to harvest.
@alexdunnjpl @jordanpadams I moved the ticket to the harvest repository. From what was said before, I suggest 2 actions:
The properties which we expect to be numbers are:
"pds:Record_Character.pds:record_length" : [ "110" ],
"pds:File.pds:file_size" : [ "3850" ],
"pds:Table_Character.pds:records" : [ "35" ],
As seen in request https://pds.nasa.gov/api/search/1/products
@tloubrieu-jpl @jordanpadams I'm trying to inspect the relevant index, but I'm not having much luck tracking it down.
The first product returned by that request is urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0, which has "ops:Harvest_Info.ops:node_name": "PDS_SBN". Directly inspecting the OpenSearch registry index mappings for search-sbnpsi-prod and search-sbnumb-prod yields none of those three properties. Is it just a bad assumption on my part that that's where they'd be?
Is there a way to quickly determine which OS node is serving a particular document?
@alexdunnjpl I have no idea what is going on here... Will try out all the registries. not sure how this is possible
It looks like the collection is there, but the product is not...
@alexdunnjpl actually I lied.
curl -u <user> 'https://search-sbnumd-prod-o5i2rnn265gnwmv2quk4n7uram.us-west-2.es.amazonaws.com/registry/_search?q={_id:"urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0"}&pretty=true'
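For future reference, the same check could be scripted over all the candidate clusters (a sketch; the cluster endpoints and credentials are placeholders):

# Sketch: find which registry cluster holds a given lidvid.
import requests

LIDVID = "urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0"
CLUSTERS = {
    "sbnpsi": "https://<sbnpsi-endpoint>",   # placeholder endpoints
    "sbnumd": "https://<sbnumd-endpoint>",
}

for name, base in CLUSTERS.items():
    url = f"{base}/registry/_doc/{requests.utils.quote(LIDVID, safe='')}"
    resp = requests.get(url, auth=("user", "password"))
    found = resp.status_code == 200 and resp.json().get("found", False)
    print(name, "has it" if found else "does not have it")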
Sorry @jordanpadams - didn't mean to send you off to do the thing I was lazily avoiding!
So the issue with the "missing" property mappings was that I forgot to convert them from . to / :man_facepalming:
The good-ish news is that those are all mapped as long in the index.
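For reference, the lookup that works after the separator conversion is roughly (a sketch; the cluster endpoint and credentials are placeholders):

# Sketch: API-style property names use '.', but the OpenSearch fields use '/',
# so convert before looking them up in the index mapping.
import requests

API_PROPERTIES = [
    "pds:Record_Character.pds:record_length",
    "pds:File.pds:file_size",
    "pds:Table_Character.pds:records",
]

mapping = requests.get(
    "https://<cluster-endpoint>/registry/_mapping",   # placeholder endpoint
    auth=("user", "password"),                        # placeholder credentials
).json()
properties = next(iter(mapping.values()))["mappings"]["properties"]

for prop in API_PROPERTIES:
    os_field = prop.replace(".", "/")
    print(os_field, "->", properties.get(os_field, {}).get("type"))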
More good-ish news.
Searches appear to respect the mapping type, i.e. when performing comparisons the values aren't compared lexicographically and therefore aren't returning erroneous results.
ex. this query
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "pds:File/pds:records": {
              "gte": 0,
              "lte": 35
            }
          }
        }
      ]
    }
  },
  "_source": {
    "includes": ["pds:File/pds:records"]
  },
  "size": 3,
  "track_total_hits": true
}
returns the following (note the presence of the value "7"):
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 150,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "registry",
        "_type": "_doc",
        "_id": "urn:nasa:pds:compil-comet:nuc_properties:extinct::1.0",
        "_score": 1.0,
        "_source": {
          "pds:File/pds:records": [
            "35"
          ]
        }
      },
      {
        "_index": "registry",
        "_type": "_doc",
        "_id": "urn:nasa:pds:gbo-kpno:mosaic-9p:reduced_2005_july0_conv_stand_july04_tab::1.0",
        "_score": 1.0,
        "_source": {
          "pds:File/pds:records": [
            "7"
          ]
        }
      },
      {
        "_index": "registry",
        "_type": "_doc",
        "_id": "urn:nasa:pds:gbo-kpno:mosaic-9p:reduced_200_standards_centers_july04_tab::1.0",
        "_score": 1.0,
        "_source": {
          "pds:File/pds:records": [
            "31"
          ]
        }
      }
    ]
  }
}
Testing via the API appears to indicate that comparisons are being correctly performed:
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products' --data-urlencode 'limit=0' --data-urlencode 'q=(pds:File.pds:records gt "100")' --data-urlencode 'start=0' -H 'accept: application/json' | jq .summary.hits
1194175
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products' --data-urlencode 'limit=0' --data-urlencode 'q=(pds:File.pds:records gt "1000")' --data-urlencode 'start=0' -H 'accept: application/json' | jq .summary.hits
501838
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products' --data-urlencode 'limit=0' --data-urlencode 'q=(pds:File.pds:records gt "10000")' --data-urlencode 'start=0' -H 'accept: application/json' | jq .summary.hits
150101
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products' --data-urlencode 'limit=0' --data-urlencode 'q=(pds:File.pds:records gt "100000")' --data-urlencode 'start=0' -H 'accept: application/json' | jq .summary.hits
19194
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products' --data-urlencode 'limit=0' --data-urlencode 'q=(pds:File.pds:records gt "1000000")' --data-urlencode 'start=0' -H 'accept: application/json' | jq .summary.hits
219
~$ curl -s --get 'https://pds.nasa.gov/api/search-en/1/products' --data-urlencode 'limit=0' --data-urlencode 'q=(pds:File.pds:records gt "10000000")' --data-urlencode 'start=0' -H 'accept: application/json' | jq .summary.hits
11
Note that the records gt threshold is increasing tenfold each time, and the hit count is being reduced drastically, in a manner inconsistent with lexicographic comparison, which would result in fairly consistent hit counts.
@tloubrieu-jpl @jordanpadams with this in mind, what disconnect, if any, still exists between current and desired behaviour, and is it a regression, or a new feature (if that's known for certain)?
@alexdunnjpl, @tloubrieu-jpl may have a specific instance where this was breaking, but, in general, I think this is very tightly coupled with https://github.com/NASA-PDS/registry/issues/230. For this particular case, it may have been something we broke in the schema during re-indexing or repair kit or ???, but, in general, I think we may need a way to validate our OpenSearch schema types match the expectations of our released information model schemas, and repair those schema fields where there is a mismatch.
I will try to poke through Kibana to identify where OpenSearch sees a mismatch in schema types across all the clusters.
Roger that - thanks Jordan!
@alexdunnjpl per my comment on our Sprint Tag, it doesn't look like there is a way for me to look this up. I just have the screenshot from above indicating that there is something out there.
Is there any way we can just develop a generic sweeper that goes through fields and checks their type matches what we have in the schema types?
@jordanpadams sure can! You just want detection/report, or with automatic sync and re-index?
What should the source of truth be for mapping fields onto their correct opensearch index mapping type?
@alexdunnjpl the source of truth is the JSON files online (e.g. this one); see Harvest for how it gets those files.
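A very rough shape for that type-audit sweeper (a sketch only; the dictionary URL, its JSON layout, and the cluster endpoint are assumptions that would need to be checked against the real files and the harvest code):

# Sketch: compare OpenSearch index mapping types against expected types from
# the data dictionary JSON. URL, JSON structure, and endpoint are assumptions.
import requests

DD_URL = "https://<data-dictionary>.JSON"           # placeholder
CLUSTER = "https://<cluster-endpoint>/registry"     # placeholder
AUTH = ("user", "password")                         # placeholder

# Hypothetical shape: {"pds:File/pds:file_size": "long", ...}
expected_types = requests.get(DD_URL).json()

mapping = requests.get(f"{CLUSTER}/_mapping", auth=AUTH).json()
actual = next(iter(mapping.values()))["mappings"]["properties"]

for field, want in sorted(expected_types.items()):
    got = actual.get(field, {}).get("type")
    if got is not None and got != want:
        print(f"{field}: expected {want}, index has {got}")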
I'm thinking this ticket may actually be 2 parts:
@alexdunnjpl @tloubrieu-jpl have we figured out what is going on here? are we good to close this out or is there still work to be done?
@jordanpadams implementation of your two items in the previous comment are still outstanding.
We can close this ticket iff tickets exist for those two items, otherwise this should stay open. Probably best to split into new tickets since 1 will take much less time than 2.
@jordanpadams the registry-mgr dd family of commands is totally orthogonal to this functionality, yeah?
@alexdunnjpl correct. I don't think this is really a registry-mgr or harvest thing to date. They both work with LDDs and try to strongly type things in the schema, but neither deals specifically with manipulating or reviewing the schema after the fact.
Created NASA-PDS/registry-sweepers#128 and NASA-PDS/registry-sweepers#129. Closing this one.