jordanpadams opened 1 month ago
Confirmed that the document is in rms-registry and contains the relevant key and value.
Confirmed that the key is missing from rms-registry_mapping (in fact, there is no mapping for any attribute referencing "cassini").
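The check above can be sketched as follows: given the "properties" section of a GET rms-registry/_mapping response, list the attributes with no mapping entry. The sample properties below are invented, not the real registry schema.

```python
def missing_attrs(mapping_properties, fields):
    """Return the subset of fields that have no entry in the index mapping."""
    return [f for f in fields if f not in mapping_properties]

# Invented sample, shaped like the relevant part of a _mapping response:
props = {"lid": {"type": "keyword"}, "title": {"type": "text"}}
missing = missing_attrs(
    props, ["lid", "cassini:ISS_Specific_Attributes.cassini:image_number"]
)
```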
Harvest date/time is 2022-06-28T23:09:48.274461Z, so my soft assumption is that this is the result of a bug or missing feature in harvest which has since been implemented.
I would suggest re-harvesting that product and re-testing to confirm that the expected entries are added to the index mappings.
@jordanpadams this will be pretty delicate and (computationally) expensive to fix with repairkit if it isn't a fairly isolated issue, because it requires either non-noop updates to the relevant fields or deletion/reinsertion, once the mapping entries are added. The cleanest way to do it would probably be for repairkit to
This should be idempotent and avoid any potential for data loss, and could be run from a local env to avoid blowing out the cloud-sweeper task runtime.
What's the source for the mapping types? That DD you pointed me to a little while back?
After a little searching, it looks like there may be a slightly easier solution: ES/OS documents are immutable, so any meaningful (non-noop) update to a document triggers a re-index of the entire document.
Ergo, it should be sufficient to add all missing properties to the index, and then write a metadata flag value (showing that the document has been checked) to all unchecked documents, with no need to play around with temporary indices.
EDIT: Yep, this is the case, tested and confirmed.
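A minimal sketch of that flow, building the two request bodies. The flag field name is a placeholder, not the sweeper's real metadata key, and the bodies would be sent via opensearch-py's `indices.put_mapping` and `update` calls as noted in the comments.

```python
from datetime import datetime, timezone

SWEEP_FLAG = "ops:Provenance/ops:registry_sweepers_reindexed_at"  # placeholder name

def put_mapping_body(missing_types):
    """Body for PUT <index>/_mapping: adds the missing property mappings."""
    return {"properties": {field: {"type": t} for field, t in missing_types.items()}}

def flag_update_body():
    """Non-noop doc update: writing the flag forces a full re-index of the
    document, which picks up the newly added mappings."""
    return {"doc": {SWEEP_FLAG: datetime.now(timezone.utc).isoformat()}}

mapping_body = put_mapping_body(
    {"cassini:ISS_Specific_Attributes.cassini:image_number": "keyword"}
)
# e.g. client.indices.put_mapping(index="rms-registry", body=mapping_body)
# then, per unchecked doc: client.update(index="rms-registry", id=doc_id,
#                                        body=flag_update_body())
```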
Going well! Details? See above.
Implemented in index-mapping-repair, with the exception of two wrinkles:
resolution of the missing property mapping typename is TBD (@jordanpadams, please weigh in on that)
@jordanpadams @tloubrieu-jpl the sweeper queries twice - once to generate the set of missing mappings, then again to generate/write the doc updates once the mappings have been ensured. These two queries need to return consistent results, otherwise an old version of harvest could write new documents in the middle of a sweep which would get picked up in the second stage but not the first.
In that (theoretically-possible but shockingly-unlikely) event, those documents would erroneously be marked as fixed and excluded from future sweeps, and the only way to detect them would be to manually run the sweeper with the redundant-work filter disabled. Pick an option, in increasing order of rigor:
1. The likelihood of someone running an obsolete version of harvest at exactly the wrong time is functionally zero; don't guard against it.
2. Instead of filtering to "documents which haven't been swept before", apply an additional constraint of "harvest time is earlier than sweeper execution start".
3. Use a point-in-time search.
Option 3 is the most correct, but may not be compatible with our dockerized registry, so I'd prefer to go with option 2, or option 1 if you're absolutely sure no one will run a pre-2023 version of harvest at just the wrong time.
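For reference, option 2 is roughly this query-DSL shape. Both field names here are placeholders standing in for the registry's real harvest-timestamp and sweep-flag fields.

```python
from datetime import datetime, timezone

HARVEST_TS = "ops:Harvest_Info/ops:harvest_date_time"             # placeholder
SWEEP_FLAG = "ops:Provenance/ops:registry_sweepers_reindexed_at"  # placeholder

def unswept_docs_query(sweep_start_iso):
    """Match docs that are unswept AND were harvested before this sweep began,
    so a mid-sweep write by an old harvest can't land in query 2 but not query 1."""
    return {
        "query": {
            "bool": {
                "must_not": [{"exists": {"field": SWEEP_FLAG}}],
                "filter": [{"range": {HARVEST_TS: {"lt": sweep_start_iso}}}],
            }
        }
    }

q = unswept_docs_query(datetime.now(timezone.utc).isoformat())
```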
Resolve missing types by cracking open the doc's blob, extracting the DD URL, and reading it.
Cache downloaded DDs, cache cracked blobs, and avoid cracking for mappings which have already been resolved by the sweeper.
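The caching could look something like this sketch; `fetch` stands in for the real download-and-parse step, which isn't shown.

```python
class DDCache:
    """Memoizes data-dictionary lookups so each DD URL is fetched at most once."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: dd_url -> parsed DD (real impl not shown)
        self._cache = {}
        self.fetch_count = 0  # instrumentation for illustration only

    def get(self, dd_url):
        if dd_url not in self._cache:
            self.fetch_count += 1
            self._cache[dd_url] = self._fetch(dd_url)
        return self._cache[dd_url]

cache = DDCache(fetch=lambda url: {"source": url})
cache.get("https://example.test/pds_dd.json")
cache.get("https://example.test/pds_dd.json")  # second call hits the cache
```

The same pattern applies one level up for cracked blobs, keyed by document id.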
Per @jordanpadams, log the earliest/latest harvest timestamps for affected files, and a unique list of harvest versions. Pull these from the docs themselves.
Per @jordanpadams:
> I think you can use the -dd indexes in the registry for tracking down these classes/attributes.
Status: implemented, in review. Per @jordanpadams, review/live-test is postponed until next week, after the current site demos.
Status: testing against MCP rms in progress.
The initial run ran for 40 min and got about halfway through before AOSS throttled it. I doubt this is something we care to address if sweeper initialization is the only thing that hits whatever limits are imposed.
@jordanpadams @tloubrieu-jpl ~the query referred to in the OP now successfully hits on a single document.~ 1830h EDIT: ~Well, it did... it doesn't appear to now. I'll need to investigate this further.~ EDIT 2: aaand it's working again. Probably just some weird reindexing churn going on.
Once it checks out, want me to run it against all the other nodes, and include all the sweepers (not just the reindexer)?
EDIT: Sweeper is exhibiting the same result-skipping behaviour as repairkit, which I should've seen coming. I'll implement the same fix as was applied there.
For rms, problems were detected for harvest version 3.8.1, and harvest timestamps 2022-06-28 through 2024-03-13.
Logs were long due to many documents not having harvest versions and throwing warnings, so I haven't sent them through - @jordanpadams let me know if you'd like me to strip those out and send them.
Confirmed with en that a single run is sufficient to reindex all documents. Currently running manually against all nodes, storing logs for later analysis.
Status: running against large indices appears to overload them. Need to figure out a way to consistently page through the documents.
Given the way it works:
- the workload can be chunked without issue if the first and second queries can be guaranteed to return the same result set (this means PIT, most likely, or sorting by harvest timestamp if that's infeasible), since any product which has yet to be processed will eventually be updated/reindexed, and any product which is updated/reindexed is guaranteed to have had its appropriate mappings created already.
- if that is difficult or impossible, a naive approach which pages blindly could work iff the update-generation step also checks that the mapping is present, yielding no update while a mapping is still missing at update-creation time.
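A sketch of the chunked approach under the first assumption, paging with `sort` + `search_after` on harvest timestamp (the field name is a placeholder); `search` stands in for the real client call.

```python
def iter_docs(search, index, page_size=500):
    """Yield every hit, paging via search_after on a stable sort key
    (harvest timestamp, tie-broken by _id) so pages don't skip or repeat."""
    sort = [{"ops:Harvest_Info/ops:harvest_date_time": "asc"},  # placeholder field
            {"_id": "asc"}]
    after = None
    while True:
        body = {"size": page_size, "sort": sort}
        if after is not None:
            body["search_after"] = after
        hits = search(index=index, body=body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        after = hits[-1]["sort"]

# Demo with a canned fake in place of the real search client:
pages = iter([
    [{"_id": "a", "sort": [1, "a"]}, {"_id": "b", "sort": [2, "b"]}],
    [{"_id": "c", "sort": [3, "c"]}],
    [],
])
fake_search = lambda index, body: {"hits": {"hits": next(pages)}}
ids = [h["_id"] for h in iter_docs(fake_search, "rms-registry", page_size=2)]
```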
PIT search is only available as of OpenSearch 2.4, and while AWS OpenSearch Service supports OpenSearch 2.15, AWS Serverless Collections currently uses OpenSearch 2.0:
> Serverless collections currently run OpenSearch version 2.0.x. As new versions are released, OpenSearch Serverless will automatically upgrade your collections to consume new features, bug fixes, and performance improvements.

so point-in-time search is not available to us at this point and a stopgap solution must be implemented.
Checked for duplicates
Yes - I've already checked
Describe the bug
When I try to search by
cassini:ISS_Specific_Attributes.cassini:image_number
via the API, I get no results when I should get many.
Expected behavior
I expected to be able to search by this field and get a result
To Reproduce
https://pds.mcp.nasa.gov/api/search/1/products?q=(cassini:ISS_Specific_Attributes.cassini:image_number%20eq%20%221454725799%22) should return 1 result: https://pds-rings.seti.org/pds4/bundles/cassini_iss_saturn//data_raw/14547xxxxx/1454725799n.xml
Same thing in Kibana Discover, no go.
Environment Info
Version of Software Used
No response
Test Data / Additional context
No response
Related requirements
NASA-PDS/registry-api#539
Engineering Details
I am concerned more broadly that attributes throughout the system are randomly unsearchable because harvest is not, or was not, properly creating fields in the schema before loading them into the index. Not sure how we can scrub this, but a sweeper may be necessary to scan for and fix this on an ongoing basis.
Integration & Test
No response