NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information

Unable to search for `cassini` LDD attributes in ISS datasets #148

Open jordanpadams opened 1 month ago

jordanpadams commented 1 month ago

Checked for duplicates

Yes - I've already checked

πŸ› Describe the bug

When I try to search by `cassini:ISS_Specific_Attributes.cassini:image_number` via the API, I get no results when I should get many.

πŸ•΅οΈ Expected behavior

I expected to be able to search by this field and get a result

πŸ“œ To Reproduce

https://pds.mcp.nasa.gov/api/search/1/products?q=(cassini:ISS_Specific_Attributes.cassini:image_number%20eq%20%221454725799%22) should return 1 result: https://pds-rings.seti.org/pds4/bundles/cassini_iss_saturn//data_raw/14547xxxxx/1454725799n.xml

The same query in Kibana Discover gets no results either.

πŸ–₯ Environment Info

πŸ“š Version of Software Used

No response

🩺 Test Data / Additional context

No response

πŸ¦„ Related requirements

πŸ¦„ NASA-PDS/registry-api#539

βš™οΈ Engineering Details

I am concerned, more broadly, that attributes throughout the system are randomly unsearchable because harvest is not, or was not, properly creating fields in the schema prior to loading them into the index. I am not sure how we can scrub this, but a sweeper may be necessary to periodically scan for and fix these missing mappings.

πŸŽ‰ Integration & Test

No response

alexdunnjpl commented 1 month ago

Confirmed that the document is in rms-registry and contains the relevant key and value.

Confirmed that the key is missing from the rms-registry `_mapping` (and in fact there is no mapping for any attribute referencing "cassini").
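
For reference, these two checks can be sketched with opensearch-py roughly as follows (the index name, lidvid, and connection details are illustrative, not the actual values used):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("user", "pass"))

INDEX = "registry"  # illustrative; the rms node's registry index
LIDVID = "urn:nasa:pds:cassini_iss_saturn:data_raw:1454725799n::1.0"  # illustrative id

# 1. The document exists and its _source carries the cassini key/value...
doc = client.get(index=INDEX, id=LIDVID)
print("cassini" in str(doc["_source"]))  # expected: True

# 2. ...but the index mapping has no property for any cassini attribute.
mapping = client.indices.get_mapping(index=INDEX)
properties = mapping[INDEX]["mappings"]["properties"]
print(any("cassini" in name for name in properties))  # observed: False
```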

The harvest date/time is 2022-06-28T23:09:48.274461Z, so my soft assumption is that this is the result of a bug or missing feature in harvest that has since been addressed.

I would suggest re-harvesting that product and re-testing to confirm that the expected entries are added to the index mappings.

@jordanpadams this will be pretty delicate and (computationally) expensive to fix with repairkit if it isn't a fairly isolated issue, because it requires either non-noop updates to the relevant fields or deletion/reinsertion, once the mapping entries are added. The cleanest way to do it would probably be for repairkit to

This should be idempotent and avoid any potential for data loss, and could be run from a local env to avoid blowing out the cloud-sweeper task runtime.

What's the source for the mapping types? That DD you pointed me to a little while back?

alexdunnjpl commented 1 month ago

After a little searching, it looks like there may be a slightly easier solution: apparently ES/OS documents are immutable, so any meaningful (non-noop) update to a document triggers a re-index of the entire document.

Ergo, it should be sufficient to add all missing properties to the index mapping and then write a metadata flag value (marking the document as checked) to every unchecked document, with no need to play around with temporary indices.

EDIT: Yep, this is the case, tested and confirmed.
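
As a rough sketch of that approach with opensearch-py (the flag field name, mapping type, and field-name convention below are assumptions, not the sweeper's actual choices):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("user", "pass"))
INDEX = "registry"  # illustrative
FLAG = "ops:Provenance/ops:reindexed"  # assumed flag field; the real sweeper may differ

# 1. Ensure the missing property mappings exist (type here is a placeholder;
#    assumption: the index stores LDD attributes with '/'-separated names).
missing = {"cassini:ISS_Specific_Attributes/cassini:image_number": {"type": "keyword"}}
client.indices.put_mapping(index=INDEX, body={"properties": missing})

# 2. Write the flag to every unchecked document; since the update is non-noop,
#    each touched document is re-indexed and picks up the new mappings.
client.update_by_query(
    index=INDEX,
    body={
        "query": {"bool": {"must_not": {"exists": {"field": FLAG}}}},
        "script": {"source": f"ctx._source['{FLAG}'] = true", "lang": "painless"},
    },
)
```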

nutjob4life commented 1 month ago

Going well! πŸŽ‰ Details? See above ↑

alexdunnjpl commented 2 weeks ago

Implemented in index-mapping-repair, with two wrinkles:

1. Resolution of the missing property mapping typenames is TBD (@jordanpadams, please weigh in on that).

2. @jordanpadams @tloubrieu-jpl: the sweeper queries twice - once to generate the set of missing mappings, then again to generate/write the doc updates once the mappings have been ensured. These two queries need to return consistent results; otherwise an old version of harvest could write new documents in the middle of a sweep, which would get picked up in the second stage but not the first.

In that (theoretically-possible but shockingly-unlikely) event, those documents would erroneously be marked as fixed and excluded from future sweeps, and the only way to detect them would be to manually run the sweeper with the redundant-work filter disabled. Pick an option, in increasing order of rigor:

  1. The likelihood of someone running an obsolete version of harvest at exactly the wrong time is functionally zero - don't guard against it.

  2. Instead of filtering to "documents which haven't been swept before", apply an additional constraint of "harvest time is earlier than sweeper execution start".

  3. Use a point-in-time search.

3 is the most correct option, but it may not be compatible with our dockerized registry, so I'd prefer to go with 2, or with 1 if you're absolutely sure no one will run a pre-2023 version of harvest at just the wrong time.
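
For illustration, option 2 amounts to adding a single range clause to the first-stage query - roughly like the sketch below, where the field names are assumptions based on the registry's harvest metadata:

```python
from datetime import datetime, timezone

sweep_start = datetime.now(timezone.utc).isoformat()

query = {
    "bool": {
        # only documents not yet flagged by a previous sweep (assumed flag field)
        "must_not": {"exists": {"field": "ops:Provenance/ops:reindexed"}},
        # ...and only documents harvested before this sweep began
        "filter": {
            "range": {"ops:Harvest_Info/ops:harvest_date_time": {"lt": sweep_start}}
        },
    }
}
```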

alexdunnjpl commented 2 weeks ago

Resolve missing types by cracking open the doc's blob, extracting the DD URL, and reading it.

Cache downloaded DDs, cache cracked blobs, and avoid cracking for mappings which have already been resolved by the sweeper.
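
A rough sketch of that caching strategy (illustrative only - the blob encoding assumed here may not match the registry's actual format):

```python
import base64
import functools
import gzip

import requests

@functools.lru_cache(maxsize=None)
def fetch_dd(dd_url: str) -> str:
    """Download each referenced data dictionary at most once per sweep."""
    resp = requests.get(dd_url, timeout=30)
    resp.raise_for_status()
    return resp.text

_cracked: dict[str, str] = {}  # lidvid -> decoded label, so each blob is cracked only once

def crack_blob(lidvid: str, raw_blob: str) -> str:
    """Decode a document's label blob, caching the result by lidvid."""
    if lidvid not in _cracked:
        # assumption: blobs are gzip-compressed, base64-encoded label XML
        _cracked[lidvid] = gzip.decompress(base64.b64decode(raw_blob)).decode()
    return _cracked[lidvid]
```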


alexdunnjpl commented 2 weeks ago

Per @jordanpadams: log the earliest/latest harvest timestamps for affected files, and a unique list of harvest versions. Pull these from the docs themselves.
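
A rough sketch of pulling those values with a single aggregation query (the `ops:Harvest_Info` field names and the filter are assumptions, not necessarily what the sweeper uses):

```python
body = {
    "size": 0,
    # restrict to the affected documents; the exact filter here is a placeholder
    "query": {"bool": {"must_not": {"exists": {"field": "ops:Provenance/ops:reindexed"}}}},
    "aggs": {
        "earliest_harvest": {"min": {"field": "ops:Harvest_Info/ops:harvest_date_time"}},
        "latest_harvest": {"max": {"field": "ops:Harvest_Info/ops:harvest_date_time"}},
        "harvest_versions": {"terms": {"field": "ops:Harvest_Info/ops:harvest_version", "size": 100}},
    },
}
# result = client.search(index=INDEX, body=body)
```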

alexdunnjpl commented 2 weeks ago

Per @jordanpadams:

> I think you can use the -dd indexes in the registry for tracking down these classes/attributes.

alexdunnjpl commented 1 week ago

Status: implemented, in review. Per @jordanpadams, review/live-test is postponed until next week, after the current site demos.

alexdunnjpl commented 1 week ago

Status: testing against MCP rms is in progress.

The initial run went about 40 minutes and got roughly halfway through, then AOSS throttled. I doubt this is something we care to address if sweeper initialization is the only thing that hits whatever limits are imposed.

@jordanpadams @tloubrieu-jpl ~~the query referred to in the OP now successfully hits on a single document.~~ 1830h EDIT: ~~Well, it did... it doesn't appear to now. I'll need to investigate this further.~~ EDIT 2: aaand it's working again. Probably just some weird reindexing stuff going on.

Once it checks out, want me to run it against all the other nodes, and include all the sweepers (not just the reindexer)?

EDIT: Sweeper is exhibiting the same result-skipping behaviour as repairkit, which I should've seen coming. I'll implement the same fix as was applied there.

alexdunnjpl commented 1 week ago

For rms, problems were detected for harvest version 3.8.1, and harvest timestamps 2022-06-28 through 2024-03-13.

Logs were long due to many documents not having harvest versions and throwing warnings, so I haven't sent them through - @jordanpadams let me know if you'd like me to strip those out and send them.

alexdunnjpl commented 6 days ago

Confirmed with en that a single run is sufficient to reindex all documents. Currently running manually against all nodes, storing logs for later analysis.

alexdunnjpl commented 5 days ago

Status: running against large indices appears to overload those indices. Need to figure out a way to consistently page through the documents.

Given the way it works:

alexdunnjpl commented 4 days ago

PIT search is only available as of OpenSearch 2.4, and while AWS OpenSearch Service supports up to OpenSearch 2.15, AWS OpenSearch Serverless collections currently use OpenSearch 2.0:

> Serverless collections currently run OpenSearch version 2.0.x. As new versions are released, OpenSearch Serverless will automatically upgrade your collections to consume new features, bug fixes, and performance improvements.

So point-in-time search is not available to us at this point, and a stopgap solution must be implemented.
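
One possible stopgap (not necessarily what will be adopted) is to page with `search_after` on a stable sort key instead of opening a PIT:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("user", "pass"))
INDEX = "registry"  # illustrative

search_after = None
while True:
    body = {
        "size": 1000,
        "sort": [{"lidvid": "asc"}],  # assumption: a keyword field usable as a stable sort key
        "query": {"match_all": {}},
    }
    if search_after is not None:
        body["search_after"] = search_after
    page = client.search(index=INDEX, body=body)
    hits = page["hits"]["hits"]
    if not hits:
        break
    # ... process hits ...
    search_after = hits[-1]["sort"]
```

This keeps paging cheap for large indices, but unlike PIT it does not give a fully consistent snapshot under concurrent writes, which is part of why it would only be a stopgap.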