NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0
0 stars 1 forks source link

`superseded_by` fields with `null` values causing multiple versions of same products to appear in API searches #112

Closed jordanpadams closed 1 month ago

jordanpadams commented 3 months ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

https://pds.nasa.gov/api/search/1/products?q=(lid%20eq%20%22urn:nasa:pds:lro_lola_edr:data_raw%22) 2 records appear.

🕵️ Expected behavior

I expected 1 product to be returned

📜 To Reproduce

https://pds.nasa.gov/api/search/1/products?q=(lid%20eq%20%22urn:nasa:pds:lro_lola_edr:data_raw%22)

"summary": {
  "q": "(lid eq \"urn:nasa:pds:lro_lola_edr:data_raw\")",
  "hits": 2,
  "took": 90,
  "search_after": [],
  "limit": 100,
  "sort": [],
  "properties": [
  ...
  "lidvid": [
    "urn:nasa:pds:lro_lola_edr:data_raw::4.0"
  ],
  ...
  "ops:Provenance.ops:superseded_by": [
    "null"
  ],
  ..
  "lidvid": [
    "urn:nasa:pds:lro_lola_edr:data_raw::5.0"
  ],
  ...

🖥 Environment Info

Chrome

📚 Version of Software Used

Latest

🩺 Test Data / Additional context

No response

🦄 Related requirements

No response

⚙️ Engineering Details

No response

alexdunnjpl commented 2 months ago

@jordanpadams I'll investigate shortly, but is it plausible that the two documents in question exist on different nodes and that's why they lack correct provenance history?

jordanpadams commented 2 months ago

@alexdunnjpl hmmm. that would be interesting. they both say GEO, but I wonder if we loaded that data as a test case into the EN registry?

alexdunnjpl commented 2 months ago

I'll dig into that now. If it ends up being the case that they're in two nodes,

alexdunnjpl commented 2 months ago

@jordanpadams if I'm interpreting the swagger docs correctly, I don't think the request/expectation in the original ticket is even valid.

latest/all subroutes are only defined for /products/{identifier}/{all | latest}. In that case, it's reasonable to expect that no latest/all specification yields only the latest product.

But why would there be any expectation that /products/?q=somequery would only return the latest version of the LIDs of the full set of products resulting from that query?

  1. Intuitively, the result of a query should be that you get what you ask for
  2. Even if 1) isn't persuasive, there's no way to specify /products/all?q=somequery because all would be parsed as an identifier. And it seems implausible that there should be no way to use a custom query to get all the products you want.

EDIT: Having said all that, there are actually four versions in the registry - 1.0, 2.0, 4.0, 5.0, so it looks as though the API may indeed be filtering to superseded_by=null in that query

Looking for the v5.0 doc, it's in geo just like the others, but it doesn't have any sweepers metadata set.

I'm running a local instance of sweepers on it - my suspicion is that GEO hasn't been running sweepers, based on what I'm seeing.

jordanpadams commented 2 months ago

But why would there be any expectation that /products/?q=somequery would only return the latest version of the LIDs of the full set of products resulting from that query?

It is a requirement of the system that only the latest versions of products be returned, by default, unless otherwise stated.

https://github.com/NASA-PDS/registry-api/issues/428 https://github.com/NASA-PDS/registry-api/issues/426

Moving the rest of the discussion to NASA-PDS/registry-api#428 since I think it is more deservedly had there.

alexdunnjpl commented 2 months ago

Task definition is v5 ARNarn:aws:ecs:us-west-2:445837347542:task-definition/pds-geo-prod-registry-sweepers-task:5 despite latest version of that task being v9 arn:aws:ecs:us-west-2:445837347542:task-definition/pds-geo-prod-registry-sweepers-task:9

This may explain #123 as well, if it is also lagging behind in task-definition versions and ending up using an out-of-date image as a result.

@sjoshi-jpl is there any reason why the schedulers shouldn't be targeting the latest version of the relevant task definition?

alexdunnjpl commented 1 month ago

@sjoshi-jpl re-pinging for the question above. "This doesn't matter because we're moving to MCP" is an acceptable answer, too.

tloubrieu-jpl commented 1 month ago

If we use the latest version of the reistry-sweeper this should not happend again.

superseeded_by: null should be only for the latest product.

alexdunnjpl commented 1 month ago

N.B. the root cause here is that sweepers has not run successfully since the latest product was ingested (probably due to the timeout issue).

It's not a bug, but may resurface.