NASA-PDS / registry-api

Web API service for the PDS Registry, providing the implementation of the PDS Search API (https://github.com/nasa-pds/pds-api) for the PDS Registry.
https://nasa-pds.github.io/pds-api
Apache License 2.0
2 stars 5 forks source link

API falsely reports 10000 hits for hits>10000 #343

Closed jordanpadams closed 1 year ago

jordanpadams commented 1 year ago

Checked for duplicates

Yes

🐛 Describe the bug

When I did https://pds.nasa.gov/api/search/1/products?limit=100&q=product_class%20eq%20%22Product_Observational%22&start=0 I noticed hits = 10000

🕵️ Expected behavior

I expected 1000000000 per https://github.com/NASA-PDS/registry-mgr/blob/main/src/main/resources/elastic/registry.json#LL5C36-L5C46

📜 To Reproduce

curl -X 'GET' \
  'https://pds.nasa.gov/api/search/1/products?limit=100&q=product_class%20eq%20%22Product_Observational%22&start=0' \ 
   json_pp | grep 'hits'

🖥 Environment Info

API v1.2

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

No response

⚙️ Engineering Details

No response

jordanpadams commented 1 year ago

@tloubrieu-jpl @jimmie @sjoshi-jpl @alexdunnjpl I will setup a working meeting to debug this. this bug has reoccurred a few times now so I want to make sure we are covering all the bases.

alexdunnjpl commented 1 year ago

@jordanpadams N.B. max_result_window refers to the maximal page size for a single request, not the max quantity of results returned by a query.

I have a meeting now, but a cursory search shows that it may just be the hits count (not the actual results set) which is limited, in which case it may just be necessary to add track_total_hits to the responses sent out by the API. Relevant discussion

This is supported by testing with start=10000, which returns a full page of results (which it shouldn't, if the query is limited to 10000)

alexdunnjpl commented 1 year ago

Confirmed that appending track_total_hits=true to OpenSearch query results in correct count. registry-api is dropping the relation field (which is "gte" in this case) returned from OpenSearch.

Quick fix is to append this. If performance becomes an issue (it shouldn't), we can address that later by returning the relation object rather than the fully-resolved count.

@jordanpadams could you please re-triage?

jordanpadams commented 1 year ago

@alexdunnjpl triage complete.

jordanpadams commented 1 year ago

@alexdunnjpl is there a code update associated with this ticket coming?

alexdunnjpl commented 1 year ago

@jordanpadams yes, eventually (or sooner, if you want to bump the priority back up)

jordanpadams commented 1 year ago

copy thanks @alexdunnjpl we will keep it in the backlog until you complete your current task and we can re-evaluate

al-niessner commented 1 year ago

@alexdunnjpl @jordanpadams

Nothing to be done here unless we are going to be happy with 10+ instead of 11 for total hits. The relation 'gte' is nifty if we could have state as we could dump the current results and move to a scroll window which we would be using already and avoid the problem outright. The Java object for total hits is Integer so 10+ really is not going to work there either. I guess, just nothing to be done really.

jordanpadams commented 1 year ago

@al-niessner per @alexdunnjpl, can we update the API to us track_total_hits=true. Also, why does https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated::11.0/members?start=0&limit=500 return the correct value for hits but the query above does not?

al-niessner commented 1 year ago

@jordanpadams @alexdunnjpl

Sure you can add another field in the summary to indicate that there are more hits that you can never fetch. The reason you get the 10000 is that it is some max_something_or_other opensearch server configuration value. Some of our PDS units are set to 2**32-1 or some other ridiculously large value. Others are the default 10000. Should be able to search the old tickets in this repo to find the exact parameter. Something about max and window. Anyway, it is all a server configuration problem.

jordanpadams commented 1 year ago

@al-niessner that configuration was changed, this track_total_hits=true thing is different per some tests that @alexdunnjpl performed.

alexdunnjpl commented 1 year ago

@al-niessner here's the qparam in question

https://opensearch.org/docs/1.2/opensearch/rest-api/search/#url-parameters (see: track_total_hits)

By default, opensearch will not return an exact value for hits, presumably as this takes some nontrivial amount of extra time than not returning the exact value, when paging hits.

More relevant discussion/context from earlier in this ticket's chatter

al-niessner commented 1 year ago

@alexdunnjpl @jordanpadams

We need to discuss this one because my experience with opensearch is exactly the opposite. For instance, if I do this search https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated::11.0/members?limit=2 it gives me a total hits of 1169346. It seems that it is accurately counting them out despite not specifying the URI parameter - unless you have reason to believe that there are more of them. I mean if you are doing efficiency game, why not give up on 2 and save yourself the other 1169344 hits?

Also, the 10000 is just too coincidental to being the default window size for me to not to be Missourian - show me state. Not disparaging you, but reverting back to my Hume readings that eyewitness testimony is the weakest form of evidence. Apparently Hume was from Missouri long before its founding.

Now, maybe when a cluster or something has a node go missing and opensearch says I found what I found but there could be more on the missing node(s) then that would be nice. However, is that not a an error rather than obscure flag int eh summaary?

Lastly, would it not be simpler to just the URI parameter to true always rather than pass something back to the user? The parameter you named is an input to opensearch and not an output.

Yeah, breakout for this ticket.

jordanpadams commented 1 year ago

@al-niessner I believe @alexdunnjpl was thinking we just include track_total_hits in our queries to OpenSearch under the hood, by default, all the time. The users have no idea we are using that flag, it is just something we know we need to make these hits accurate.

al-niessner commented 1 year ago

Double checking the URL in the first comment. Changed start to 10001 and left limit alone:

curl 'https://pds.nasa.gov/api/search/1/products?limit=1000&q=product_class%20eq%20%22Product_Observational%22&start=10001'

It returned a non-error which indicates that the window is big validating @alexdunnjpl statements:

{
  "summary": {
    "q": "product_class eq \"Product_Observational\"",
    "hits": 10000,
    "took": 21511,
    "start": 10001,
    "limit": 1000,
    "sort": [],

@jordanpadams Sorry, I read adding it to search as meaning adding it to the summary we return to the user. Should be simple to add to all queries.

alexdunnjpl commented 1 year ago

If further validation is required, just take the API out of the loop and curl the OpenSearch directly, with and without the track_total_hits flag - that's the final-confirmation basis for my claim.

jordanpadams commented 1 year ago

@al-niessner

Sorry, I read adding it to search as meaning adding it to the summary we return to the user. Should be simple to add to all queries.

🎉