NASA-PDS / registry-api

Web API service for the PDS Registry, providing the implementation of the PDS Search API (https://github.com/nasa-pds/pds-api) for the PDS Registry.
https://nasa-pds.github.io/pds-api

Investigate sporadic 500 and 504 errors with registry API #431

Open jordanpadams opened 2 months ago

jordanpadams commented 2 months ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I tried TBD queries, I was getting sporadic 504 errors.

[Screenshot: 2024-04-11 at 9:13:45 AM]

🕵️ Expected behavior

I expected the API to respond successfully rather than time out.

📜 To Reproduce

TBD queries

See CloudFront and/or Registry logs.

🖥 Environment Info

latest deployed

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

No response

⚙️ Engineering Details

No response

tloubrieu-jpl commented 2 months ago

Here are the 100 requests that came to the API when the instability was noticed: 7a541e10-8113-44f6-9bd1-f58951c81317.csv

The first errors came up after the limit parameter was set to 10000. That value can work, and did work in later tests, but it is above what we would expect (a few hundred).

The issue might also be related to simultaneous activities on the registry (e.g. sweepers).

jordanpadams commented 3 weeks ago

A similar issue was identified by @anilnatha when attempting the following query:

https://pds.nasa.gov/api/search/1/products?q=(product_class eq "Product_Context" and lid like "urn:nasa:pds:context:instrument_host:*")&limit=9999&fields=lid,vid,pds:Instrument_Host.pds:description,pds:Instrument_Host.pds:name,pds:Instrument_Host.pds:type
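
For reference, a minimal sketch of that same request (assuming Python 3 with the requests library; not part of the original report) that could be used to reproduce the failure and log the status code:

```python
# Sketch: issue the query above and log the HTTP status and elapsed time.
# Assumption: Python 3 with the requests library installed.
import requests

BASE_URL = "https://pds.nasa.gov/api/search/1/products"
params = {
    "q": '(product_class eq "Product_Context" and lid like "urn:nasa:pds:context:instrument_host:*")',
    "limit": 9999,
    "fields": "lid,vid,pds:Instrument_Host.pds:description,"
              "pds:Instrument_Host.pds:name,pds:Instrument_Host.pds:type",
}

resp = requests.get(BASE_URL, params=params, timeout=120)
print(resp.status_code, resp.elapsed)
```
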
jordanpadams commented 2 weeks ago

@tloubrieu-jpl @alexdunnjpl for this issue, if we know the page size is too large, is there any way we can improve the error messaging that comes with these errors? Or is this a server timeout thing?

alexdunnjpl commented 2 weeks ago

@jordanpadams server timeout if it's what I think it is - basically the request is valid and reasonable prima facie, but then the volume (size, not count) of the data ends up taking too long to serve so the server calls it quits.

This could be confirmed by repeating the queries while limiting the requested fields to lidvid, so that no large data volume is served. That assumes the issues aren't so sporadic that you can't convince yourself it's working without issue after a reasonable period of testing.
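
Something along these lines would do it (a sketch, assuming Python 3 with the requests library, and that lidvid is accepted by the fields parameter):

```python
# Sketch of the diagnostic described above: repeat the query with only the
# lidvid field requested so that very little data is returned per product.
# Assumptions: Python 3 with the requests library; "lidvid" is a valid value
# for the fields parameter.
import requests

BASE_URL = "https://pds.nasa.gov/api/search/1/products"
params = {
    "q": '(product_class eq "Product_Context" and lid like "urn:nasa:pds:context:instrument_host:*")',
    "limit": 9999,
    "fields": "lidvid",
}

for attempt in range(10):
    resp = requests.get(BASE_URL, params=params, timeout=120)
    print(f"attempt {attempt}: status={resp.status_code} elapsed={resp.elapsed}")
```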

jordanpadams commented 2 weeks ago

@alexdunnjpl copy. Is there any way we can do something smart on the API side, e.g. keep the connection alive with the server?

alexdunnjpl commented 2 weeks ago

@jordanpadams I wouldn't think so (without a box full of bandaids or increasing timeout thresholds) - from my perspective it's on the client to respond appropriately to a 504 (by, for example, retrying with a smaller request)
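
As a rough illustration of what that client-side handling could look like (a sketch, assuming Python 3 with the requests library; the helper name and the limit-halving strategy are just for illustration, not part of the registry API):

```python
# Sketch of client-side handling of a 504: retry with a progressively smaller
# limit. Assumptions: Python 3 with the requests library; fetch_with_backoff
# is a hypothetical helper, not something the registry API provides.
import requests

BASE_URL = "https://pds.nasa.gov/api/search/1/products"

def fetch_with_backoff(q, fields, limit=9999, min_limit=100):
    """Issue the query, halving the limit whenever the server answers 504."""
    while limit >= min_limit:
        resp = requests.get(
            BASE_URL,
            params={"q": q, "fields": fields, "limit": limit},
            timeout=120,
        )
        if resp.status_code != 504:
            resp.raise_for_status()
            return resp.json()
        limit //= 2  # response took too long; ask for fewer products
        # (the remaining products would then have to be fetched in later pages)
    raise RuntimeError("query kept timing out even with a small limit")
```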

I haven't dug into what's going on here, though - I'm making some assumptions.

jordanpadams commented 1 week ago

Thanks @alexdunnjpl, we will dig a bit further in the future to see what we can do. The problem here is that with a web browser or curl as the client, there is nothing to be done in terms of responding to this appropriately. We may just need to update the documentation to explicitly call out these errors.