Closed bengarrett33 closed 3 months ago
Thanks for trying out HSDS!
The 503 responses aren't failures as such - it's just the server telling you to lower the rate of requests and retry the given request after a bit of delay.
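A minimal sketch of the "retry after a bit of delay" approach, using exponential backoff with jitter. The helper name call_with_retry and its parameters are hypothetical, and the default is_busy check assumes the client surfaces a 503 as an OSError whose errno is 503; adjust that predicate to whatever your client actually raises.

```python
import random
import time


def call_with_retry(request, max_attempts=5, base_delay=0.1, max_delay=5.0,
                    is_busy=lambda exc: getattr(exc, "errno", None) == 503,
                    sleep=time.sleep):
    """Call request() and retry with exponential backoff while the server is busy.

    Assumption: a 503 reaches the caller as an OSError with errno == 503.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except OSError as exc:
            # Re-raise anything that is not a 503, or the last failed attempt.
            if not is_busy(exc) or attempt == max_attempts - 1:
                raise
            # Double the delay on each attempt, up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel clients from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

Wrapping each read, for example call_with_retry(lambda: dset[0:100]), leaves the rest of the application unchanged. The delays here are illustrative values, not tuned recommendations.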
Background: Each HSDS container keeps track of the number of in-flight requests. If that number exceeds a certain limit (100 by default), the container rejects further requests with a 503 response. The idea is to avoid overtaxing the server or having the container run out of memory.
The effective limit depends greatly on the type of load coming in (basically how memory- or CPU-intensive the requests are on average). You can change the default by creating a file, hsds/admin/config/override.yml, with the following line: "max_task_count: 999", where 999 is whatever you'd like max_task_count to be. Restart the server for the change to take effect.
If you see the containers regularly hitting 100% CPU, or restarting because of out-of-memory errors, you've probably set max_task_count too high and you'll want to scale it back a bit.
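Since a container starts returning 503s once its in-flight count passes the limit, another option is to cap concurrency on the client side so your own requests stay well below it. A rough sketch using a bounded thread pool follows; fetch_all and max_in_flight are hypothetical names, and 8 is an arbitrary illustrative value. Other clients share the same per-container limit, so this only bounds your own contribution.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, keys, max_in_flight=8):
    """Run fetch(key) for every key with at most max_in_flight calls at once."""
    # The pool size is the cap: no more than max_in_flight requests are
    # outstanding at any moment, however many keys there are.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() returns the results in the same order as keys.
        return list(pool.map(fetch, keys))
```

Here fetch would be whatever function performs a single read against the server. This can be combined with a retry policy for the occasional 503 that still gets through.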
Let me know if this helps!
Very helpful, thank you!
I am trying to create an HSDS to serve data for my application. When I make multiple parallel gets to the service from my application, I frequently get 503s. I can mitigate this problem by implementing a generous retry policy in my application, but my question is: Is this expected? Are there any ways to improve the availability or reliability of the HSDS? I see in the data owner's documentation mention of the same error I am seeing, but it isn't clear to me exactly what the problem is or how my HSDS deployment could be more robust to parallel requests.
To reproduce: Start an HSDS container using the latest image:
docker-compose.yml
Then make parallel requests for data using a client, in this case h5pyd:
This will (usually) fail with:
This indicates the client is receiving a 503 from the HSDS. The log file hs.log does not indicate that any sort of error occurred in the HSDS container. Adding retries usually allows me to get the data I need from the HSDS, but I am trying to understand why the service is unavailable and whether there is a way to configure the HSDS to be more robust to parallel requests.