TranslatorSRI / NodeNormalization

Service that produces Translator-compliant nodes given a CURIE
MIT License

NodeNorm ITRB Prod suffering intermittent request failures (503s and 504s) #175

Closed · gaurav closed this 1 year ago

gaurav commented 1 year ago

On Translator Slack, people have reported NodeNorm requests failing with 503 Service Unavailable or 504 Gateway Timeout. This may be caused by Redis issues (requests taking too long or timing out), NodeNorm issues, or Nginx issues on ITRB.

I think we should set up a stress-test system for NodeNorm that makes a large number of simultaneous queries, to see whether we can replicate this issue (probably against RENCI or NodeNorm ITRB Test rather than NodeNorm ITRB Prod), and then look at the logs to figure out why it is happening.
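A rough load-test sketch along those lines (not an existing harness): fire many concurrent POSTs at a NodeNorm instance and tally the status codes. The base URL, payload shape, and concurrency numbers below are assumptions to be adjusted for the instance under test, which should be a test deployment rather than ITRB Prod.

```python
# Hypothetical stress-test sketch: concurrent POSTs to a NodeNorm test instance.
import asyncio
from collections import Counter

import httpx

BASE_URL = "https://nodenorm-test.example.org"   # placeholder test instance
CONCURRENCY = 200                                # simultaneous in-flight requests
REQUESTS = 2000                                  # total requests to send
PAYLOAD = {"curies": ["MESH:D014867", "CHEBI:15377"]}  # example CURIE batch

async def one_request(client: httpx.AsyncClient, sem: asyncio.Semaphore, counts: Counter) -> None:
    async with sem:
        try:
            resp = await client.post(f"{BASE_URL}/get_normalized_nodes", json=PAYLOAD, timeout=30)
            counts[resp.status_code] += 1
        except httpx.HTTPError as exc:
            # Connection errors and timeouts get counted by exception type.
            counts[type(exc).__name__] += 1

async def main() -> None:
    counts: Counter = Counter()
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(one_request(client, sem, counts) for _ in range(REQUESTS)))
    print(counts)  # e.g. Counter({200: 1980, 503: 15, 'ReadTimeout': 5})

if __name__ == "__main__":
    asyncio.run(main())
```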

YaphetKG commented 1 year ago

After some intensive load testing on our servers in a test environment, we have the following upcoming changes:

  1. Our deployments run two replica sets, one serving the root path and another serving the /1.3 path; each had 7 instances that could scale to 10. Most users hit the root path and the /1.3 path is rarely used, so we have reclaimed the resources from the /1.3 replica set and re-routed its traffic to the root-path replica set. That set now starts with 10 pods and scales to 20, using CPU and memory usage as the signal for heavy traffic.
  2. We have switched our web server from uvicorn to gunicorn, following best practice for production deployments (a sketch of such a configuration follows this list).
  3. We have implemented internal batching for Redis calls so the process takes breaks while fetching data; under heavy traffic there are still cycles in which the async threads are free to receive new requests (see the batching sketch after this list).
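For item 2, a gunicorn deployment of an ASGI app typically keeps uvicorn in the picture as the worker class, with gunicorn supervising the worker processes. A minimal `gunicorn.conf.py` along those lines might look like the sketch below; the worker count, bind address, and timeouts are illustrative, not the actual NodeNorm settings.

```python
# gunicorn.conf.py -- illustrative settings only, not NodeNorm's real config.
# gunicorn manages the worker processes; each worker runs the ASGI app
# through uvicorn's worker class.
bind = "0.0.0.0:8080"
workers = 4                                     # scale with available CPU
worker_class = "uvicorn.workers.UvicornWorker"  # run the ASGI app under uvicorn
timeout = 120                                   # seconds before a hung worker is restarted
keepalive = 5                                   # seconds to hold idle keep-alive connections
```

This would be started with something like `gunicorn <module:app> -c gunicorn.conf.py`, where `<module:app>` is the application's import path.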
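For item 3, a minimal sketch of the chunked-lookup idea, assuming an async Redis client: fetch keys in fixed-size chunks and yield to the event loop between chunks so new requests can be accepted. The names `redis_conn` and `CHUNK_SIZE` are illustrative, not NodeNorm's actual code.

```python
# Hypothetical internal batching for Redis lookups.
import asyncio
from typing import Any, List, Optional

CHUNK_SIZE = 1000  # assumed batch size; tune for the workload

async def batched_mget(redis_conn, keys: List[str]) -> List[Optional[Any]]:
    """Fetch keys in fixed-size chunks, yielding to the event loop between chunks."""
    results: List[Optional[Any]] = []
    for start in range(0, len(keys), CHUNK_SIZE):
        chunk = keys[start:start + CHUNK_SIZE]
        results.extend(await redis_conn.mget(chunk))
        # Give other coroutines (e.g. newly arriving requests) a chance to run
        # before issuing the next Redis call.
        await asyncio.sleep(0)
    return results
```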

Next steps: in the ITRB environment our storage is a Redis cluster, and the Python redis-cluster driver used by this web server is not currently async. We want to refactor it to use an async redis-cluster library, which would remove that blocking. The changes above should reduce the 50x errors, but moving to async would be a big lift as well.
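As a hedged sketch of what that refactor could look like, redis-py (4.3+) ships an asyncio cluster client; the host, port, and keys below are placeholders rather than NodeNorm's real configuration, and the per-key GETs are just to keep the example simple (cross-slot MGET is restricted in cluster mode).

```python
# Hypothetical async redis-cluster lookup using redis-py's asyncio support.
import asyncio

from redis.asyncio.cluster import RedisCluster

async def fetch_identifiers(curies):
    # The async context manager initializes the cluster client and closes it on exit.
    async with RedisCluster(host="redis-cluster.example.org", port=6379,
                            decode_responses=True) as rc:
        # Issue the GETs concurrently without blocking the event loop.
        return await asyncio.gather(*(rc.get(curie) for curie in curies))

if __name__ == "__main__":
    print(asyncio.run(fetch_identifiers(["MESH:D014867", "CHEBI:15377"])))
```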

gaurav commented 1 year ago

It looks like this has been fixed now (as confirmed by @cbizon's UptimeRobot at https://stats.uptimerobot.com/g9MlwHqXOy). Closing.