CDLUC3 / ezid

Resolve EZID performance issues #457

Open jsjiang opened 1 year ago

jsjiang commented 1 year ago

EZID users occasionally get 5XX errors when the system is overloaded. While discussing the 502 errors reported by the Merritt team, we came up with some ideas and areas that may help improve EZID performance.

5XX error code:

Outcomes from Mark, Ashley and Jing's meeting on 8/3:

Action items:

  1. Write custom WAF rules to reduce malicious requests - Jing with Ashley's support #452
  2. Replicate 502 error in the EZID stage environment - Mark & Jing (targeting the minting operation that requires read/write access to both the Berkeley DB and MySQL) #453
  3. Refactor Merritt 502 error handling - Mark (retry the minting operation when receiving a 502 error)
  4. Migrate Berkeley DB to MySQL - Jing
  5. Develop a testing/evaluating process before proceeding to the following options - Jing with Ashley's support - Performed load tests using Locust against ezid-dev/stg/prd; documented test results (2023-08)
  6. Adjust Apache/mod_wsgi rate limiting
  7. Adjust mod_wsgi keep-alive and ALB timeout settings
  8. Increase Apache concurrent requests limit
  9. Upgrade EC2 instance #451
  10. Explore other AWS tools/technologies for request limit control, such as API Gateway throttling settings
  11. Refactor EZID search function to limit result size and reduce memory usage on RDS (#446)
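
For item 3, a minimal sketch of what client-side retry on a 5XX could look like. This is not Merritt's or EZID's actual code: `TransientServerError`, `retry_on_5xx`, and the default attempt count and delays are all hypothetical names and values chosen for illustration. It assumes the caller wraps the minting request in a function that raises `TransientServerError` when the response status is 5XX.

```python
import random
import time


class TransientServerError(Exception):
    """Hypothetical exception the caller raises for a retryable 5XX response."""

    def __init__(self, status):
        super().__init__(f"server returned {status}")
        self.status = status


def retry_on_5xx(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    operation: hypothetical zero-argument callable that returns a result,
               or raises TransientServerError on a 5XX response.
    sleep:     injectable so the wait can be skipped or recorded in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServerError:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Exponential backoff (0.5s, 1s, 2s, ...) plus up to 50% jitter,
            # so retrying clients do not all hit the server at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

One caveat with retrying a mint: a 502 from the load balancer does not by itself prove that the backend did not process the request, so a retry may create a second identifier. Whether that is acceptable depends on the caller.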

Originally posted by @jsjiang in https://github.com/CDLUC3/ezid/issues/161#issuecomment-1664690982

jsjiang commented 5 months ago

High Availability implementation should have improved EZID performance. Close this ticket for now. Reopen if needed.