CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

[UI] UI workers occasionally stop serving requests #724

Closed mreyescdl closed 2 years ago

mreyescdl commented 3 years ago

Root cause of the unresponsiveness has not been found. Logs show activity prior to the problem, but nothing excessive. Librato does not show the host being short on resources.

A simple restart of Puma fixes the issue.

Occurrences

Possible fixes

No effect

Next Steps

Future Ideas

terrywbrady commented 3 years ago

UI04 on 8/20

I, [2021-08-20T15:27:15.408052 #17669]  INFO -- : [ae554367-238b-4fc4-b8db-9ae1d350122c] Completed 200 OK in 85ms (ActiveRecord: 3.6ms)
I, [2021-08-20T16:04:53.607467 #10262]  INFO -- : [afea795e-3a9a-4dfd-b3e5-b28f6fea0c68] Started GET "/" for 172.30.28.241 at 2021-08-20 16:04:53 -0700

Nothing notable in the journal during this time. Restarted Puma after a Nagios alert.
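The two log lines above are roughly 37 minutes apart and come from different process IDs (#17669, then #10262). A minimal sketch for spotting gaps like this in a log, assuming the Ruby Logger line format shown above. The `find_gaps` name and the 5-minute default threshold are illustrative choices, not part of any Merritt tooling.

```ruby
require "time"

# Matches the timestamp and PID in lines such as:
#   I, [2021-08-20T15:27:15.408052 #17669]  INFO -- : ...
LOG_STAMP = /\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #(\d+)\]/

# Returns one hash per pair of consecutive log lines that are more than
# threshold_seconds apart. The threshold is an arbitrary illustrative default.
def find_gaps(lines, threshold_seconds: 300)
  stamps = lines.filter_map do |line|
    m = LOG_STAMP.match(line)
    # The log prints local time without a zone; it is parsed as UTC here,
    # which is fine for differences except across a DST change.
    m && { time: Time.iso8601(m[1] + "Z"), pid: m[2].to_i }
  end

  stamps.each_cons(2).filter_map do |a, b|
    gap = b[:time] - a[:time]
    next if gap <= threshold_seconds

    { from: a[:time], to: b[:time], seconds: gap.round, pid_changed: a[:pid] != b[:pid] }
  end
end
```

It could be run as `find_gaps(File.readlines(path_to_log))`, where `path_to_log` is whatever log file the UI writes to on the host.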

elopatin-uc3 commented 3 years ago

Another alert appeared over the weekend, on 8/22. This time, however, the issue cleared up on its own: Nagios noted a recovery roughly 20 minutes after the first alert.

Additional notes:

mreyescdl commented 2 years ago

Single mode vs. clustered mode: we are running in single mode with the default maximum of 5 threads. There is a lot to take in on this subject, but here is a nice synopsis:

But what does it all mean? So, if you’ve been paying attention so far, you’ve realized that a scalable Ruby web application needs slow client protection in the form of request buffering, and slow response protection in the form of some kind of concurrency - either multithreading or multiprocess/forking (preferably both). That only leaves Puma in clustered mode and Phusion Passenger 5 as scalable solutions for Ruby applications on Heroku running MRI/C Ruby. If you’re running your own setup, Unicorn with nginx becomes a viable option.

Source: https://www.speedshop.co/2015/07/29/scaling-ruby-apps-to-1000-rpm.html
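For reference, a minimal sketch of what clustered mode could look like in a Puma config file. The worker and thread counts are placeholders rather than tuned values for the UI hosts, and the `WEB_CONCURRENCY` and `RAILS_MAX_THREADS` variable names are assumed from the convention used in Rails-generated configs, not taken from our deployment.

```ruby
# config/puma.rb (illustrative values only, not the current UI settings)

# Clustered mode: fork this many worker processes. A value of 0 keeps Puma
# in single mode, which is what we run today.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

# Each worker gets its own thread pool (min, max).
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

# Load the application before forking so workers can share memory
# through copy-on-write.
preload_app!
```

With these placeholder numbers the host could serve up to 2 x 5 requests concurrently, and one stuck worker process would not stop the other from answering.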

mreyescdl commented 2 years ago

Please run the following on the worker before restarting the service:

$ netstat -an | egrep 'LISTEN|ESTABLISHED'

$ ps -efa

Check how many Puma threads there are and what state they are in:

$ htop -u dpr2 -p $(cat /dpr2/apps/ui/current/pid/puma.pid)
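To make this capture repeatable when an alert fires, the commands could be wrapped in a small script that saves their output to a timestamped file before the restart. A minimal sketch: the `capture_diagnostics` name and the output directory are hypothetical, and because `htop` is interactive, the commented usage substitutes `ps -L` (assuming Linux procps) to list threads.

```ruby
require "open3"
require "fileutils"

# Runs each labelled command and writes its combined stdout/stderr to one
# timestamped file under out_dir. Returns the path of the file written.
def capture_diagnostics(commands, out_dir:)
  FileUtils.mkdir_p(out_dir)
  stamp = Time.now.strftime("%Y%m%dT%H%M%S")
  path = File.join(out_dir, "puma-diag-#{stamp}.txt")

  File.open(path, "w") do |f|
    commands.each do |label, argv|
      f.puts "== #{label}: #{argv.join(' ')}"
      begin
        output, status = Open3.capture2e(*argv)
        f.puts output
        f.puts "(exit status #{status.exitstatus})"
      rescue SystemCallError => e
        # Command missing or not executable on this host: record it and move on.
        f.puts "(could not run: #{e.message})"
      end
    end
  end
  path
end

# Possible usage on a worker (paths other than puma.pid are hypothetical):
#
#   pid = File.read("/dpr2/apps/ui/current/pid/puma.pid").strip
#   capture_diagnostics(
#     {
#       "sockets"   => %w[netstat -an],
#       "processes" => %w[ps -efa],
#       "threads"   => ["ps", "-L", "-p", pid]
#     },
#     out_dir: "/tmp/puma-diag"
#   )
```

Writing everything to a single file keeps the socket, process, and thread snapshots from the same moment together, which makes it easier to compare one incident with the next.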

elopatin-uc3 commented 2 years ago

After discussion with Scott and Ryan, Dryad will be adding a retry for presigned URL requests: https://github.com/CDL-Dryad/dryad-product-roadmap/issues/1530
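As a rough illustration of the retry idea (this is not Dryad's implementation; see the linked issue for that), a client-side wrapper could retry on timeouts with a short exponential backoff. The `with_retries` name, the attempt count, and the delays are all assumptions.

```ruby
require "net/http"
require "timeout"

# Errors treated as transient. Net::OpenTimeout and Net::ReadTimeout are
# subclasses of Timeout::Error; they are listed for readability.
RETRYABLE = [Net::OpenTimeout, Net::ReadTimeout, Timeout::Error, Errno::ECONNRESET].freeze

# Yields the attempt number to the block and returns the block's value.
# Retries on a RETRYABLE error, sleeping base_delay, 2*base_delay, ... between
# attempts, and re-raises the error once attempts are exhausted.
def with_retries(attempts: 3, base_delay: 0.5)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *RETRYABLE
    raise if tries >= attempts

    sleep(base_delay * (2**(tries - 1)))
    retry
  end
end

# Possible usage (presign_url is a hypothetical variable holding the request URL):
#
#   response = with_retries { Net::HTTP.get_response(URI(presign_url)) }
```

A retry only hides the symptom from the caller, so timeouts would still need to be logged in order to match them against the ALB logs.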

We will also request ALB logging, and Ryan will continue to log timeouts to a spreadsheet. We'll match those against any new logging that is enabled.

terrywbrady commented 2 years ago

@elopatin-uc3, I recommend that we split the problem Dryad is seeing out into its own issue. It may have the same root cause, but the symptoms are different.