Closed by mreyescdl 2 years ago
UI04 on 8/20
I, [2021-08-20T15:27:15.408052 #17669] INFO -- : [ae554367-238b-4fc4-b8db-9ae1d350122c] Completed 200 OK in 85ms (ActiveRecord: 3.6ms)
I, [2021-08-20T16:04:53.607467 #10262] INFO -- : [afea795e-3a9a-4dfd-b3e5-b28f6fea0c68] Started GET "/" for 172.30.28.241 at 2021-08-20 16:04:53 -0700
No interesting journal data during this time. Restarted Puma after a Nagios alert.
Another alert appeared on 8/22, over the weekend. This time, however, the issue cleared up on its own: Nagios noted a recovery roughly 20 minutes after the first alert.
Additional notes:
Single mode vs Clustered mode. We are running in single mode with the default 5 threads max. A lot to take in with this subject, but here is a nice synopsis:
But what does it all mean? So, if you’ve been paying attention so far, you’ve realized that a scalable Ruby web application needs slow client protection in the form of request buffering, and slow response protection in the form of some kind of concurrency - either multithreading or multiprocess/forking (preferably both). That only leaves Puma in clustered mode and Phusion Passenger 5 as scalable solutions for Ruby applications on Heroku running MRI/C Ruby. If you’re running your own setup, Unicorn with nginx becomes a viable option.
Source: https://www.speedshop.co/2015/07/29/scaling-ruby-apps-to-1000-rpm.html
Please run the following on the worker before restarting the service:
$ netstat -an | grep -E 'LISTEN|ESTABLISHED'
$ ps -efa
Check how many Puma threads there are and what state each one is in:
$ htop -u dpr2 -p $(cat /dpr2/apps/ui/current/pid/puma.pid)
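To make the captured socket listing easier to compare between incidents, it can be reduced to a count per TCP state. This is only a minimal sketch. It assumes net-tools-style `netstat -an` output, where the state is the sixth column of `tcp` lines. The function name `count_tcp_states` is hypothetical and not part of any existing tooling on the worker.

```shell
#!/bin/sh
# count_tcp_states (hypothetical helper): read `netstat -an`-style output on
# stdin and print "<STATE> <count>" for every TCP state seen, sorted by name.
# Assumed column layout: proto recv-q send-q local-addr foreign-addr state
count_tcp_states() {
    awk '$1 ~ /^tcp/ && NF >= 6 { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage on the worker, before restarting the service:
#   netstat -an | count_tcp_states
```

A large ESTABLISHED count next to an idle Rails log would suggest connections are being accepted but not served, which is worth recording before the restart clears the evidence.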
After discussion with Scott and Ryan, Dryad will be adding a retry for presigned URL requests: https://github.com/CDL-Dryad/dryad-product-roadmap/issues/1530
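For illustration only, a retry with exponential backoff can be sketched in shell as below. This is not Dryad's implementation, which is tracked in the linked issue. The `retry` function, its argument order, and the `PRESIGNED_URL_ENDPOINT` variable in the usage comment are all assumptions made for the example.

```shell
#!/bin/sh
# retry (hypothetical helper): retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS command...
# Runs the command until it succeeds or MAX_ATTEMPTS is reached, doubling
# the pause after each failed attempt.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Usage (PRESIGNED_URL_ENDPOINT is a placeholder, not a real setting):
#   retry 3 1 curl --fail --silent --max-time 10 "$PRESIGNED_URL_ENDPOINT"
```

Capping the number of attempts keeps a client from piling more requests onto a server that is already unresponsive.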
We will also request ALB logging, and Ryan will continue to log timeouts to a spreadsheet. We'll match those against any new logging that is enabled.
@elopatin-uc3, I recommend that we break out the issue that Dryad is seeing from this issue. It may be the same root cause, but the symptoms are different.
Root cause of unresponsiveness not found. Logs show activity prior to the problem, but nothing excessive. Librato does not show the host running short of resources.
A simple restart of Puma fixes the issue.
Occurrences
Possible fixes
No effect
Next Steps
TMPDIR=...
Future Ideas