[UI] UI workers occasionally stop serving requests

mreyescdl commented 3 years ago

Root cause of unresponsiveness not found. Logs show activity prior to the problem, but nothing excessive. Librato does not show the host being resource-constrained.

A simple restart of Puma fixes the issue.

Occurrences

June 16th at 08:38 [UI03]
July 14th at 20:48 [UI03]
August 20th at 15:27 [UI04]
August 26th at 10:42 [UI03]
September 21 at 10:30 [UI03] - ETD Harvest in progress
September 25 at 6:53 [UI04]
September 25 at 6:41 [UI03]
September 27 at 10:30 [UI04] - ETD Harvest in progress
Oct 9 at 11:01 [UI03]
Oct 15 at 02:18:50 [UI03]
2021-10-19T10:29:56 [UI03] - ETD Harvest in progress
2021-10-19T10:29:56 [UI04] - ETD Harvest in progress
2021-10-21T10:18:25 [UI03] - ETD Harvest, UCB Harvest of a new collection
2021-11-06T12:12:17 [UI03] - no crawls in progress
2021-11-27T16:35 [UI04]
2021-11-25T03:42 [UI05]

Possible fixes

No effect

[x] Try to crash ui05 by querying open context while a crawl is in progress
[x] Ask eschol to crawl ui05 to see if we can reproduce

Next Steps

[x] Attempt puma/gem update
- See Puma PR 2613
[x] Modify robots.txt to discourage crawling
[x] Ask eschol team to harvest only one collection at a time
[x] Confirm complete
[x] /tmp space on ui boxes. Can we re-configure how puma is using tmp space?
- Ashley will implement this change on 10/21
- Systemd file: TMPDIR=...

Future Ideas

Explore cluster mode for puma
Implement Rack::Attack to throttle crawls
- https://github.com/CDLUC3/mrt-doc/issues/802
- https://github.com/CDLUC3/mrt-dashboard/pull/107
Explore parameterized page sizes for the atom feed (also requested by UCB). Could this improve efficiency of a crawl?
Unpack prior log files to see if an eschol harvest or open context or other paginated command precedes the crash
More aggressive Nagios monitoring and restarts
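
A rough sketch of what the Rack::Attack idea above could look like as a Rails initializer. The limits, the periods, and the `/feed` path prefix are all placeholders for illustration, not values taken from mrt-dashboard; real numbers would need to come from looking at harvest traffic.

```ruby
# config/initializers/rack_attack.rb
# Sketch only: limits, periods, and the "/feed" prefix are assumptions.
class Rack::Attack
  # Allow each client IP at most 60 requests per 60 seconds overall.
  throttle("req/ip", limit: 60, period: 60) do |req|
    req.ip
  end

  # Tighter limit for paginated feed requests (hypothetical path prefix).
  # Returning nil for other paths means this rule does not apply to them.
  throttle("feed/ip", limit: 10, period: 60) do |req|
    req.ip if req.path.start_with?("/feed")
  end
end
```

Throttled requests get a 429 response by default. As far as I know, recent rack-attack releases add the middleware to Rails automatically, but that is worth confirming against the gem version we end up on.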

terrywbrady commented 3 years ago

UI04 on 8/20

I, [2021-08-20T15:27:15.408052 #17669]  INFO -- : [ae554367-238b-4fc4-b8db-9ae1d350122c] Completed 200 OK in 85ms (ActiveRecord: 3.6ms)
I, [2021-08-20T16:04:53.607467 #10262]  INFO -- : [afea795e-3a9a-4dfd-b3e5-b28f6fea0c68] Started GET "/" for 172.30.28.241 at 2021-08-20 16:04:53 -0700

No interesting journal data during this time. Restarted puma after a Nagios alert.
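
To find windows like this one in older logs, something along these lines might help. It is a minimal sketch that assumes every line of interest uses the Ruby Logger prefix shown above; `log_gaps` and the 10-minute default threshold are made up for illustration.

```ruby
require "time"

# Matches the timestamp in a Ruby Logger-formatted line, e.g.
# I, [2021-08-20T15:27:15.408052 #17669]  INFO -- : ...
STAMP = /\A\w, \[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #\d+\]/

# Returns one hash per pair of consecutive timestamped lines that are at
# least min_gap_seconds apart. Lines without a timestamp are skipped.
def log_gaps(lines, min_gap_seconds: 600)
  stamps = lines.map do |line|
    match = STAMP.match(line)
    # The log timestamps carry no zone; treating them all as UTC is fine
    # here because only the differences between them matter.
    Time.iso8601("#{match[1]}Z") if match
  end.compact

  stamps.each_cons(2).map do |earlier, later|
    gap = later - earlier
    { from: earlier, to: later, seconds: gap } if gap >= min_gap_seconds
  end.compact
end
```

Fed the two lines quoted above, this reports a single gap of about 2258 seconds (roughly 37 minutes 38 seconds). For a whole file, `log_gaps(File.foreach(path))` should work, where `path` is whichever unpacked log is being checked.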

elopatin-uc3 commented 3 years ago

Another alert appeared over the weekend on 8/22. This time, however, the issue cleared up on its own: Nagios noted a recovery roughly 20 minutes after the first alert.

Additional notes:

We're a few versions behind on Puma (5.3.2 currently; there is a version 5.4.0 available); no Dependabot alerts, but perhaps we should look into updating.
If/when this occurs again, we should look into the amount of traffic on the site. Was Dryad impacted at all?

mreyescdl commented 2 years ago

Single mode vs Clustered mode. We are running in single mode with the default 5 threads max. A lot to take in with this subject, but here is a nice synopsis:

But what does it all mean? So, if you’ve been paying attention so far, you’ve realized that a scalable Ruby web application needs slow client protection in the form of request buffering, and slow response protection in the form of some kind of concurrency - either multithreading or multiprocess/forking (preferably both). That only leaves Puma in clustered mode and Phusion Passenger 5 as scalable solutions for Ruby applications on Heroku running MRI/C Ruby. If you’re running your own setup, Unicorn with nginx becomes a viable option.

Source: https://www.speedshop.co/2015/07/29/scaling-ruby-apps-to-1000-rpm.html
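
For comparison with our single-mode setup, here is a minimal sketch of a clustered Puma config. The worker count and timeout are assumptions for illustration, not tuned values; the thread range just keeps the maximum of 5 we run with now.

```ruby
# config/puma.rb
# Sketch only: values are assumptions, not our current settings.

# Cluster mode: fork this many worker processes.
# Leaving workers at 0 (the default) is single mode.
workers 2

# Each worker runs its own thread pool (min, max).
threads 1, 5

# Load the app before forking so workers can share memory via copy-on-write.
preload_app!

# Restart a worker that has not checked in with the master process
# within this many seconds.
worker_timeout 60
```

With more than one worker, a hung worker process would no longer take the whole host out of service, which seems to be the main appeal for this issue. Memory use per host would go up roughly with the worker count.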

mreyescdl commented 2 years ago

Please run the following on the worker before restarting the service:

$ netstat -an | grep -E 'LISTEN|ESTABLISHED'

$ ps -efa

Check how many Puma threads there are and what state they are in: $ htop -u dpr2 -p $(cat /dpr2/apps/ui/current/pid/puma.pid)
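
So the output is not lost once Puma is restarted, the checks above could be wrapped in a small script that writes each command's output to a file. This is only a sketch: `capture_diagnostics` and `default_commands` are made-up names, the output directory is up to the caller, and since htop is interactive it substitutes `ps -L` (one row per thread) for the thread-state check.

```ruby
require "open3"
require "fileutils"

# Commands to capture, keyed by the label used for the output file name.
# `ps -L` stands in for htop here because htop needs a terminal.
def default_commands(pid)
  {
    "netstat" => ["netstat", "-an"],
    "ps"      => ["ps", "-efa"],
    "threads" => ["ps", "-L", "-o", "pid,lwp,stat,wchan:20,comm", "-p", pid.to_s]
  }
end

# Runs each command, writes its combined stdout/stderr to
# <out_dir>/<label>.txt, and returns the list of files written.
def capture_diagnostics(commands, out_dir)
  FileUtils.mkdir_p(out_dir)
  commands.map do |label, argv|
    begin
      output, status = Open3.capture2e(*argv)
      body = "(exit #{status.exitstatus})\n#{output}"
    rescue SystemCallError => e
      # Record the failure (e.g. command not installed) and keep going.
      body = "(could not run: #{e.message})\n"
    end
    path = File.join(out_dir, "#{label}.txt")
    File.write(path, "$ #{argv.join(' ')}\n#{body}")
    path
  end
end
```

Usage would be something like `capture_diagnostics(default_commands(File.read("/dpr2/apps/ui/current/pid/puma.pid").strip), out_dir)`, with `out_dir` pointing somewhere that has free space.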

elopatin-uc3 commented 2 years ago

After discussion with Scott and Ryan, Dryad will be adding a retry for presigned URL requests: https://github.com/CDL-Dryad/dryad-product-roadmap/issues/1530

We will also request ALB logging, and Ryan will continue to log timeouts to a spreadsheet. We'll match those against any new logging that is enabled.

terrywbrady commented 2 years ago

@elopatin-uc3, I recommend that we split the issue Dryad is seeing out of this one. It may have the same root cause, but the symptoms are different.
