ericfranz opened 8 years ago
A note: when the process hangs like this, sometimes not even SIGKILL will work. Here I ran `kill -9 49811`, but the process is still there.
Is the `kill -9 49811` off the screen? Because in the left window I see you drop out of the wait in the current process, but you never touch the actual process doing the file listing.
You might need to spawn a thread with a `sleep`; if it reaches the end of the `sleep`, you `Process.kill(9, 49811)`. If the `waitpid` returns a result before then, you terminate the sleeping thread.
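A minimal sketch of that watchdog idea, assuming the hanging work has already been forked into a child process (the `ls` child and the 5-second budget here are illustrative):

```ruby
pid = Process.spawn('ls', '/fs/project') # stand-in for the hanging listing

# Watchdog: if the sleep runs to completion, SIGKILL the child.
watchdog = Thread.new do
  sleep 5
  begin
    Process.kill(9, pid)
  rescue Errno::ESRCH
    # child already exited; nothing to kill
  end
end

Process.waitpid(pid) # returns as soon as the child exits
watchdog.kill        # waitpid won the race: cancel the sleeping thread
```

As noted above, if the child itself is in uninterruptible sleep even the SIGKILL may not take effect, and `waitpid` will keep blocking.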
Ah, you could also rescue the exception thrown by `Timeout` and just `Process.kill` then.
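That variant might look like this sketch (same illustrative child and budget):

```ruby
require 'timeout'

pid = Process.spawn('ls', '/fs/project') # stand-in for the hanging listing

begin
  Timeout.timeout(5) { Process.waitpid(pid) }
rescue Timeout::Error
  Process.kill(9, pid) rescue nil # may have no effect on a D-state child
  Process.detach(pid)             # reap asynchronously instead of blocking
end
```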
Yes, the `kill -9 49811` is off the screen.
Another option that might work: the point of the `Dir.open(outdir).close` was to force NFS to refresh the files in the directory. We could use a similar method here: fork, do the `Dir.open(path).close` in the child, and just check whether that process hangs; if it doesn't, you might be good to go.
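A sketch of that fork-and-probe idea (the path and the 2-second budget are illustrative):

```ruby
path = '/fs/project'
pid  = fork { Dir.open(path).close } # probe NFS from a disposable child

deadline = Time.now + 2
healthy  = false
while Time.now < deadline
  if Process.waitpid(pid, Process::WNOHANG) # non-blocking reap
    healthy = true
    break
  end
  sleep 0.1
end
Process.detach(pid) unless healthy
# healthy == false means the probe hung: treat the mount as unavailable
```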
For a period of time in production, all of the OnDemand and AweSim dashboards became unresponsive because of this, to the point of returning a Proxy Error. The dashboard was completely unresponsive, and SUG is going on.
This is the first time I've observed this problem outside of downtimes where hiccups are expected.
Several suggestions have been made.
A solution for this is to create a CircuitBreaker (https://martinfowler.com/bliki/CircuitBreaker.html) but execute the block of Ruby in a fork. The return value could be passed back from the child to the parent process using a pipe (https://ruby-doc.org/core-2.3.0/IO.html#method-c-pipe), with the data marshaled across it (http://ruby-doc.org/core-2.0.0/Marshal.html).
The return value of the block is likely to be a boolean in this case, but a general-purpose circuit breaker like this could be used in other cases that touch the file system.
When using the circuit breaker, you pass the block along with the value to return if the breaker trips. One tricky part is that the circuit breaker would also need a way to cache that value per block, so that on a future request the tripped return value is used instead of re-executing the block.
This probably wouldn't address the situation where `kill -9` fails, but it may work in other cases.
In the case above, this would be the block executed in the fork:

```ruby
candidate_favorite_paths.select { |p| p.directory? && p.readable? && p.executable? }
```
And of course the return value would be an array of `Pathname` objects.
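A minimal sketch of such a breaker, under stated assumptions: the class name, the `key` used to cache tripped values, the in-process cache, and the 2-second budget are all illustrative, not a definitive design:

```ruby
require 'timeout'

# Hypothetical breaker: run the block in a forked child, marshal the
# result back over a pipe, and fall back to (and cache) tripped_value
# when the child doesn't answer in time.
class CircuitBreaker
  @tripped = {} # key => cached tripped value

  def self.run(key, tripped_value, timeout: 2)
    return @tripped[key] if @tripped.key?(key)

    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.write(Marshal.dump(yield)) # child runs the block
      writer.close
    end
    writer.close

    Timeout.timeout(timeout) do
      result = Marshal.load(reader.read) # parent blocks interruptibly here
      Process.waitpid(pid)
      result
    end
  rescue Timeout::Error
    Process.kill(9, pid) rescue nil # may have no effect on a D-state child
    Process.detach(pid)             # reap asynchronously; never block here
    @tripped[key] = tripped_value
  ensure
    reader&.close
  end
end
```

Usage for the block above might then look like:

```ruby
favorites = CircuitBreaker.run(:favorite_paths, []) do
  candidate_favorite_paths.select { |p| p.directory? && p.readable? && p.executable? }
end
```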
At OSC we added a check in https://github.com/OSC/osc-ood-config/pull/177 to only modify the favorite candidate path if the Prometheus metric is 1 (failing).
We (OSC) monitor the filesystem through a crontab, which seems to be interruptible (perhaps because it has no TTY?). I've found no viable way to check directly, because nearly every command or similar hangs.
In any case, that may be the only viable thing here: to toggle this functionality based on a file that was generated by some other program. That file then indicates whether there's an issue. In our case it's a Prometheus-formatted file that we grep. Obviously this specific scheme using Prometheus isn't completely portable, so we may have to come up with some convention that is.
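As a sketch of that gating idea (the file path, metric name, and label here are assumptions specific to one setup, not the portable convention we'd still need):

```ruby
# The health file lives on local disk and is written by an external
# monitor (e.g. a cron job), so reading it cannot itself hang on NFS.
HEALTH_FILE = '/var/run/ood/fs_health.prom'.freeze

def filesystem_failing?
  return false unless File.readable?(HEALTH_FILE)

  # e.g. a line like: nfs_mount_failing{mount="/fs/project"} 1
  File.foreach(HEALTH_FILE).any? do |line|
    line.start_with?('nfs_mount_failing') && line.strip.end_with?(' 1')
  end
end

# Only touch the NFS-backed candidate paths when the metric is not 1.
favorite_paths = filesystem_failing? ? [] : candidate_favorite_paths
```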
The issue with this ticket centers on Ruby's implementation of `Timeout` and the fact that the process is in uninterruptible sleep.
https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/
Just to circle back to this: it can happen in any file operation. Notably, if the dashboard boots up correctly (in the initial comment the issue stems from favorite paths during initialization) but the NFS storage later becomes unstable, the dashboard can hang while traversing that NFS storage in the files app.
Here's the stack output of a PUN that had halted waiting for an NFS operation to complete. Note that the dashboard app was still running, so we didn't stop the PUN.
The user then can't log back in because the app is hung in this state.
```
App 1759505 output: # Thread: #<Thread:0x00007f45f3229a08 /opt/ood/ondemand/root/usr/share/gems/3.1/ondemand/3.1.7-1/gems/actionpack-6.1.7.6/lib/action_controller/metal/live.rb:300 sleep>(Worker 1), alive = true
App 1759505 output: ------------------------------------------------------------
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `lstat'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `lstat'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `initialize'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:97:in `new'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:97:in `block in ls'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `each'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `each'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `map'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `ls'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/controllers/files_controller.rb:36:in `block (2 levels) in fs'
App 1759505 output: /opt/ood/ondemand/root/usr/share/gems/3.1/ondemand/3.1.7-1/gems/actionpack-6.1.7.6/lib/action_controller/metal/mime_responds.rb:214:in `respond_to'
App 1759505 output: /var/www/ood/apps/sys/dashboard/app/controllers/files_controller.rb:18:in `fs'
```
Furthermore, looking into this issue, it's likely going to take a lot of work. I'm finding that you can't kill a `Thread` that's in this state. You can kill a `Process`, but that brings all sorts of issues: communicating the result from one process to another, plus scheduling/limiting processes, which implies some sort of queueing mechanism so we don't just fork-bomb ourselves.
See the comment at the bottom of this article. The solution is to add a circuit breaker we can use for file IO in OnDemand.
When a GPFS- or NFS-mounted volume becomes unavailable, doing IO against that volume puts the process in an uninterruptible sleep state.
This happens periodically at OSC when GPFS is unavailable. Doing an `ls /fs/project` or checking for the existence of a directory under `/fs/project` will cause the server to hang. This often happens on the dashboard in this block:
https://github.com/OSC/ood-dashboard/blob/ea579eccff193a8b45f6e6d98b1aa5bcfdf385f2/app/apps/ood_files_app.rb#L20-L24
Naively adding a `Timeout` around this will not work:
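A minimal sketch of that naive approach (assuming the favorite-path block linked above):

```ruby
require 'timeout'

# Timeout's watchdog thread raises into the calling thread, but a
# thread blocked in an uninterruptible (D-state) stat syscall cannot
# receive the raise until the syscall returns, so this still hangs.
Timeout.timeout(2) do
  candidate_favorite_paths.select { |p| p.directory? && p.readable? && p.executable? }
end
```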
However, `Timeout::timeout` will correctly cancel a `Process.wait` call even if the child process being waited on is in an uninterruptible sleep state:
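For example, this sketch (path and budget illustrative) does time out as expected:

```ruby
require 'timeout'

# The potentially hanging IO happens in a forked child; the parent's
# Process.wait sits in *interruptible* sleep, so Timeout can cancel it
# even while the child is stuck in D state.
pid = fork { Dir.open('/fs/project').close }
begin
  Timeout.timeout(2) { Process.wait(pid) }
rescue Timeout::Error
  Process.detach(pid) # give up; SIGKILL may not affect a D-state child
end
```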
We should probably time out when trying to determine whether these directories exist, and perhaps let the user know that we timed out while checking.