OSC / ondemand

Supercomputing. Seamlessly. Open, Interactive HPC Via the Web
https://openondemand.org/
MIT License

Dashboard: Properly timeout when IO to unavailable or unstable volumes result in process in uninterruptible sleep state #240

Open ericfranz opened 8 years ago

ericfranz commented 8 years ago

See the comment at the bottom of this article. The solution is to add a circuit breaker we can use for file IO in OnDemand.

When a GPFS- or NFS-mounted volume becomes unavailable, any process doing IO to that volume is put into an uninterruptible sleep state.

This happens periodically at OSC when GPFS is unavailable. Doing an `ls /fs/project`, or checking for the existence of a directory under `/fs/project`, will cause the server to hang. This often happens on the dashboard in this block:

https://github.com/OSC/ood-dashboard/blob/ea579eccff193a8b45f6e6d98b1aa5bcfdf385f2/app/apps/ood_files_app.rb#L20-L24

  def favorite_paths
    @favorite_paths ||= candidate_favorite_paths.select {|p|
      p.directory? && p.readable? && p.executable?
    }
  end

Naively wrapping this in a Timeout will not work, because Ruby's Timeout cannot interrupt a thread that is stuck in uninterruptible sleep:

  def favorite_paths
    @favorite_paths ||= Timeout::timeout(5) { candidate_favorite_paths.select {|p|
      p.directory? && p.readable? && p.executable?
    }}
  rescue Timeout::Error
    @favorite_paths = []
  end

However, Timeout::timeout will correctly cancel a Process.wait call even if the child process that Process.wait is waiting on is in an uninterruptible sleep state:

irb(main):012:0> Timeout::timeout(5) { `ls /fs/project` }
Timeout::Error: execution expired
        from (irb):12:in ``'
        from (irb):12:in `block in irb_binding'
        from /opt/rh/rh-ruby22/root/usr/share/ruby/timeout.rb:89:in `block in timeout'
        from /opt/rh/rh-ruby22/root/usr/share/ruby/timeout.rb:34:in `block in catch'
        from /opt/rh/rh-ruby22/root/usr/share/ruby/timeout.rb:34:in `catch'
        from /opt/rh/rh-ruby22/root/usr/share/ruby/timeout.rb:34:in `catch'
        from /opt/rh/rh-ruby22/root/usr/share/ruby/timeout.rb:104:in `timeout'
        from (irb):12
        from /opt/rh/rh-ruby22/root/usr/bin/irb:11:in `<main>'

Using fork would let you run Ruby code in a separate process. However, that adds complexity in sending results back from the block to the parent process. Communication could only occur either through the exit status or through pipes: https://ruby-doc.org/core-2.3.0/IO.html#method-c-pipe

Example: checking for project space path existence when the filesystem is unavailable causes a very slow request

The request to load the dashboard when GPFS is unavailable is very slow. Here, it loaded in 1.7 minutes:

![screen shot 2016-10-12 at 5 12 27 pm](https://cloud.githubusercontent.com/assets/512333/19327866/1b18f15e-909f-11e6-9572-647c2ffb8dec.png)

This is because I have two groups that start with `P`, so the dashboard looks for the existence of both `/fs/project/PZS0562` and `/fs/project/PZS0530`.

We should probably time out when trying to determine whether these directories exist, and perhaps let the user know that we timed out while checking.

┆Issue is synchronized with this Asana task by Unito

ericfranz commented 7 years ago

A note: when the process hangs like this, sometimes not even SIGKILL will work. Here, I ran `kill -9 49811`, but the process is still there.

screen shot 2016-12-27 at 2 36 56 pm

nickjer commented 7 years ago

Is the `kill -9 49811` off the screen? Because in the left window I see you drop out of the wait in the current process, but you don't mess with the actual process doing the file listing.

You might need to spawn a thread that sleeps; if it reaches the end of the sleep you `Process.kill(9, 46811)`, and if the `waitpid` returns results before then you terminate the sleep thread.

Ah, you could also rescue the exception thrown by the Timeout and just Process.kill then.
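A rough sketch of that suggestion, using the rescue-the-Timeout variant (the method name, timeout value, and the use of `ls` as the probe are illustrative, not OnDemand code):

```ruby
require 'timeout'

# Run `ls` on a path in a child process. If the child does not finish
# within the timeout (e.g. because it is stuck in uninterruptible sleep
# on a dead mount), rescue the Timeout::Error and SIGKILL the child
# instead of only abandoning the wait.
def ls_with_kill(path, timeout: 5)
  pid = Process.spawn('ls', path, out: File::NULL, err: File::NULL)
  Timeout.timeout(timeout) { Process.wait(pid) }
  $?.success?
rescue Timeout::Error
  begin
    Process.kill('KILL', pid)
  rescue Errno::ESRCH
    # child already exited between the timeout and the kill
  end
  Process.detach(pid) # reap the killed child so it doesn't linger as a zombie
  false
end
```

As noted above, SIGKILL itself may not take effect while the child is in uninterruptible sleep, so even this is not a complete fix.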

ericfranz commented 7 years ago

Yes, kill -9 49811 is off the screen.

ericfranz commented 7 years ago

Another option that might work:

https://github.com/AweSim-OSC/vncsim/blob/760035052f6ca7d2e1ce4e79033370c86fc0fd5f/app/models/session.rb#L267-L271

The point of the Dir.open(outdir).close was to force NFS to refresh the directory's contents. We could use a similar approach: fork, do the Dir.open(path).close, and just check whether the child process hangs; if it doesn't, you might be good to go.
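A sketch of that probe, under the assumption that a non-blocking `waitpid` poll is acceptable (the method name and polling interval are invented for illustration):

```ruby
# Fork a child whose only job is Dir.open(path).close, then poll for its
# exit with a non-blocking waitpid. If the child has not exited by the
# deadline, it is presumably stuck on the mount, so kill it and report
# the path as unresponsive.
def path_responsive?(path, timeout: 5)
  pid = fork do
    begin
      Dir.open(path).close
    rescue SystemCallError
      # a fast error (ENOENT, EACCES, ...) still means the mount responded
    end
    exit!(0)
  end
  deadline = Time.now + timeout
  until Time.now > deadline
    return true if Process.waitpid(pid, Process::WNOHANG)
    sleep 0.05
  end
  Process.kill('KILL', pid) rescue nil
  Process.detach(pid)
  false
end
```

Note that a fast error such as ENOENT is treated as "responsive": the question here is whether the mount answers at all, not whether the directory exists.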

ericfranz commented 7 years ago

In production, for a period of time, all of the OnDemand and AweSim dashboards became unresponsive because of this, to the point of returning a Proxy Error. The dashboard was completely unresponsive, and this happened while SUG was going on.

screen shot 2017-04-06 at 11 54 54 am

This is the first time I've observed this problem outside of downtimes where hiccups are expected.

ericfranz commented 7 years ago

Several suggestions have been made.

  1. Ensure the scan happens only on the index request.
  2. Move the code that gets this list of "favorites" to a separate route that returns JSON or HTML and call it via AJAX.
  3. Until this loads, show "/fs/project" and "/fs/scratch" links with a message "Fetching shortcuts".
  4. Initiate the AJAX request at the bottom of the page (i.e. JavaScript that runs after everything else has loaded).

ericfranz commented 6 years ago

A solution for this is to create a CircuitBreaker (https://martinfowler.com/bliki/CircuitBreaker.html) but execute the block of Ruby in a fork. The return value could be passed back from the child to the parent process using https://ruby-doc.org/core-2.3.0/IO.html#method-c-pipe, with the data serialized using http://ruby-doc.org/core-2.0.0/Marshal.html.

The return value of the block is likely to be a boolean in this case, but a general-purpose circuit breaker like this could be used in other cases that touch the file system.

When using the circuit breaker, you would pass the block along with the value to return if the breaker trips. One tricky part is that the circuit breaker would also need a way to cache the tripped value for a particular block, so that on a future request the tripped return value is used without retrying the hung IO.

This probably wouldn't address the situation where `kill -9` fails, but it may work in other cases.

ericfranz commented 6 years ago

In the case above, this would be the block executed in the fork:

  candidate_favorite_paths.select {|p| p.directory? && p.readable? && p.executable? }

And of course the return value would be an array of Pathname objects.
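A minimal sketch of that fork + pipe + Marshal approach (method and parameter names are invented, and the caching of tripped values a real circuit breaker would need is omitted):

```ruby
require 'timeout'

# Run a block in a forked child, marshal its return value back through a
# pipe, and fall back to a default if the child hangs past the timeout.
def run_isolated(timeout: 5, fallback: nil)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(Marshal.dump(yield))
    writer.close
    exit!(0)
  end
  writer.close # parent closes its copy so the read below can see EOF
  result = Timeout.timeout(timeout) { Marshal.load(reader.read) }
  Process.wait(pid)
  result
rescue Timeout::Error
  Process.kill('KILL', pid) rescue nil
  Process.detach(pid)
  fallback
ensure
  reader.close if reader && !reader.closed?
end
```

Here the Timeout wraps a pipe read in the parent, which is an interruptible sleep, so the parent always gets control back even when the child is stuck in uninterruptible sleep.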

johrstrom commented 3 years ago

At OSC we added a check in https://github.com/OSC/osc-ood-config/pull/177 to only modify the favorite candidate paths if the Prometheus metric is 1 (failing).

We (OSC) monitor the filesystem through a crontab, which seems to be interruptible (perhaps because it has no TTY?). I've found no viable way to check directly, because nearly every command or similar hangs.

In any case, that may be the only viable thing here: toggle this functionality off of a file generated by some other program. That file then indicates whether there's an issue. In our case it's a Prometheus-formatted file that we grep. Obviously this specific scheme using Prometheus isn't completely portable, so we may have to come up with some convention that is.
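As a sketch of that file-based toggle (the file path and the metric name `gpfs_mount_failing` are made up for illustration, not what OSC actually uses):

```ruby
# Decide whether to skip filesystem checks based on a metrics file written
# by an external monitor (e.g. a cron job). The dashboard never touches the
# suspect mount itself; it only reads this local file.
def filesystem_failing?(metrics_file = '/var/run/metrics/gpfs_health.prom')
  return false unless File.readable?(metrics_file)
  # A line such as "gpfs_mount_failing 1" marks the mount as unhealthy.
  File.foreach(metrics_file).any? { |line| line.match?(/\Agpfs_mount_failing\s+1\b/) }
end
```

The dashboard could then return an empty favorites list whenever this returns true, sidestepping the hung IO entirely.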

johrstrom commented 5 months ago

The issue with this ticket is around Ruby's implementation of timeout and the fact that the process is in uninterruptible sleep.

https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/

johrstrom commented 1 month ago

Just to circle back to this: it can happen in any file operation. Notably, even if the dashboard boots up correctly (in the initial comment the issue stems from favorite paths during initialization), the NFS storage can then become unstable, and the dashboard can hang while traversing that NFS storage in the files app.

Here's the stack output of a PUN that had halted waiting for an NFS operation to complete. Note that the dashboard app was still running, so we didn't stop the PUN.

The user then can't log back in because the app is hung in this state.

App 1759505 output: # Thread: #<Thread:0x00007f45f3229a08 /opt/ood/ondemand/root/usr/share/gems/3.1/ondemand/3.1.7-1/gems/actionpack-6.1.7.6/lib/action_controller/metal/live.rb:300 sleep>(Worker 1), alive = true
App 1759505 output: ------------------------------------------------------------
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `lstat'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `lstat'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `initialize'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:97:in `new'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:97:in `block in ls'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `each'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `each'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `map'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `ls'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/controllers/files_controller.rb:36:in `block (2 levels) in fs'
App 1759505 output:     /opt/ood/ondemand/root/usr/share/gems/3.1/ondemand/3.1.7-1/gems/actionpack-6.1.7.6/lib/action_controller/metal/mime_responds.rb:214:in `respond_to'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/controllers/files_controller.rb:18:in `fs'

johrstrom commented 1 month ago

Furthermore, looking into this issue, it's likely going to take a lot of work. I'm finding that you can't kill a Thread that's in this state. You could kill a Process, but that brings all sorts of issues: communicating the result from one process to another, and scheduling/limiting processes, which implies some sort of queueing so we don't fork-bomb ourselves.