OSC / ondemand

Supercomputing. Seamlessly. Open, Interactive HPC Via the Web
https://openondemand.org/
MIT License

dashboard hung on file io #3790

Closed johrstrom closed 2 months ago

johrstrom commented 2 months ago

We're having issues at OSC with PUNs getting into bad states. I looked at a PUN that's been running for some time (over a week at this point). Running kill -3 on a process gave this stack trace, where the dashboard is waiting on File.lstat to return.

App 1759505 output: # Thread: #<Thread:0x00007f45f3229a08 /opt/ood/ondemand/root/usr/share/gems/3.1/ondemand/3.1.7-1/gems/actionpack-6.1.7.6/lib/action_controller/metal/live.rb:300 sleep>(Worker 1), alive = true
App 1759505 output: ------------------------------------------------------------
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `lstat'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `lstat'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:46:in `initialize'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:97:in `new'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:97:in `block in ls'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `each'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `each'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `map'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/models/posix_file.rb:96:in `ls'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/controllers/files_controller.rb:36:in `block (2 levels) in fs'
App 1759505 output:     /opt/ood/ondemand/root/usr/share/gems/3.1/ondemand/3.1.7-1/gems/actionpack-6.1.7.6/lib/action_controller/metal/mime_responds.rb:214:in `respond_to'
App 1759505 output:     /var/www/ood/apps/sys/dashboard/app/controllers/files_controller.rb:18:in `fs'

johrstrom commented 2 months ago

Note that #3511 isn't causing this directly, but it could have uncovered it. If the dashboard hangs, it's not likely to have any more open files. At that point, the old implementation, which used lsof to check for running apps, would have indicated that there are no running apps, and the PUN could be restarted.

But now that we use ps, ps still sees this app as running, and therefore the PUN won't be stopped.

johrstrom commented 2 months ago

I'm able to replicate this in dev when the project NFS drives are behind a firewall (i.e., any attempt to access them hangs forever).

However, if I do this work in another thread, I cannot stop or kill that thread. (When I do the work in the main thread, ctrl+c does not stop the main thread either.) So I'm trying to work out how I can in fact kill a thread when it's in this state.
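
One way to at least keep the caller from blocking, sketched here as an illustration and not as what the dashboard actually does, is to run the lstat call in a forked child process and give up waiting after a timeout. The helper name lstat_with_timeout and the 2-second default are hypothetical. Note the assumptions: a child stuck in uninterruptible I/O on a hung NFS mount may ignore SIGKILL until the I/O returns, so this only frees the parent, and forking from a multi-threaded Rails process has its own caveats.

```ruby
# Hypothetical helper (not part of the dashboard): run File.lstat in a
# forked child so the parent can stop waiting if the filesystem hangs.
def lstat_with_timeout(path, seconds: 2)
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    begin
      stat = File.lstat(path)
      writer.write(Marshal.dump([:ok, stat.size, stat.mode]))
    rescue SystemCallError => e
      writer.write(Marshal.dump([:error, e.class.name]))
    ensure
      writer.close
    end
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  writer.close

  # Wait until the child has written a result (or exited), up to the timeout.
  if IO.select([reader], nil, nil, seconds)
    data = reader.read
    Process.wait(pid)
    data.empty? ? [:error, "no data"] : Marshal.load(data)
  else
    # Best effort only: a child blocked in the kernel may not die
    # until the underlying I/O returns.
    begin
      Process.kill("KILL", pid)
    rescue Errno::ESRCH
      # The child already exited.
    end
    Process.detach(pid) # reap the child in the background when it exits
    [:timeout]
  end
ensure
  reader&.close
end
```

A caller could then treat [:timeout] as "this path is currently unreachable" and skip the entry instead of hanging the whole request.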

johrstrom commented 2 months ago

This is a duplicate of #240