OSC / ood-activejobs

[MOVED] Active Jobs provides details of scheduled jobs on an HPC cluster.
https://osc.github.io/Open-OnDemand/
MIT License

Slurm 18.08 support #169

Closed kcgthb closed 5 years ago

kcgthb commented 5 years ago

Hi there!

We just upgraded our Slurm version to 18.08.3, and it looks like some command output formatting may have changed and broken the listing of jobs.

The active job list is now empty and shows an error message: "undefined method `split' for nil:NilClass"

image

Could you please point us to where we should look to fix this? Thanks!

kcgthb commented 5 years ago

Actually, it looks like it also breaks job submission. The following screenshot shows what happens when submitting a job via our Jupyter interactive app: image

The full stack trace is:

#<NoMethodError: undefined method `split' for nil:NilClass>

/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/ood_core-0.5.1/lib/ood_core/job/adapters/slurm.rb:429:in `duration_in_seconds'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/ood_core-0.5.1/lib/ood_core/job/adapters/slurm.rb:484:in `parse_job_info'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/ood_core-0.5.1/lib/ood_core/job/adapters/slurm.rb:334:in `block in info'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/ood_core-0.5.1/lib/ood_core/job/adapters/slurm.rb:333:in `map'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/ood_core-0.5.1/lib/ood_core/job/adapters/slurm.rb:333:in `info'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:270:in `update_info'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:264:in `info'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:258:in `status'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:316:in `completed?'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:88:in `block in all'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:87:in `map'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:87:in `all'
/var/www/ood/apps/sys/dashboard/app/controllers/batch_connect/sessions_controller.rb:7:in `index'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal/implicit_render.rb:4:in `send_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/abstract_controller/base.rb:198:in `process_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal/rendering.rb:10:in `process_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/abstract_controller/callbacks.rb:20:in `block in process_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:117:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:555:in `block (2 levels) in compile'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:505:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:92:in `__run_callbacks__'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:778:in `_run_process_action_callbacks'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:81:in `run_callbacks'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/abstract_controller/callbacks.rb:19:in `process_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal/rescue.rb:29:in `process_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal/instrumentation.rb:32:in `block in process_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/notifications.rb:164:in `block in instrument'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/notifications.rb:164:in `instrument'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal/instrumentation.rb:30:in `process_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal/params_wrapper.rb:250:in `process_action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/abstract_controller/base.rb:137:in `process'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionview-4.2.10/lib/action_view/rendering.rb:30:in `process'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal.rb:196:in `dispatch'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal/rack_delegation.rb:13:in `dispatch'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_controller/metal.rb:237:in `block in action'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/routing/route_set.rb:74:in `dispatch'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/routing/route_set.rb:43:in `serve'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/journey/router.rb:43:in `block in serve'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/journey/router.rb:30:in `each'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/journey/router.rb:30:in `serve'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/routing/route_set.rb:817:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/rack-1.6.10/lib/rack/etag.rb:24:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/rack-1.6.10/lib/rack/conditionalget.rb:25:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/rack-1.6.10/lib/rack/head.rb:13:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/params_parser.rb:27:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/flash.rb:260:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/rack-1.6.10/lib/rack/session/abstract/id.rb:225:in `context'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/rack-1.6.10/lib/rack/session/abstract/id.rb:220:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/cookies.rb:560:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/callbacks.rb:29:in `block in call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:88:in `__run_callbacks__'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:778:in `_run_call_callbacks'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/callbacks.rb:81:in `run_callbacks'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/callbacks.rb:27:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/remote_ip.rb:78:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/debug_exceptions.rb:17:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/show_exceptions.rb:30:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/lograge-0.10.0/lib/lograge/rails_ext/rack/logger.rb:15:in `call_app'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/railties-4.2.10/lib/rails/rack/logger.rb:20:in `block in call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/tagged_logging.rb:68:in `block in tagged'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/tagged_logging.rb:26:in `tagged'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/tagged_logging.rb:68:in `tagged'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/railties-4.2.10/lib/rails/rack/logger.rb:20:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/request_store-1.4.1/lib/request_store/middleware.rb:19:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/request_id.rb:21:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/rack-1.6.10/lib/rack/methodoverride.rb:22:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/rack-1.6.10/lib/rack/runtime.rb:18:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.10/lib/active_support/cache/strategy/local_cache_middleware.rb:28:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.10/lib/action_dispatch/middleware/static.rb:120:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/rack-1.6.10/lib/rack/sendfile.rb:113:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/railties-4.2.10/lib/rails/engine.rb:518:in `call'
/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/railties-4.2.10/lib/rails/application.rb:165:in `call'
/opt/rh/rh-passenger40/root/usr/share/passenger/phusion_passenger/rack/thread_handler_extension.rb:74:in `process_request'
/opt/rh/rh-passenger40/root/usr/share/passenger/phusion_passenger/request_handler/thread_handler.rb:141:in `accept_and_process_next_request'
/opt/rh/rh-passenger40/root/usr/share/passenger/phusion_passenger/request_handler/thread_handler.rb:109:in `main_loop'
/opt/rh/rh-passenger40/root/usr/share/passenger/phusion_passenger/request_handler.rb:455:in `block (3 levels) in start_threads'

ericfranz commented 5 years ago

Slurm 17 had a gres column in the output, whereas Slurm 18 does not.
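To illustrate how a missing column can surface as the `split` error in the trace above, here is a minimal sketch. It is not the actual ood_core code: the field list, the separator, and both helper bodies are assumptions made up for this example, and only the method name `duration_in_seconds` is taken from the stack trace.

```ruby
# Hypothetical field list and separator; the real adapter's list is longer
# and may be ordered differently.
FIELDS = %i[job_id gres time_limit time_used].freeze
SEP = "\x1F"

# Pair each expected field name with the value found at the same position.
# If the line has fewer values than FIELDS, the trailing fields become nil.
def parse_line(line, fields = FIELDS)
  fields.zip(line.split(SEP, -1)).to_h
end

# Convert "D-HH:MM:SS", "HH:MM:SS" or "MM:SS" to seconds.
# Special values such as "UNLIMITED" are deliberately not handled here.
def duration_in_seconds(time)
  head, tail = time.split("-", 2) # raises NoMethodError when time is nil
  days, clock = tail ? [head.to_i, tail] : [0, head]
  parts = clock.split(":").map(&:to_i)
  parts.unshift(0) while parts.size < 3
  hours, minutes, seconds = parts
  days * 86_400 + hours * 3_600 + minutes * 60 + seconds
end

slurm17_line = %w[123 gpu:1 1:00:00 0:10].join(SEP)
slurm18_line = %w[123 1:00:00 0:10].join(SEP) # one column fewer

shifted = parse_line(slurm18_line)
# Every value has moved one slot to the left, so the last field is nil.
begin
  duration_in_seconds(shifted[:time_used])
rescue NoMethodError => e
  puts "parse failed: #{e.message}"
end
```

With the shorter line, the time limit ends up under `:gres` and `:time_used` is nil, which is the kind of positional mismatch that produces a nil receiver further down the call chain.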

A fix is in progress at https://github.com/OSC/ood_core/pull/105. Right now the approach is to specify the Slurm version in the cluster config and, if it is >= 18, remove gres from the fields we parse. I wonder whether there is a more robust way to parse the output, so that we can handle another change in Slurm 19, or whether this is just something we will have to deal with. The 1.4 release will include this fix, and I think we are now targeting the end of next week or the week after.
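The version-gated approach described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the pull request: `BASE_FIELDS`, `fields_for`, `parse_line`, `FieldCountError`, and the separator are all hypothetical names chosen for the example. The field-count check is one possible answer to the robustness question, since it turns a silent positional shift into an explicit error.

```ruby
# Hypothetical field list; the real adapter requests many more fields.
BASE_FIELDS = %i[job_id gres time_limit time_used].freeze

# Raised when a line does not have the number of columns we expect.
class FieldCountError < StandardError; end

# Pick the expected fields from a version string such as "18.08.3".
# A missing version falls back to the pre-18 field list.
def fields_for(version)
  major = version.to_s.split(".").first.to_i
  major >= 18 ? BASE_FIELDS - [:gres] : BASE_FIELDS
end

# Parse one delimited line, failing loudly if the column count is off
# instead of letting values shift into the wrong fields.
def parse_line(line, version:, sep: "\x1F")
  fields = fields_for(version)
  values = line.split(sep, -1)
  unless values.size == fields.size
    raise FieldCountError,
          "expected #{fields.size} fields for Slurm #{version.inspect}, " \
          "got #{values.size}"
  end
  fields.zip(values).to_h
end

line = %w[123 1:00:00 0:10].join("\x1F")
p parse_line(line, version: "18.08.3")
```

A count check like this does not make the parser adapt to future format changes on its own, but it would report a mismatch directly rather than as a nil error deep inside duration parsing.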

If you need the fix earlier there are two options:

  1. If you are using the /latest/ branch of RPMs for OnDemand, we will be releasing a new one on Friday or early next week. Note that it will also include an upgrade to Passenger and NGINX, so it may require an extra installation step.
  2. If you need a fix now, I can provide a small monkey patch that you can add to your /etc/ood/config/ directory in three places (for the dashboard, active jobs, and job composer) to make those work again. Let me know.
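For readers unfamiliar with the term, a monkey patch in Ruby reopens or overrides a method of an already loaded class at start-up. The sketch below shows the general technique with `Module#prepend` on a stand-in class. It is not the patch offered in this thread: `LegacyParser`, `DropGres`, the field list, and the `|` separator are hypothetical, and the thread does not say which files under /etc/ood/config/ the real patch uses.

```ruby
# Stand-in for a library class whose parser assumes a fixed column list.
class LegacyParser
  def fields
    %i[job_id gres time_limit time_used]
  end

  def parse(line)
    fields.zip(line.split("|", -1)).to_h
  end
end

# Override module: removes the column that the newer output no longer has.
module DropGres
  def fields
    super - [:gres]
  end
end

# Prepending puts DropGres ahead of LegacyParser in the method lookup chain,
# so its #fields runs first and can still call the original via super.
LegacyParser.prepend(DropGres)

p LegacyParser.new.parse("123|1:00:00|0:10")
```

Using `prepend` rather than redefining the method keeps the original implementation reachable through `super`, which makes the override easy to remove once a proper fix ships.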

kcgthb commented 5 years ago

Hi Eric, Thanks a lot for the info! A quick patch would be awesome to restore functionality before the next RPM release.

Thanks!

MorganRodgers commented 5 years ago

@kcgthb attached is a script that will patch this issue. As Eric noted, our next release of OOD will contain a proper fix.

To patch, run: sudo bash patch_slurm18.sh.txt patch

To remove the patch, run: sudo bash patch_slurm18.sh.txt unpatch

patch_slurm18.sh.txt

kcgthb commented 5 years ago

@MorganRodgers Thanks a lot, that works perfectly!

MorganRodgers commented 5 years ago

@kcgthb FYI I just discovered that the source of this problem appears to be a bug in Slurm: https://github.com/OSC/ood_core/pull/105#issuecomment-442975956

kcgthb commented 5 years ago

Right, we noticed that yesterday too, and a colleague of mine reported it at https://bugs.schedmd.com/show_bug.cgi?id=6120

MorganRodgers commented 5 years ago

Look at that. I missed that when searching for issues. If you all are into patching Slurm, @treydock just posted a patch that fixes this problem: https://bugs.schedmd.com/show_bug.cgi?id=6141.