flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

fatal job exception raised on pending jobs when reloading Fluxion modules #1215

Open grondo opened 4 months ago

grondo commented 4 months ago

While reloading the Fluxion modules on elcap, several pending jobs were canceled with a fatal job exception such as:

[Jun04 14:42] exception type="alloc" severity=0 note="alloc denied due to type=\"match error\"" userid=765
[  +0.000608] clean

For reference, here are the logs from the time of the module reload:

[Jun04 14:42] broker[0]: rmmod sched-fluxion-resource
[ +14.008927] sched-fluxion-resource[0]: responding to post-shutdown sched-fluxion-resource.cancel
[ +14.009019] broker[0]: module sched-fluxion-resource exited
[ +14.012128] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.014486] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.015532] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.045507] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.087013] broker[0]: rmmod resource
[ +14.087290] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.103970] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.104489] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.104968] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.105501] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.105973] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.106463] sched-fluxion-qmanager[0]: check_watcher_cb: run_sched_loop: Function not implemented
[ +14.122417] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122435] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122442] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122447] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122451] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122456] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122461] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
[ +14.122465] sched-fluxion-qmanager[0]: responding to post-shutdown sched.disconnect
[ +14.122469] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122474] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122479] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122483] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122488] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.ping
[ +14.122492] sched-fluxion-qmanager[0]: responding to post-shutdown sched-fluxion-qmanager.disconnect
[ +14.122496] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122500] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122505] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122510] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122514] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122518] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122529] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122534] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122538] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122543] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122546] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122550] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122554] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122558] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122563] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122580] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122585] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122590] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122594] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122599] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122603] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122608] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122612] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122635] sched-fluxion-qmanager[0]: responding to post-shutdown sched.cancel
[ +14.122639] sched-fluxion-qmanager[0]: responding to post-shutdown sched.cancel
[ +14.122642] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122648] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.122652] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free
[ +14.139690] broker[0]: module sched-fluxion-qmanager exited
[ +14.139745] job-manager[0]: alloc: stop due to disconnect: Success
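
A log like the one above is easier to scan once the repeated "responding to post-shutdown" lines are grouped by module and topic. The following is a minimal sketch for doing that, assuming the log has been saved as plain text in the format shown here. The names `tally_post_shutdown` and `POST_SHUTDOWN` are hypothetical helpers for illustration, not part of Flux or Fluxion.

```python
import re
from collections import Counter

# Hypothetical pattern for lines in the format shown in this issue, e.g.:
# [ +14.122417] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping
POST_SHUTDOWN = re.compile(
    r"\]\s+(?P<module>[\w-]+)\[\d+\]: responding to post-shutdown (?P<topic>\S+)"
)


def tally_post_shutdown(lines):
    """Count post-shutdown responses per (module, topic) pair."""
    counts = Counter()
    for line in lines:
        match = POST_SHUTDOWN.search(line)
        if match:
            counts[(match.group("module"), match.group("topic"))] += 1
    return counts


# A few lines copied from the log above, used only as sample input.
sample = [
    "[ +14.008927] sched-fluxion-resource[0]: responding to post-shutdown sched-fluxion-resource.cancel",
    "[ +14.122417] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping",
    "[ +14.122435] sched-fluxion-qmanager[0]: responding to post-shutdown sched.ping",
    "[ +14.122496] sched-fluxion-qmanager[0]: responding to post-shutdown sched.free",
    "[ +14.139690] broker[0]: module sched-fluxion-qmanager exited",
]

# Print one summary line per (module, topic) pair.
for (module, topic), count in sorted(tally_post_shutdown(sample).items()):
    print(f"{module} {topic} {count}")
```

Running the same function over the full log would show how many sched.free, sched.cancel, and sched.ping requests were still queued when sched-fluxion-qmanager shut down.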

grondo commented 4 months ago

Note that in this particular case, we had to kill the flux module remove sched-fluxion-qmanager command, which was hanging due to the leaked alloc requests issue (I can't find that issue right now; feel free to link it here if you find it).