SCALE-MS / scale-ms

SCALE-MS design and development
GNU Lesser General Public License v2.1
4 stars 4 forks source link

shutdown failure mode needs refinement #340

Open eirrgang opened 1 year ago

eirrgang commented 1 year ago

We have a report that scalems may hang at shut down when certain sorts of errors occur. @eirrgang will investigate the following situation.

2023-04-26 23:10:03,830-scalems.radical.runtime:1520-DEBUG - Received cancellation in watcher task for <Task object, uid rp-basic-ensemble-0>
2023-04-26 23:10:03,847-scalems.radical.runtime:1522-DEBUG - Propagating cancellation to <Task object, uid rp-basic-ensemble-0>.
2023-04-26 23:10:04,046-asyncio:1900-WARNING - Executing <Task cancelled name='Task-29' coro=<_rp_task_watcher() done, defined at /ocean/projects/cts160011p/wehs7661/scale-ms/src/scalems/radical/runtime.py:1428> created at /jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/tasks.py:361> took 0.216 seconds
2023-04-26 23:10:04,046-scalems.radical.runtime:1520-DEBUG - Received cancellation in watcher task for <Task object, uid rp-basic-ensemble-1>
2023-04-26 23:10:04,046-scalems.radical.runtime:1522-DEBUG - Propagating cancellation to <Task object, uid rp-basic-ensemble-1>.
2023-04-26 23:10:04,179-asyncio:1900-WARNING - Executing <Task cancelled name='Task-32' coro=<_rp_task_watcher() done, defined at /ocean/projects/cts160011p/wehs7661/scale-ms/src/scalems/radical/runtime.py:1428> created at /jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/tasks.py:361> took 0.132 seconds
2023-04-26 23:10:04,181-scalems.workflow:1043-DEBUG - Stopping the SCALEMS RP dispatching queue runner.
2023-04-26 23:10:04,183-scalems.execution:390-DEBUG - Stopping workflow execution.
2023-04-26 23:10:04,183-scalems.execution:414-ERROR - Command queue never emptied.
2023-04-26 23:10:04,183-scalems.execution:417-DEBUG - No tasks to wait for. Continuing to shut down.
2023-04-26 23:10:04,183-scalems.execution:428-DEBUG - RPDispatchingExecutor context manager received cancellation while exiting.
2023-04-26 23:10:04,183-scalems.radical.runtime:1257-DEBUG - Final state: None
2023-04-26 23:10:04,184-scalems.radical.runtime:1258-INFO - Master scheduling task state FAILED: <Task object, uid scalems-rp-master.6a872ebe-e4a7-11ed-91e7-6805cabe850e>.
2023-04-26 23:10:04,184-scalems.radical.runtime:1266-DEBUG - [... CONTENT SHORTENED ...]
BUG - Completed master task.
Traceback (most recent call last):
  File "/jet/home/wehs7661/rp-venv/bin/scalems_rp_master", line 8, in <module>
    sys.exit(master())
  File "/ocean/projects/cts160011p/wehs7661/scale-ms/src/scalems/radical/raptor/__init__.py", line 885, in master
    _master.submit_workers(descriptions=configs)
TypeError: submit_workers() got an unexpected keyword argument 'descriptions'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18979,1],0]
  Exit code:    1
--------------------------------------------------------------------------

2023-04-26 23:10:04,184-scalems.radical.runtime:1275-DEBUG - Closing Session rp.session.br012.ib.bridges2.psc.edu.wehs7661.019474.0004.
closing session rp.session.br012.ib.bridges2.psc.edu.wehs7661.019474.0004      \
close task manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
        O/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/-\|/2023-04-26 23:10:11,146-asyncio:670-DEBUG - Close <_UnixSelectorEventLoop running=False closed=False debug=True>
2023-04-26 23:10:11,146-scalems.workflow:1246-DEBUG - Leaving scope of <scalems.workflow.WorkflowManager object at 0x7f61ae616760>
Task done: <Task cancelled name='rp-basic-ensemble-0' coro=<MDRun.result() done, defined at /ocean/projects/cts160011p/wehs7661/workshop/examples/basic_ensemble/rp_basic_ensemble.py:212> created at /jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/tasks.py:361>.
Task done: <Task cancelled name='rp-basic-ensemble-1' coro=<MDRun.result() done, defined at /ocean/projects/cts160011p/wehs7661/workshop/examples/basic_ensemble/rp_basic_ensemble.py:212> created at /jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/tasks.py:361>.
Traceback (most recent call last):
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/base_events.py", line 1869, in _run_once
    event_list = self._selector.select(timeout)
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ocean/projects/cts160011p/wehs7661/scale-ms/src/scalems/workflow/__init__.py", line 1242, in scope
    yield manager
  File "/ocean/projects/cts160011p/wehs7661/workshop/examples/basic_ensemble/rp_basic_ensemble.py", line 319, in <module>
    md_outputs = asyncio.run(
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/runners.py", line 47, in run
    _cancel_all_tasks(loop)
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/runners.py", line 63, in _cancel_all_tasks
    loop.run_until_complete(
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/asyncio/base_events.py", line 1869, in _run_once
    event_list = self._selector.select(timeout)
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ocean/projects/cts160011p/wehs7661/workshop/examples/basic_ensemble/rp_basic_ensemble.py", line 319, in <module>
    md_outputs = asyncio.run(
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/ocean/projects/cts160011p/wehs7661/scale-ms/src/scalems/workflow/__init__.py", line 1249, in scope
    _close()
  File "/ocean/projects/cts160011p/wehs7661/scale-ms/src/scalems/workflow/__init__.py", line 448, in close
    raise APIError("Cannot close a WorkflowManager that is actively dispatching work.")
scalems.exceptions.APIError: Cannot close a WorkflowManager that is actively dispatching work.
-\Exception ignored in: <module 'threading' from '/jet/home/wehs7661/.local/python3.9/lib/python3.9/threading.py'>
Traceback (most recent call last):
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/threading.py", line 1440, in _shutdown
    atexit_call()
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/threading.py", line 1053, in join
    self._wait_for_tstate_lock()
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/threading.py", line 1073, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt: 
|/Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/jet/home/wehs7661/rp-venv/lib/python3.9/site-packages/pymongo/periodic_executor.py", line 99, in join
    self._thread.join(timeout)
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/threading.py", line 1057, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/jet/home/wehs7661/.local/python3.9/lib/python3.9/threading.py", line 1073, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt
eirrgang commented 1 year ago

In this case there were multiple problems, including mismatched package versions.

At some point we should go through and make sure that we don't have any lock.aquire or wait() sorts of calls that can be allowed to block indefinitely.

In particular, during a shutdown or once a fatal error has occurred, we should consider how to guarantee that we shut down in a finite amount of time.