SCALE-MS / scale-ms

SCALE-MS design and development
GNU Lesser General Public License v2.1

Run time error handling scheme #92

Open eirrgang opened 3 years ago

eirrgang commented 3 years ago

How do we propagate error conditions from the runtime to the client for processing? (e.g. dependency resolution, transient vs. permanent failures, complete (non-truncated) error output, etc.)

From discussion on Slack, we seem to like the idea of submitting (at least) two RP tasks for each workflow item:

Examples

Below are some common scenarios that need a clear strategy for detection and debugging.

Master task

The "raptor" scope is created by submitting a rp.Task that runs a radical.pilot.raptor.Master derivative. The uid is used as the scheduler value for Tasks containing a payload that should be passed to Master.request_cb(). Master.result_cb() will be called for each raptor.Request that completes.

If no errors occur while https://github.com/SCALE-MS/scale-ms/blob/master/src/scalems/radical/scalems_rp_master.py is executing the raptor protocol (master.submit(...); master.start(); master.join(); master.stop()), then the master task will end normally. At least pending resolution of #88, we need to check whether the Master is running as expected.

We are not currently checking that the Pilot has started, reached a particular state, and has not errored out. We assume we will detect such failures during pilot_manager.submit_pilots, task_manager.add_pilots, or task_manager.submit_tasks.

We can wait for the Master task to reach [rp.states.AGENT_EXECUTING] + rp.FINAL before submitting raptor tasks, but the master task could reach rp.states.AGENT_EXECUTING and then immediately fail due to a bug in the script or an invalid execution environment.
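For illustration, and assuming the tmgr and master_task handles from the sketch above, that wait might look like the following (rp.AGENT_EXECUTING and rp.FINAL are RP state constants; the exact return shape of wait_tasks may vary by RP release):

```python
# Block until the master task is either executing or already finished.
state = tmgr.wait_tasks(uids=[master_task.uid],
                        state=[rp.AGENT_EXECUTING] + rp.FINAL)[0]

if state in rp.FINAL:
    # The master ended (possibly failed) before any raptor tasks were
    # submitted; inspect its stdout/stderr and staged artifacts instead
    # of submitting work to it.
    raise RuntimeError(f'master task ended early in state {state}')
```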

We should attach a callback to the master (scheduler) Task and monitor an Event that is set when the master task transitions to a state in rp.FINAL. If the Task ever made it to rp.states.AGENT_EXECUTING, we can deduce the progress of the master script by examining artifacts we have told it to produce. (If it does not make it as far as EXECUTING, I'm not sure there is anything programmatic we can do to determine what went wrong. Thoughts?)

We should check for (or be ready to handle) such an Event at least as often as we update local or remote workflow state (such as when submitting or checking on Tasks), so that we can start cleaning up a failed dispatching session as early and as cleanly as possible.
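A hedged sketch of that callback-plus-Event pattern (assuming a TaskManager and master task uid as in the sketches above; TaskManager.register_callback delivers (task, state) pairs for task-state updates):

```python
import threading

import radical.pilot as rp


def watch_master(tmgr: rp.TaskManager, master_uid: str) -> threading.Event:
    """Return an Event that is set once the master task reaches a final state."""
    master_final = threading.Event()

    def _cb(task, state):
        if task.uid == master_uid and state in rp.FINAL:
            master_final.set()

    tmgr.register_callback(_cb)
    return master_final


# Check the Event at least as often as we touch local or remote workflow
# state, so a failed dispatching session can be cleaned up promptly.
master_final = watch_master(tmgr, master_task.uid)
if master_final.is_set():
    ...  # begin orderly shutdown of the dispatching session
```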

We can prepopulate the master script with a series of tests that determine the compatibility of the remote RCT stack and the scalems installation. During workload processing, we can perform additional checks on the targeted venv or handle failed Workers.

In the near term, we can allow all Worker failures to lead to termination of the Master so that we can stage out information about the error.

stage_on_error has been updated to stage output all the way back to the client, but only in the project/scalems branch and the scalems/stable tag.
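For reference, a minimal sketch of how a TaskDescription might opt in to staging on failure (stage_on_error is an RP TaskDescription flag; the executable and log file name here are only illustrative):

```python
td = rp.TaskDescription({
    'executable'    : 'my_workload',          # hypothetical executable
    'stage_on_error': True,                   # stage output even if the task fails
    'output_staging': [{'source': 'task:///workload.log',   # hypothetical file
                        'target': 'client:///workload.log',
                        'action': rp.TRANSFER}],
})
```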

In the medium term, we will need to allow additional communication with the client during execution in order to:

  * make optimizations based on opportunistic data locality or Worker reusability,
  * allow the Master to asynchronously stage checkpoint data or early results,
  * allow the Master to re-provision Workers in collaboration with client-side workload shaping, or
  * more generally, allow for communication patterns that do not fit the Request scheme.

With such facilities, we will also be able to handle problems like import errors or unexpectedly missing data, or otherwise continue to execute as many Tasks as the data flow topology allows, without exiting the Master on each occasion. (For instance, we could conceivably provision a new venv while the master is running, or re-transfer data that appears corrupted, in order to resubmit Tasks that failed due to missing packages, or, more generally, handle adaptivity that treats a workflow element failure as a non-fatal part of ordinary workflow execution.)

Worker Task

I don't think Worker lifetime management is complete at this time. The related RADICAL effort will likely be traceable through https://github.com/radical-cybertools/radical.pilot/issues/2643

andre-merzky commented 3 years ago

I am not sure I understand this issue. What would two tasks solve which could not be addressed by a single task?

IIRC, this was about being able to stage output on failed tasks - that is an issue which will be addressed in RP (ETA: 1st week of May 2021). What other issues do we need to address?

eirrgang commented 3 years ago

I believe the two-tasks option is obviated by the stage_on_error option, at least for traditional executables.

We still need to figure out how to package and handle exceptions from Python callables and the surrounding framework.

eirrgang commented 3 years ago

@andre-merzky can you comment on this?

We should attach a callback to the master (scheduler) Task and monitor an Event that is set when the master task transitions to a state in rp.FINAL. If the Task ever made it to rp.states.AGENT_EXECUTING, we can deduce the progress of the master script by examining artifacts we have told it to produce. (If it does not make it as far as EXECUTING, I'm not sure there is anything programmatic we can do to determine what went wrong. Thoughts?)

eirrgang commented 1 year ago

This issue will be more tractable after some work planned by @andre-merzky for later this month. At this time, we do not have an effective way to manage the end of the Worker or Master life cycles (either normal or error circumstances).

eirrgang commented 1 year ago

scenario: worker fails while handling a task

Per discussion between @andre-merzky and @eirrgang

The Master performs custom dispatching when it receives a Master.worker_state_cb(worker_dict, state) call with a state of FAILED (or CANCELED, if unexpected); a sketch of such a handler follows the steps and note below:

  1. Get the worker_dict["task_sandbox_path"] (a local filesystem path that does not need URL processing) in order to look for custom bookkeeping information.
  2. Compare the Worker metadata with the Master's metadata to identify Tasks which have been popped off of the scheduling queue, but for which we don't ever expect to see a Master.result_cb().
  3. advance() those tasks to FAILED: Master.advance(tasks, rps.FAILED, publish=True, push=False)
  4. Reconcile workflow metadata.
  5. Apply custom logic regarding launching new Workers or resubmitting Tasks, etc.

Note: Be careful to avoid or handle race conditions in Task completion.
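A hedged sketch of such a handler, following the steps above. The bookkeeping file name and the _in_flight map are hypothetical; Master.worker_state_cb and Master.advance follow the signatures quoted in this thread and may differ in current RP releases.

```python
import json
import os
import threading

from radical.pilot import raptor
from radical.pilot import states as rps


class RecoveringMaster(raptor.Master):
    """Hypothetical Master derivative that fails over tasks from dead Workers."""

    def __init__(self, cfg=None):
        super().__init__(cfg=cfg)
        # Guard against races between result_cb() and worker_state_cb().
        self._lock = threading.Lock()
        # uid -> task dict; populated when work is dispatched to a Worker,
        # removed again when result_cb() reports completion.
        self._in_flight = {}

    def worker_state_cb(self, worker_dict, state):
        if state not in (rps.FAILED, rps.CANCELED):
            return

        # 1. Look in the dead Worker's sandbox for custom bookkeeping
        #    (hypothetical file written by the Worker as it claims tasks).
        sandbox = worker_dict['task_sandbox_path']
        claimed = set()
        bookkeeping = os.path.join(sandbox, 'claimed_tasks.json')
        if os.path.exists(bookkeeping):
            with open(bookkeeping) as fh:
                claimed = set(json.load(fh))

        with self._lock:
            # 2. Tasks the Worker claimed but never reported via result_cb().
            orphans = [t for uid, t in self._in_flight.items() if uid in claimed]

            # 3. Advance them to FAILED so the failure is published upstream.
            if orphans:
                self.advance(orphans, rps.FAILED, publish=True, push=False)

            # 4. Reconcile local workflow metadata.
            for task in orphans:
                del self._in_flight[task['uid']]

        # 5. Custom policy: launch a replacement Worker, resubmit Tasks, etc.
```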