eirrgang opened 3 years ago
I am not sure I understand this issue. What would two tasks solve which could not be addressed by a single task?
IIRC, this was about being able to stage output on failed tasks - that is an issue which will be addressed in RP (ETA: 1st week of May 2021). What other issues do we need to address?
I believe the two-tasks option is obviated by the `stage_on_error` option, at least for traditional executables.
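For traditional executables, that option might be used along the lines of the following sketch. The field names (`stage_on_error`, `output_staging`) and the staging-directive shape are assumptions based on my reading of the RP `TaskDescription` interface, not a verified API:

```python
# Hypothetical sketch: a task description that requests output staging
# even when the task fails. Field names are assumptions, not verified
# against the radical.pilot sources.
task_description = {
    "executable": "/usr/bin/my_simulation",  # hypothetical payload
    "stage_on_error": True,                  # stage outputs on failure, too
    "output_staging": [
        {
            "source": "task:///sim.log",     # file in the remote task sandbox
            "target": "client:///sim.log",   # client-side destination
            "action": "Transfer",
        }
    ],
}

# With stage_on_error set, the log should reach the client even if the
# executable exits non-zero, so the failure can be diagnosed locally.
assert task_description["stage_on_error"] is True
```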
We still need to figure out how to package and handle exceptions from Python callables and the surrounding framework.
@andre-merzky can you comment on this?
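On packaging exceptions from Python callables: one generic approach (a sketch of the general technique, not the framework's actual mechanism, and `run_packaged` is a hypothetical helper name) is to capture the traceback into a plain, serializable structure that can be staged out or returned through a result callback:

```python
import traceback

def run_packaged(fn, *args, **kwargs):
    """Run a callable, packaging any exception as serializable data.

    Hypothetical helper: the real framework hook and result schema are
    still to be decided.
    """
    try:
        return {"success": True, "value": fn(*args, **kwargs), "error": None}
    except Exception:
        # Capture the full (non-truncated) traceback as text so it can
        # cross process and host boundaries without pickling issues.
        return {"success": False, "value": None, "error": traceback.format_exc()}

ok = run_packaged(divmod, 7, 2)
bad = run_packaged(divmod, 7, 0)
assert ok == {"success": True, "value": (3, 1), "error": None}
assert not bad["success"] and "ZeroDivisionError" in bad["error"]
```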
We should attach a callback to the master (scheduler) Task and monitor an Event that is updated when the master task transitions to `rp.FINAL`. If the Task ever made it to `rp.states.AGENT_EXECUTING`, we can deduce the progress of the master script by examining artifacts we have told it to produce. (If it does not make it as far as EXECUTING, I'm not sure there is anything programmatic we can do to determine what went wrong. Thoughts?)
This issue will be more tractable after some work planned by @andre-merzky for later this month. At this time, we do not have an effective way to manage the end of the Worker or Master life cycles (either normal or error circumstances).
Per discussion between @andre-merzky and @eirrgang:

- The Master performs custom dispatching when it receives a `Master.worker_state_cb(worker_dict, state)` call with a state of `FAILED` (or `CANCELED`, if unexpected): it can inspect `worker_dict["task_sandbox_path"]` (a local filesystem path that does not need URL processing) in order to look for custom bookkeeping information.
- `Master.result_cb()` should `.advance()` those tasks to `FAILED`: `Master.advance(tasks, rps.FAILED, publish=True, push=False)`
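The dispatching described above might be sketched as follows, using a plain-Python stand-in for `raptor.Master` (the method names mirror this discussion, and the bookkeeping filename is hypothetical; real signatures should be checked against the radical.pilot sources):

```python
import os

FAILED, CANCELED = "FAILED", "CANCELED"  # stand-ins for rp state constants

class SketchMaster:
    """Stand-in for a radical.pilot.raptor.Master derivative."""

    def __init__(self):
        self.tasks_by_worker = {}  # worker uid -> list of task dicts

    def advance(self, tasks, state, publish=True, push=False):
        # Stand-in for Master.advance(tasks, rps.FAILED, publish=True, push=False)
        for task in tasks:
            task["state"] = state

    def worker_state_cb(self, worker_dict, state):
        if state in (FAILED, CANCELED):
            # The sandbox is a local filesystem path (no URL processing
            # needed); look there for custom bookkeeping information.
            sandbox = worker_dict["task_sandbox_path"]
            bookkeeping = os.path.join(sandbox, "bookkeeping.json")  # hypothetical filename
            _have_details = os.path.exists(bookkeeping)
            # Mark the tasks dispatched to this worker as failed.
            tasks = self.tasks_by_worker.get(worker_dict["uid"], [])
            self.advance(tasks, FAILED, publish=True, push=False)

master = SketchMaster()
master.tasks_by_worker["worker.0000"] = [{"uid": "task.0000", "state": "AGENT_EXECUTING"}]
master.worker_state_cb({"uid": "worker.0000", "task_sandbox_path": "/tmp"}, FAILED)
assert master.tasks_by_worker["worker.0000"][0]["state"] == FAILED
```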
How do we propagate error conditions from the runtime to the client for processing? (e.g. dependency resolution, transient vs. permanent failures, complete (non-truncated) error output, etc.)
From discussion on Slack, we seem to like the idea of submitting (at least) two RP tasks for each workflow item.

Examples

Some common scenarios that need a clear strategy for detecting and debugging:
Master task
The "raptor" scope is created by submitting a `rp.Task` that runs a `radical.pilot.raptor.Master` derivative. The uid is used as the `scheduler` value for Tasks containing a payload that should be passed to `Master.request_cb()`. `Master.result_cb()` will be called for each `raptor.Request` that completes.

If no errors occur while https://github.com/SCALE-MS/scale-ms/blob/master/src/scalems/radical/scalems_rp_master.py is executing the raptor protocol (`master.submit(...); master.start(); master.join(); master.stop()`), then the master task will end normally. (Pending resolution of #88, at least) we need to check whether the Master is running as expected.

We are not currently checking that the Pilot has started, reached a particular state, and has not errored out. We assume we will detect such failures during `pilot_manager.submit_pilots`, `task_manager.add_pilots`, or `task_manager.submit_tasks`.

We can wait for the Master task to reach `[rp.states.AGENT_EXECUTING] + rp.FINAL` before submitting raptor tasks, but the master task could reach `rp.states.AGENT_EXECUTING` and then immediately fail due to a bug in the script or an invalid execution environment.

We should attach a callback to the master (scheduler) Task and monitor an Event that is updated when the master task transitions to `rp.FINAL`. If the Task ever made it to `rp.states.AGENT_EXECUTING`, we can deduce the progress of the master script by examining artifacts we have told it to produce. (If it does not make it as far as EXECUTING, I'm not sure there is anything programmatic we can do to determine what went wrong. Thoughts?)

We should check for (or be ready to handle) an Event at least as often as we update local or remote workflow state (such as submitting or checking on Tasks), so that we can start cleaning up a failed dispatching session as early and cleanly as possible.
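The callback-plus-Event monitoring could look like the following sketch, using `threading.Event` and string state names as stand-ins for the RP Task callback API (the real callback registration and state constants live in radical.pilot):

```python
import threading

AGENT_EXECUTING = "AGENT_EXECUTING"
FINAL = {"DONE", "FAILED", "CANCELED"}  # stand-in for rp.FINAL

master_final = threading.Event()
reached_executing = threading.Event()

def master_state_cb(task, state):
    """Callback to register on the master (scheduler) Task."""
    if state == AGENT_EXECUTING:
        reached_executing.set()
    if state in FINAL:
        # Wake any logic that should start cleaning up the dispatching session.
        master_final.set()

# Simulated state transitions for a master task that starts, then fails:
for state in ("AGENT_STAGING_INPUT", AGENT_EXECUTING, "FAILED"):
    master_state_cb(None, state)

# The workflow loop can now poll cheaply between bookkeeping steps:
assert master_final.is_set()
# Because the task reached AGENT_EXECUTING, staged artifacts can be
# examined to deduce how far the master script progressed before failing.
assert reached_executing.is_set()
```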
We can prepopulate the master script with a series of tests to perform to determine compatibility of the remote RCT stack and `scalems` installation. During workload processing, we can perform additional checks on the targeted venv or handle failed Workers.

In the near term, we can allow all Worker failures to lead to termination of the Master so that we can stage out information about the error. `stage_on_error` has been updated to stage all the way to the client, but only in the `project/scalems` branch and `scalems/stable` tag.

In the medium term, we will need to allow additional communication to the client during execution: to make optimizations based on opportunistic data locality or Worker reusability, to allow the Master to asynchronously stage checkpoint data or early results, to allow the Master to re-provision Workers in collaboration with client-side workload shaping, or, generally, to allow for communication patterns that do not fit the Request scheme. With such facilities, we will also be able to handle problems like import errors or unexpectedly missing data, or otherwise continue to execute as many Tasks as the data flow topology allows, without exiting the Master on each occasion. (For instance, we could conceivably provision a new venv while the master is running, or re-transfer data that appears corrupted in order to resubmit Tasks that failed due to missing packages, or generally handle adaptivity that allows a workflow element failure as a non-fatal part of ordinary workflow execution.)
Worker Task
(`result_cb`.) Depends in part on #165.

The Worker lifetime management is not complete at this time, I don't think. RADICAL efforts will likely be traceable through https://github.com/radical-cybertools/radical.pilot/issues/2643