SCALE-MS / scale-ms

SCALE-MS design and development
GNU Lesser General Public License v2.1

scalems.call on raptor backend #326

Closed eirrgang closed 1 year ago

eirrgang commented 1 year ago

This issue tracks the scalems package aspect of an issue in the workshop repository.

eirrgang commented 1 year ago

Worker._dispatch_proc() is broken in RP 1.21.0, so a lateral move is not possible using the TASK_PROC mode. I will discuss options with @andre-merzky and @mturilli today.

update

  1. the issue is resolved
  2. we won't be using TASK_PROC

eirrgang commented 1 year ago

Additional constraints on resource allocation

Pending further discussion (#302), we leave it as an exercise to the user to provision a Pilot that is adequate for the tasks to be submitted. Dispatching through raptor has a slight additional burden and warrants some updates to the scalems raptor lifetime management.

By the time the scalems.call.function_call_to_subprocess() call is made, the Worker(s) may have already started. By the time scalems.radical.runtime.subprocess_to_rp_task() executes, the Worker(s) have definitely started. We need to split the Worker launch from the Master launch and inspect the workload to decide how to provision the Worker(s).

As a first step, though, to facilitate the lateral move of scalems.call, we can provision one Worker with N-1 cores and raise an error if the submitted Task is incompatible.
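A minimal sketch of that first step, in plain Python with no RP dependency. The names `TaskRequirements`, `worker_cores`, `check_task`, and `IncompatibleTaskError` are hypothetical and not part of scalems or RP. It also assumes the one core held back from the N-core Pilot is the one occupied by the Master.

```python
from dataclasses import dataclass


class IncompatibleTaskError(ValueError):
    """Raised when a task cannot run on the single provisioned Worker."""


@dataclass(frozen=True)
class TaskRequirements:
    # Hypothetical stand-in for the resource fields carried by a submitted task.
    ranks: int = 1
    cores_per_rank: int = 1

    @property
    def cores(self) -> int:
        """Total cores the task needs (ranks times threads per rank)."""
        return self.ranks * self.cores_per_rank


def worker_cores(pilot_cores: int) -> int:
    """Cores available to the one Worker: N - 1 (assumption: Master uses one)."""
    if pilot_cores < 2:
        raise ValueError("Pilot needs at least 2 cores to host a Master and a Worker.")
    return pilot_cores - 1


def check_task(task: TaskRequirements, pilot_cores: int) -> None:
    """Raise IncompatibleTaskError if the task cannot fit on the Worker."""
    if task.ranks < 1 or task.cores_per_rank < 1:
        raise IncompatibleTaskError("ranks and cores_per_rank must be positive.")
    available = worker_cores(pilot_cores)
    if task.cores > available:
        raise IncompatibleTaskError(
            f"Task needs {task.cores} cores but the Worker has only {available}."
        )
```

With an 8-core Pilot, `check_task(TaskRequirements(ranks=7), 8)` passes, while a task with 4 ranks of 2 cores each is rejected before submission.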

The follow-up should rely on the new raptor protocol that @andre-merzky is working on, if at all possible, to manage Workers, or @eirrgang will be performing completely redundant work that is immediately obsolete.

The biggest short-term impact will be lack of flexibility with cores allocated to (OpenMP) threads versus ranks.

See also #302

Additional notes from design discussion

Resource constraints:

Other set-up details:

eirrgang commented 1 year ago

scalems.call was a workaround that wrapped a serialized function call into a command line executable task for dispatching through traditional RP executable Task execution. This was pursued to give us a chance to move forward with other development while refining raptor.
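To illustrate the general idea of that workaround (not the actual scalems.call implementation), here is a rough sketch using only the standard library. The helper `call_to_command_line` and the file names `call.pkl` and `result.pkl` are hypothetical. It uses `pickle`, which can only serialize functions that are importable by name in the child process.

```python
import pickle
import subprocess
import sys
import tempfile
from pathlib import Path

# Script run by the child interpreter: load the pickled call, execute it,
# and pickle the return value to the requested output file.
_RUNNER = """
import pickle, sys
with open(sys.argv[1], 'rb') as fh:
    func, args, kwargs = pickle.load(fh)
with open(sys.argv[2], 'wb') as fh:
    pickle.dump(func(*args, **kwargs), fh)
"""


def call_to_command_line(func, args, kwargs, workdir):
    """Serialize a function call and return (argv, result_file).

    The returned argv is an ordinary command line, so it could be handed to
    any executor that runs executables (here, subprocess).
    """
    workdir = Path(workdir)
    call_file = workdir / "call.pkl"
    result_file = workdir / "result.pkl"
    call_file.write_bytes(pickle.dumps((func, tuple(args), dict(kwargs))))
    argv = [sys.executable, "-c", _RUNNER, str(call_file), str(result_file)]
    return argv, result_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        argv, result_file = call_to_command_line(divmod, (17, 5), {}, tmp)
        subprocess.run(argv, check=True)  # stands in for the task executor
        print(pickle.loads(result_file.read_bytes()))
```

The point of the pattern is that the executor only ever sees an executable and its arguments; the function call itself travels through files.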

There does not appear to be a good way to simply port scalems.call to raptor. We don't have to disable scalems.call completely, but we cannot simply dispatch the same workflow script to be executed on raptor. The function_call_to_subprocess() sequence of calls just doesn't make sense in the raptor context.

eirrgang commented 1 year ago

update: We should be able to salvage this with TASK_EXECUTABLE mode. The raptor master should be able to manage such a task without a worker.