Open oliver-sanders opened 1 year ago
:tada: yes this would be a great improvement!
(We did not have the option to use asyncio when the current system was devised, of course).
I had a look at this today as a training exercise. Conclusions: if we used zmq queues and communicated updates from the submission loop back via task messages (i.e. sent the task message `preparing` rather than calling `itask.state_reset(TASK_STATUS_PREPARING)`), then we could actually run the submission loop on another host if we wanted to, which is an interesting possibility.

Here's how the top-level code would look after this change:
```python
async def submitter(bad_hosts, submission_queue, subprocpool):
    """Job submitter thinggy.

    When you want jobs to be submitted, push the corresponding tasks
    into the submission_queue and lean back, the submitter does the work
    for you.
    """
    cache = SelectCache()
    while True:
        itask = await submission_queue.get()
        try:
            await _submit(cache, itask, bad_hosts, subprocpool)
        except (JobSyntaxError, PlatformError, SubmissionError):
            pass  # TODO: => submit-fail
        else:
            pass  # TODO: => submitted
```
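As a standalone illustration of this consumer pattern (names hypothetical, not the Cylc API), a submitter coroutine can drain an `asyncio.Queue` while producers just push tasks and move on:

```python
import asyncio

async def submitter(submission_queue, submitted):
    """Drain the queue forever, 'submitting' each task as it arrives."""
    while True:
        itask = await submission_queue.get()
        submitted.append(itask)  # stand-in for the real submission pipeline
        submission_queue.task_done()

async def main():
    submission_queue = asyncio.Queue()
    submitted = []
    worker = asyncio.create_task(submitter(submission_queue, submitted))
    for itask in ('task1', 'task2'):
        submission_queue.put_nowait(itask)  # push tasks and lean back
    await submission_queue.join()  # wait for the submitter to catch up
    worker.cancel()
    return submitted

submitted = asyncio.run(main())
```

The producer never blocks on submission; it only enqueues and carries on, which is exactly the decoupling the proposal is after.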
```python
async def _submit(cache, itask, bad_hosts, subprocpool):
    """The job submission pipeline for a single task."""
    rtconfig = _get_rtconfig(itask, broadcast_mgr)
    select_host = await _get_host_selector(
        cache, itask, rtconfig, bad_hosts, subprocpool)
    for platform, host in select_host:
        try:
            await check_syntax(itask, platform, host)
            await remote_init(cache, platform, host)
            await remote_file_install(cache, platform, host)
            await submit_job(cache, itask, platform, host)
            break
        except SSHError:
            # LOG.warning(...)
            bad_hosts.add(host)
            continue
    else:
        raise PlatformError(f'no hosts available for {platform}')
```
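The bad-host fallback in that loop can be demonstrated in isolation (a synchronous sketch with hypothetical names, mirroring the `for`/`else` structure):

```python
class SSHError(Exception):
    """Stand-in for a failed SSH connection."""

def select_host(hosts, bad_hosts, reachable):
    """Try hosts in order; remember failures; raise if none are left."""
    for host in hosts:
        if host in bad_hosts:
            continue  # known bad, don't waste an SSH attempt
        try:
            if host not in reachable:
                raise SSHError(host)
        except SSHError:
            bad_hosts.add(host)  # skip this host on future submissions
            continue
        return host
    else:
        # only reached if every host was skipped or failed
        raise RuntimeError('no hosts available')

bad_hosts = set()
host = select_host(['hostA', 'hostB'], bad_hosts, reachable={'hostB'})
```

Because `bad_hosts` is shared, a host that fails for one submission is skipped by all subsequent ones.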
We can keep the batching behaviour by maintaining mappings as we currently do, e.g. the remote-init map logic can be handled like this (note that the caching part of the logic has been removed from the actual implementation of remote-init itself):
```python
async def remote_init(cache, platform, host):
    with suppress(KeyError):
        return await cache.remote_init_cache[platform['install target']]
    # cache a Task rather than a bare coroutine: a coroutine object can
    # only be awaited once, a Task can be awaited many times
    task = asyncio.ensure_future(_remote_init(platform, host))
    cache.remote_init_cache[platform['install target']] = task
    return await task
```
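A runnable sketch of this caching pattern (hypothetical names, not the Cylc code): several concurrent callers ask for the same install target, but the underlying remote-init runs only once and everyone shares the result.

```python
import asyncio

CALLS = 0

async def _remote_init(install_target):
    """Stand-in for the real remote-init work (hypothetical)."""
    global CALLS
    CALLS += 1
    await asyncio.sleep(0)  # simulate the remote round-trip
    return f'{install_target}: DONE'

async def remote_init(cache, install_target):
    """First caller kicks off remote-init; later callers share the result."""
    if install_target not in cache:
        # a Task can be awaited by many callers; a bare coroutine cannot
        cache[install_target] = asyncio.ensure_future(
            _remote_init(install_target))
    return await cache[install_target]

async def main():
    cache = {}
    return await asyncio.gather(
        *(remote_init(cache, 'hpc') for _ in range(3)))

results = asyncio.run(main())
```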
Batching the job-submit commands is a little funkier, but I think we might be able to do it something like this:
```python
async def submit_job(cache, itask, platform, host):
    """Run the job submission command."""
    key = (platform['name'], host)
    with suppress(KeyError):
        # a batch for this platform/host is already pending: join it
        batch, done = cache.job_submission_queue[key]
        batch.append(itask)
        return await done
    done = asyncio.get_running_loop().create_future()
    cache.job_submission_queue[key] = ([itask], done)
    # wait for other submissions to queue up behind us
    await asyncio.sleep(0)
    # submit the batch
    batch, _ = cache.job_submission_queue.pop(key)
    done.set_result(await _submit_jobs(batch, platform, host))
    return done.result()
```
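The effect of the `asyncio.sleep(0)` trick can be shown in stripped-down form (hypothetical names; a real version would also hand a result back to callers that joined an existing batch):

```python
import asyncio

SUBMITTED = []  # batches actually passed to the submit command

def _submit_jobs(batch):
    """Stand-in for running the real batch submission command."""
    SUBMITTED.append(batch)

async def submit_job(pending, itask):
    """Submit 'one' job; concurrent calls coalesce into one batch."""
    if pending:
        pending[0].append(itask)  # a batch is already open: join it
        return
    pending.append([itask])  # open a new batch
    # yield control so other submissions can queue up behind us
    await asyncio.sleep(0)
    _submit_jobs(pending.pop(0))  # submit everything that accumulated

async def main():
    pending = []
    await asyncio.gather(
        submit_job(pending, 'task1'),
        submit_job(pending, 'task2'),
    )

asyncio.run(main())
```

Both calls "submit their own job", but only one batch command runs, containing both tasks.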
So that we can continue to write the code from the perspective of a single job submission, rather than having to define this batching in the top-level code.
https://github.com/oliver-sanders/cylc-flow/pull/new/async-subproc-pool-early-experiment
Very nice!
The subprocpool allows us to run subprocesses asynchronously in a limited pool to avoid overwhelming the scheduler host.
Proposed Change
Re-write the subprocpool as `async` code so we can `await` subprocess calls from the code which issued them. If required, it would be possible to support both a call/callback and an async/await interface to the new pool.
I considered doing this during Cylc 8 development, however, at the time I figured this wasn't necessary and would bloat the project. In retrospect it would probably have saved time...
Why
The subprocpool was implemented in Cylc 7 using a call/callback pattern. Code that works with the subprocpool must issue commands, then, in a future main-loop iteration, check whether the callback has fired. This is a slightly icky pattern, but it worked ok until we needed to add remote-init / remote-file-install / platform-selection / host-selection into the job submission pipeline.
Conceptually this is simple. Here's an outline of how the job-submission / intelligent host-selection works: select a platform and host (skipping known bad hosts), check the job script syntax, remote-init and file-install the platform if needed, then submit the job, falling back to the next host on SSH failure.
But when you try to write this using a call/callback pattern it gets messy. The actual code is much, much longer and spread between multiple functions (which issue the calls) and callbacks (which update the shared state).
(Minor sidenote: it also separates you from the subprocess.Popen object, which would be really handy to have.)
How
The `async`/`await` interfaces absorb the call/callback and state-management aspects by abstracting them away. This leaves us to write the business logic, which is fairly straightforward; here's an example implementation:
Long Term Context
This would help us to break the main loop lockstep: https://github.com/cylc/cylc-flow/issues/3498
Submissions would go through faster and the load/complexity of the main loop would be reduced.
Decoupling job submission from the main loop would simplify the logic, but also give us a lot more flexibility over how we run job submission. E.g. we could push it out into another process or (using zmq queues) even onto another host (lightweight submission on another host controlled by the remote scheduler, i.e. very interesting cloud possibilities).
Pull requests welcome!