oliver-sanders opened 9 months ago
Looked into named pipes as a training exercise and gave this a go. It worked fine, but named pipes don't work across hosts, so unfortunately that was a dead end.
Supporting file system events would probably be much easier than finding a file system + kernel combination that actually supports this pattern (e.g. inotify doesn't work on network file systems), so that's probably a dead end too.
However, I swapped out the named pipes for a simple file poller that calls `readline` to check for new lines. This seems to work pretty well and is still substantially more efficient than Cylc's existing poller implementation. Implementation was surprisingly easy:
https://github.com/oliver-sanders/cylc-flow/pull/new/local-job-poller
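For a rough idea of how the poller works (a minimal sketch of the pattern, not the code in the branch above): keep a file handle open per `job.status` file and call `readline()` on each poll, so only lines appended since the last poll are ever read:

```python
from pathlib import Path


class StatusFilePoller:
    """Incrementally read new lines from a single job.status file.

    The file handle stays open between polls, so each poll only reads
    lines appended since the last call: readline() returns '' at EOF
    rather than blocking or raising.
    """

    def __init__(self, path):
        self.path = Path(path)
        self._file = None

    def poll(self):
        """Return a list of complete new lines since the last poll."""
        if self._file is None:
            try:
                self._file = open(self.path)
            except FileNotFoundError:
                # the job hasn't created its status file yet
                return []
        lines = []
        while True:
            pos = self._file.tell()
            line = self._file.readline()
            if not line:
                break  # nothing new (for now)
            if not line.endswith('\n'):
                # partial line still being written; rewind and
                # pick it up complete on the next poll
                self._file.seek(pos)
                break
            lines.append(line.rstrip('\n'))
        return lines
```

A `readline()` at EOF just returns an empty string, so polling an unchanged file is a single cheap read rather than a re-read of the whole file.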
Note, this doesn't replace the existing task polling logic, which can continue to run alongside. It replaces push messaging (i.e. `zmq` or `ssh+zmq`).
It's currently running `readline` on each status file every second (which is roughly every main loop iteration). Will need to test in anger, but I suspect this will put fairly minimal load on the filesystem (it's only trying to read one line, not the whole file). The rate could be lowered a bit, and the pollers could be pushed into their own process if performance is a concern.
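As a sketch of that cadence (`status_file_paths` and `handle_message` are placeholder names, and `StatusFilePoller` is the sketch from above):

```python
import time

# hypothetical driver: one pass over every active status file per
# second, roughly matching the scheduler's main loop cadence
pollers = {path: StatusFilePoller(path) for path in status_file_paths}

while True:
    for path, poller in pollers.items():
        for line in poller.poll():
            handle_message(path, line)  # placeholder callback
    time.sleep(1)  # increase this to lower the polling rate
```

Pushing the pollers into their own process would amount to running this loop under something like `multiprocessing.Process` and feeding the results back over a queue.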
TODO:
One of the challenges facing containerised deployment of Cylc is the requirement for Cylc to be installed in the job environment.
This problem (and the Cylc networking requirements around it) has proven to be a pain point in cloud deployments, especially for those not familiar with Cylc.
This is also a bit awkward as it means you can't just use off-the-shelf containers to run your jobs in; you have to install Cylc into the container first. That's a lot of extra work (e.g. install mamba, create and install an environment, install the wrapper script, etc.) and it bloats the container size, which isn't great. Ideally the Cylc comms mechanics would be separate from the job environments so that workflow writers can focus on the execution environments and sys admins can focus on the Cylc infrastructure.
Two possible solutions to address this problem:
1. Provide a minimal, standalone Cylc client containing just the commands needed in the job environment (i.e. `cylc remote-init`, `cylc message` and `cylc broadcast`).
2. Remove the need for the `cylc` client in the job environment altogether:
   * Rather than using the `cylc` client to write messages to the `job.status` file, have the job script write to the file directly.
   * Expose the path to the `job.status` file via an environment variable so the job can write custom messages to it.
   * Pick the messages up scheduler-side via `cylc jobs-poll` commands.

Option 1 is more sophisticated as it would open up access to the full GraphQL API; however, it still requires Cylc to be installed in the container.
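To illustrate option 2, the job-side half could be as simple as appending a line to the status file; the `CYLC_JOB_STATUS_FILE` variable name and the `CYLC_MESSAGE=...` line format below are assumptions for illustration, not established Cylc conventions:

```python
import os
from datetime import datetime, timezone


def write_status_message(message, severity='INFO'):
    """Append a custom message line to this job's status file.

    Assumes the scheduler exported the status file path as
    CYLC_JOB_STATUS_FILE (hypothetical variable name) and that the
    scheduler-side poller understands this key=value line format.
    """
    status_file = os.environ['CYLC_JOB_STATUS_FILE']
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(status_file, 'a') as handle:
        handle.write(f'CYLC_MESSAGE={timestamp}|{severity}|{message}\n')


write_status_message('data ready')
```

The appeal is that nothing Cylc-specific is needed job-side: anything that can append a line to a file can participate.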
Option 2, especially with a long-lived scheduler-side poller process, is starting to look like an attractive solution to me. Essentially it's just an extra process which maintains a list of the `job.status` file paths of active tasks and either registers filesystem events (for push notifications) or simply polls them (for pull notifications) on much shorter timeframes than is typical for conventional Cylc task polling. The poller process would queue messages via the GraphQL interface (over ZMQ) when new messages are detected, so it would be completely asynchronous to the scheduler's main loop (no subprocpool burden).
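A sketch of what that poller process might look like, reusing the `StatusFilePoller` sketch from earlier (the queue consumer below just prints; a real implementation would queue the GraphQL mutation over ZMQ at that point):

```python
import asyncio


class JobStatusWatcher:
    """Hypothetical long-lived watcher for active job.status files.

    Polls registered files on a short interval and pushes new lines
    onto a queue, keeping message detection completely off the
    scheduler's main loop.
    """

    def __init__(self, interval=1.0):
        self.interval = interval
        self.pollers = {}  # path -> StatusFilePoller (earlier sketch)
        self.queue = asyncio.Queue()

    def register(self, path):
        """Start watching a newly active task's job.status file."""
        self.pollers[path] = StatusFilePoller(path)

    def unregister(self, path):
        """Stop watching once the task is no longer active."""
        self.pollers.pop(path, None)

    async def poll_loop(self):
        while True:
            for path, poller in list(self.pollers.items()):
                for line in poller.poll():
                    await self.queue.put((path, line))
            await asyncio.sleep(self.interval)

    async def consume_loop(self):
        while True:
            path, line = await self.queue.get()
            # stand-in for queueing a message via the GraphQL
            # interface (over ZMQ)
            print(f'{path}: {line}')
```

Registration and unregistration would follow task submission and completion, so the watched set stays bounded by the number of active tasks.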