Overall workflow state / completed state

cylc / cylc-flow

Cylc: a workflow engine for cycling systems.

https://cylc.github.io

GNU General Public License v3.0


Overall workflow state / completed state #5701

Open TomekTrzeciak opened 1 year ago

TomekTrzeciak commented 1 year ago

Problem

Cylc has a clear concept of task and job states, but less so when it comes to the overall workflow state. For example, once the workflow has stopped, there is no easy way to tell the underlying reason without digging through the logs or database. In particular, for non-cycling workflows or ones with a finite number of cycles, it would be useful to easily tell apart normal termination (workflow reached and completed the final cycle) from an abnormal one (stalled, server crash, ...). Chatting to @oliver-sanders about it, this also seems to be a prerequisite for having proper support for sub-workflows as tasks in the future (couldn't find a specific issue for it).

Proposed Solution

A possible solution could be to add a workflow-wide status file akin to job.status that can be scanned for and interrogated for information.
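To make the idea a bit more concrete, here is a rough sketch of how such a file might be interrogated, assuming it used the same KEY=VALUE line layout as job.status. The field names below (CYLC_WORKFLOW_EXIT, CYLC_WORKFLOW_EXIT_TIME) and the helper parse_status are invented for illustration; nothing like this exists in Cylc today.

```python
def parse_status(text):
    """Parse KEY=VALUE lines into a dict, skipping blank or malformed lines."""
    status = {}
    for line in text.splitlines():
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.strip().partition("=")
        if sep and key:
            status[key] = value
    return status


# Hypothetical file content: these field names are assumptions, not Cylc's.
example = (
    "CYLC_WORKFLOW_EXIT=COMPLETED\n"
    "CYLC_WORKFLOW_EXIT_TIME=2023-08-24T10:00:00Z\n"
)
status = parse_status(example)
print(status.get("CYLC_WORKFLOW_EXIT"))
```

A scanning tool could then distinguish normal from abnormal termination by reading one small text file per workflow, without opening the database.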

oliver-sanders commented 1 year ago

For sub-workflows, we can currently use the workflow's exit code which kinda works, however, with this it is hard to tell the difference between a stopped workflow and a completed workflow.

We could add a new top-level workflow status for "completed" workflows. Currently this state can effectively be detected by querying the task-pool table in the database: if there are no entries, the workflow has completed.
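As a rough illustration of that check, the sketch below opens a workflow database read-only and tests whether the task pool table is empty. It assumes the database is SQLite and that the table is named task_pool; the function name and the path passed in are placeholders, and the check is only meaningful once the workflow has stopped.

```python
import sqlite3
from pathlib import Path


def task_pool_is_empty(db_path):
    """Return True if the task_pool table in the given database has no rows.

    Assumes a SQLite database containing a table named ``task_pool``.
    """
    # Open read-only so the check cannot create or modify the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM task_pool").fetchone()
    finally:
        conn.close()
    return count == 0
```

Opening the file read-only means a wrong path raises sqlite3.OperationalError instead of silently creating an empty database.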

hjoliver commented 1 year ago

> For sub-workflows, we can currently use the workflow's exit code which kinda works, however, with this it is hard to tell the difference between a stopped workflow and a completed workflow.

My sub-workflow example notes this, and addresses it by having the sub-workflow launch script (for the launcher task in the main workflow) check the DB for completion of a known final task in the sub-workflow:

# Sub-workflow stopped, but did it succeed?
# SUBWF_END_TASK has the form <point>/<task>; split it into its two parts.
cylc workflow-state \
    --max-polls=1 \
    --task="${SUBWF_END_TASK#*/}" \
    --point="${SUBWF_END_TASK%/*}" \
    --status=succeeded \
    "$SUBWF_ID"

However, your suggestion to use the task pool table is an improvement 🎉 I'll amend my example and alert the couple of NIWA teams with sub-workflow use-cases.

Also, a new top-level workflow status for "completed" is a good idea.

oliver-sanders commented 3 months ago

It would be a good idea to make accessing the "completed" status as easy as possible, since this is something that tools like cylc scan will need to do.

Ideally we wouldn't need to go to the database at all (managing database connections is a hassle). Perhaps a .service file, or a field thereof?

oliver-sanders commented 3 months ago