cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Optional outputs can cause legit partially-satisfied prerequisites #5729

Open hjoliver opened 1 year ago

hjoliver commented 1 year ago

We need to analyze partially satisfied prerequisites before using them as an excuse to stall. If some correspond to optional outputs, then partial satisfaction may not indicate an error condition.

Simplified version of user workflow:

[scheduling]
   [[graph]]
      R1 = """
         FAM:fail-all? => bad
         FAM:succeed-any? => good
     """
[runtime]
   [[FAM]]
   [[a]]
      inherit = FAM
      script = true
   [[b]]
      inherit = FAM
      script = false  # OK, graph says success is optional for me 
   [[bad]]
   [[good]]

Result:

INFO - [1/good running job:01 flows:1] => succeeded
WARNING - Partially satisfied prerequisites:
      * 1/bad is waiting on ['1/a:failed']
CRITICAL - Workflow stalled

The problem is: bad depends on multiple optional outputs, so we should not be surprised that it ends up partially satisfied.

Currently the only automatic solution is ... our old friend the suicide trigger 😠

"FAM:succeed-any? & FAM:finish-all? => !bad"

(We have to wait for all members of FAM to finish, because bad could be spawned after good triggers).
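To make the example concrete, here is a minimal Python sketch of how the family qualifiers evaluate for this workflow. It is not cylc-flow code: the helper functions and the `fam` state dictionary are hypothetical names, invented here purely for illustration.

```python
# Hypothetical model: final state of each member of FAM in the example above.
fam = {"a": "succeeded", "b": "failed"}


def fail_all(states):
    """FAM:fail-all is generated only if every member failed."""
    return all(s == "failed" for s in states.values())


def succeed_any(states):
    """FAM:succeed-any is generated as soon as one member succeeded."""
    return any(s == "succeeded" for s in states.values())


def finish_all(states):
    """FAM:finish-all is generated once every member succeeded or failed."""
    return all(s in ("succeeded", "failed") for s in states.values())


# "FAM:fail-all? => bad" amounts to one prerequisite per family member.
bad_prereqs = {f"1/{name}:failed": state == "failed" for name, state in fam.items()}
waiting_on = [output for output, satisfied in bad_prereqs.items() if not satisfied]

print(waiting_on)  # the outputs that 1/bad is still waiting on
```

With `a` succeeding and `b` failing, `1/b:failed` is satisfied and `1/a:failed` is not, which is the partial satisfaction reported in the log. Under the same assumptions both `succeed_any(fam)` and `finish_all(fam)` hold, which is the condition the suicide trigger uses to remove bad.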

hjoliver commented 1 year ago

To refine this a bit, it's not specifically to do with family triggers.

"a? & b? => c  # with a fails and b succeeds"

or even:

"foo:x? & foo:y? => bar  # with :x generated but not :y"

This will stall with c (and bar) partially satisfied.

The graph says: trigger c only if both a and b succeed. Both are expected to fail sometimes, so a partially satisfied c is not a reason to assume an error has occurred.
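A rough sketch of the kind of analysis suggested here, in Python. This is not the scheduler's actual data model: the `Prereq` class and both functions are hypothetical, and the sketch assumes the upstream tasks have already finished, so their missing outputs can no longer be generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prereq:
    """One upstream output that a waiting task depends on (hypothetical model)."""
    output: str      # e.g. "1/a:succeeded"
    optional: bool   # True if the output is marked with "?" in the graph
    satisfied: bool  # True once the upstream output has been generated


def is_partially_satisfied(prereqs):
    """True if some, but not all, of the prerequisites are satisfied."""
    states = [p.satisfied for p in prereqs]
    return any(states) and not all(states)


def stall_worthy(prereqs):
    """Return the unsatisfied prerequisites that should count towards a stall.

    Unsatisfied prerequisites on optional outputs are skipped, because the
    graph already says those outputs may legitimately never appear.
    """
    if not is_partially_satisfied(prereqs):
        return []
    return [p for p in prereqs if not p.satisfied and not p.optional]


# "a? & b? => c", where a fails and b succeeds.
c_prereqs = [
    Prereq("1/a:succeeded", optional=True, satisfied=False),
    Prereq("1/b:succeeded", optional=True, satisfied=True),
]
print(stall_worthy(c_prereqs))  # nothing left to report: only optional outputs are missing
```

If a were a required output instead (`optional=False`), the same call would return that prerequisite, and reporting a stall would still be appropriate.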

hjoliver commented 1 year ago

Assigned to the 8.x milestone, but we should bring it forward if possible - it's a bit nasty for users to understand and deal with.

oliver-sanders commented 7 months ago

Sticking the question label on this to flag for our next meeting as this will probably require some hashing out of details.

oliver-sanders commented 13 hours ago

Documented this limitation and the suicide trigger workaround in https://github.com/cylc/cylc-doc/pull/772

hjoliver commented 6 hours ago

Documenting this is good for now, but I still think that, in general, we should not stall if the ONLY reason for the stall is unsatisfied optional outputs. It shouldn't need suicide triggers. From memory of the initial discussion about this, I think @dpmatthews did not like this suggestion? If so, what's the reason?