Open sadielbartholomew opened 5 years ago
@sadielbartholomew - see my comments in #3232
If an xtrigger function returns an illegal result, it is technically broken and your suite is presumably going to stall with tasks waiting on it. (Because we - correctly - don't treat a broken xtrigger as "satisfied").
So what to do about that? We could:
Number 2. may be preferable so long as you can fix the trigger function in the live suite, but I don't think that 1. is "strange behaviour" :grin:
(Sorry, I see your "strange behaviour" comment doesn't relate to the suite shutdown).
Re-reading more carefully (sorry)! I guess we need to distinguish between "broken" xtrigger functions that return totally illegal results (i.e. anything but (T/F, dict)
) and ones that return success (T, dict)
but with an illegal results dict.
In the latter case, it is correct that the downstream task triggered (because the trigger was satisfied) but the illegal results dict caused problems.
I ran into this as well while trying to write some custom xtrigger functions. In my case, the suite didn't stall because the tasks were waiting on it (as @hjoliver) but rather crashed in a json decode that led to the suite controller itself becoming unresponsive - probably not the intended behavior. I'm OK with the second option above, but having an xtrigger function return totally illegal results shouldn't freeze things up.
Thanks for reporting your observations @trwhitcomb. That is definitely a prompt for us to investigate further. In general I agree that a bad dict
in (T, dict)
should not cause an unclean suite shut down, & certainly not a freeze, but @hjoliver's outlined case:
- have the suite shut down cleanly, with warnings/errors about the broken function
is correct, so would not be a bug. Actually when I wrote up the Issue (probably only very clearly, sorry!) as well as the shutdown it appeared that they were tasks emerging or persisting in states that were not right, as depicted somewhat in the Cylc Review screenshot, but I still need to the bottom of the 'strangeness' occurring in that case.
Also, for the clean shutdown, I think a specific error regarding an illegal xtrigger return dict
would be much better than the less-transparent ultimate error, e.g. the KeyError
in the case from my opening comment.
To go forward, I'll try out various cases for an illegal dict
& report here on the resulting suite behaviour in each case. Then we can discuss what is correct/acceptable & what needs changing. I'll try to recreate the freeze as you described, which will need a fix.
(Sorry @hjoliver for the delay in responding to your previous comments, & those on the other xtrigger Issue I raised at a similar time, #3232, when I went back to look into these after some priority assigned work, I ran into #3275 which seemed the most pressing xtrigger aspect to address!)
@sadielbartholomew @trwhitcomb - yes, agreed that:
I vote for suite staying up and remain responsive in all cases. The events should be logged as errors like task failures.
Should be closed with #3497
In the
results
dictionary returned in the satisfied signature(True, results)
from a custom external trigger function, certain "bad" dictionary values will cause the suite to shut down.I will at some point survey the nature of what constitutes a "bad" value; possibly they are just values which are other structures such as (further) dictionaries (forming an overall nested
results
dict), as in the example used here.We can document that
results
must be in a certain form, but can we also catch such errors for invalid return values from xtriggers so that they do not propagate & cause strange task states & suite shutdown, as described below? Validating the environment variables and their values produced by theresults
dict to ensure they are sound before passing them to downstream tasks seems like the ideal solution, but that may be too tricky (or inefficient, etc.) to set up.Release version(s) and/or repository branch(es) affected?
master
branch i.e.8.0a0
(+ earlier versions).Steps to reproduce the bug
Example with dictionary values in
results
Using the xtrigger given in cylc/cylc-doc#41 in this example suite, which with the "nested" argument returns
results
dict values that are themselves dictionaries:produces the logically wrong task state even for the independent task
to_get_some_cycles_going
, as shown in the screenshot below, & a KeyError shuts down the suite, with the following in thelog/suite/log
:Expected behavior A above, I'm not sure the best way to manage this, but there should be way to prevent bad xtrigger return values from making the suite behave strangely in runtime (not just for downstream tasks) &/or shutting down.
Screenshots