cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
335 stars 94 forks source link

custom xtriggers: undesired behaviour w/ illegal dict #3237

Open sadielbartholomew opened 5 years ago

sadielbartholomew commented 5 years ago

In the results dictionary returned in the satisfied signature (True, results) from a custom external trigger function, certain "bad" dictionary values will cause the suite to shut down.

I will at some point survey the nature of what constitutes a "bad" value; possibly they are just values which are other structures such as (further) dictionaries (forming an overall nested results dict), as in the example used here.

We can document that results must be in a certain form, but can we also catch such errors for invalid return values from xtriggers so that they do not propagate & cause strange task states & suite shutdown, as described below? Validating the environment variables and their values produced by the results dict to ensure they are sound before passing them to downstream tasks seems like the ideal solution, but that may be too tricky (or inefficient, etc.) to set up.

Release version(s) and/or repository branch(es) affected? master branch i.e. 8.0a0 (+ earlier versions).

Steps to reproduce the bug

Example with dictionary values in results

Using the xtrigger given in cylc/cylc-doc#41 in this example suite, which with the "nested" argument returns results dict values that are themselves dictionaries:

[scheduling]
    initial cycle point = now
    [[xtriggers]]
        breakable_xtrigger = breakable_script("nested")
    [[dependencies]]
        [[[PT1M]]]
            graph = """
                    @breakable_xtrigger => wait_around
                    to_get_some_cycles_going =>  wait_around
                    """

produces the logically wrong task state even for the independent task to_get_some_cycles_going, as shown in the screenshot below, & a KeyError shuts down the suite, with the following in the log/suite/log:

2019-07-24T13:48:09+01:00 INFO - [to_get_some_cycles_going.20190724T1348+01] status=running: (received)succeeded at 2019-07-24T13:48:09+01:00 for job(01)
2019-07-24T13:48:10+01:00 ERROR - 'breakable_xtrigger_dictA'
    Traceback (most recent call last):
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/scheduler.py", line 254, in start
        self.run()
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/scheduler.py", line 1532, in run
        self.process_task_pool()
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/scheduler.py", line 1169, in process_task_pool
        self.suite, itasks, self.config.run_mode('simulation'))
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/task_job_mgr.py", line 199, in submit_task_jobs
        prepared_tasks, bad_tasks = self.prep_submit_task_jobs(suite, itasks)
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/task_job_mgr.py", line 175, in prep_submit_task_jobs
        check_syntax=check_syntax)
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/task_job_mgr.py", line 758, in _prep_submit_task_job
        poverride(rtconfig, overrides, prepend=True)
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/parsec/util.py", line 225, in poverride
        poverride(target[key], val, prepend)
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/parsec/util.py", line 225, in poverride
        poverride(target[key], val, prepend)
      File "/net/home/h06/sbarth/cylc.git/cylc/flow/parsec/OrderedDict.py", line 42, in __getitem__
        return OrderedDict.__getitem__(self, key)
    KeyError: 'breakable_xtrigger_dictA'
2019-07-24T13:48:10+01:00 CRITICAL - Suite shutting down - 'breakable_xtrigger_dictA'

Expected behavior A above, I'm not sure the best way to manage this, but there should be way to prevent bad xtrigger return values from making the suite behave strangely in runtime (not just for downstream tasks) &/or shutting down.

Screenshots

xtrigger_nested

hjoliver commented 5 years ago

@sadielbartholomew - see my comments in #3232

If an xtrigger function returns an illegal result, it is technically broken and your suite is presumably going to stall with tasks waiting on it. (Because we - correctly - don't treat a broken xtrigger as "satisfied").

So what to do about that? We could:

  1. have the suite shut down cleanly, with warnings/errors about the broken function
  2. or keep going, even if stalled, but issue prominent warnings/errors (in the suite log at Cylc 7; hopefully more in the Cylc 8 UI)

Number 2. may be preferable so long as you can fix the trigger function in the live suite, but I don't think that 1. is "strange behaviour" :grin:

hjoliver commented 5 years ago

(Sorry, I see your "strange behaviour" comment doesn't relate to the suite shutdown).

hjoliver commented 5 years ago

Re-reading more carefully (sorry)! I guess we need to distinguish between "broken" xtrigger functions that return totally illegal results (i.e. anything but (T/F, dict)) and ones that return success (T, dict) but with an illegal results dict.

In the latter case, it is correct that the downstream task triggered (because the trigger was satisfied) but the illegal results dict caused problems.

trwhitcomb commented 5 years ago

I ran into this as well while trying to write some custom xtrigger functions. In my case, the suite didn't stall because the tasks were waiting on it (as @hjoliver) but rather crashed in a json decode that led to the suite controller itself becoming unresponsive - probably not the intended behavior. I'm OK with the second option above, but having an xtrigger function return totally illegal results shouldn't freeze things up.

sadielbartholomew commented 5 years ago

Thanks for reporting your observations @trwhitcomb. That is definitely a prompt for us to investigate further. In general I agree that a bad dict in (T, dict) should not cause an unclean suite shut down, & certainly not a freeze, but @hjoliver's outlined case:

  1. have the suite shut down cleanly, with warnings/errors about the broken function

is correct, so would not be a bug. Actually when I wrote up the Issue (probably only very clearly, sorry!) as well as the shutdown it appeared that they were tasks emerging or persisting in states that were not right, as depicted somewhat in the Cylc Review screenshot, but I still need to the bottom of the 'strangeness' occurring in that case.

Also, for the clean shutdown, I think a specific error regarding an illegal xtrigger return dict would be much better than the less-transparent ultimate error, e.g. the KeyError in the case from my opening comment.

To go forward, I'll try out various cases for an illegal dict & report here on the resulting suite behaviour in each case. Then we can discuss what is correct/acceptable & what needs changing. I'll try to recreate the freeze as you described, which will need a fix.

(Sorry @hjoliver for the delay in responding to your previous comments, & those on the other xtrigger Issue I raised at a similar time, #3232, when I went back to look into these after some priority assigned work, I ran into #3275 which seemed the most pressing xtrigger aspect to address!)

hjoliver commented 5 years ago

@sadielbartholomew @trwhitcomb - yes, agreed that:

matthewrmshin commented 5 years ago

I vote for suite staying up and remain responsive in all cases. The events should be logged as errors like task failures.

oliver-sanders commented 4 years ago

Should be closed with #3497