platforms: broadcasted platform ignored after ssh failure

oliver-sanders commented 2 months ago

We can use broadcasts to change the platform a task submits to.

Under normal circumstances this works fine. However, when hosts go down and the submission is retried, the broadcast seems to be forgotten and the new submission uses the configured platform.

This could lead to jobs being submitted to the wrong platform.
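To make the expected behaviour concrete, here is a toy model of the symptom. It is a sketch only: `apply_broadcasts`, `platforms_tried` and the `reapply_on_retry` switch are hypothetical names invented for illustration, not cylc-flow code, and the model does not claim to identify the real cause.

```python
from copy import deepcopy


def apply_broadcasts(rtconfig, broadcast):
    """Return a copy of rtconfig with the broadcast settings overlaid."""
    merged = deepcopy(rtconfig)
    merged.update(broadcast)
    return merged


def platforms_tried(rtconfig, broadcast, attempts, reapply_on_retry=True):
    """Return the platform used for each submission attempt.

    With reapply_on_retry=False, retries read the unmodified config,
    which mimics the symptom described above (illustration only,
    not a diagnosis of the real cause).
    """
    tried = []
    for attempt in range(attempts):
        if attempt == 0 or reapply_on_retry:
            config = apply_broadcasts(rtconfig, broadcast)
        else:
            config = rtconfig
        tried.append(config["platform"])
    return tried
```

With a configured platform of `localhost` and a broadcast of `platform=one`, every attempt should report `one`; the faulty variant falls back to `localhost` on retry.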

Reproducible example:

Run the following workflow.

Once the "remote_init_one" and "remote_init_two" tasks have submitted, break your SSH config to force subsequent calls to fail.

[scheduling]
    [[graph]]
        R1 = remote_init_one & remote_init_two & local => remote

[runtime]
    # ensure that the workflow has remote-init'ed on platforms "one" and "two"
    [[remote_init_one]]
        platform = one-bg
    [[remote_init_two]]
        platform = two-bg

    # change the platform of "remote" via broadcast
    [[local]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" -n remote -p "${CYLC_TASK_CYCLE_POINT}" -s 'platform=one'
            sleep 10
        """

    [[remote]]
        platform = localhost

The "remote" task should attempt to submit to each of the hosts in the "one" platform. All SSH connections will fail so the task will run out of hosts and become submit-failed.

However, that's not what happens! Running this command reveals that after running out of hosts, the task then attempted to submit to localhost (the platform defined before the broadcast):

$ grep 'DEBUG - \[jobs-submit cmd\].*1/remote/01' --color=never ~/cylc-run/<workflow>/log/scheduler/log
... ssh ... one.01 ... cylc jobs-submit ... 1/remote/01
... ssh ... one.02 ... cylc jobs-submit ... 1/remote/01
... cylc jobs-submit ... 1/remote/01

Note: This erroneous submission appears to happen after all the hosts of the broadcasted platform have been exhausted, which may help pin down the offending code pathway.

Interestingly, when I try this, the attempted submission to localhost actually fails due to the qsub command not being in $PATH. In my case platform "one" uses PBS, so this suggests that it is attempting to submit to localhost, but with the configuration of "one"?!

wxtim commented 2 months ago

Looks like it's remote initing on the same host?

    [[remote_init_one]]
        platform = one-bg
    [[remote_init_two]]
        platform = one-bg

oliver-sanders commented 2 months ago

Typo, corrected in OP

wxtim commented 2 months ago

Replicated it with local site installation. Now working out how to replicate in a more debuggable way.

oliver-sanders commented 2 months ago

I think this example should be enough to debug with. Here's my stab in the dark at a debugging strategy, if it helps.

I would start by identifying the bits of the code where a host is selected and logging each of these. This should allow you to pinpoint the particular branch / method where the incorrect host comes from. Given the convoluted nature of the call/callback code, the same method can be called multiple times, so this might not actually be that much help. If so, I would then try to log the relevant function calls (likely the prep/submit methods and their 255 callbacks) so you can map out the call chain. After that, no idea!
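One low-effort way to log the relevant calls is a temporary decorator. This is a minimal sketch: `log_calls`, `select_host` and `on_ssh_255` are hypothetical stand-ins written for this example (none of them exist in cylc-flow), and the host names are borrowed from the log excerpt in the OP.

```python
import functools
import logging

LOG = logging.getLogger("host-selection-debug")


def log_calls(func):
    """Log each call to ``func`` with its arguments and return value."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        LOG.debug("CALL %s args=%r kwargs=%r", func.__qualname__, args, kwargs)
        result = func(*args, **kwargs)
        LOG.debug("RETURN %s -> %r", func.__qualname__, result)
        return result
    return wrapper


# Hypothetical stand-ins for the real prep/submit methods and their
# 255 (SSH failure) callbacks; none of these names exist in cylc-flow.
@log_calls
def select_host(platform_name, bad_hosts):
    """Return the first host of the platform not yet marked as bad."""
    hosts = {"one": ["one.01", "one.02"], "localhost": ["localhost"]}
    for host in hosts[platform_name]:
        if host not in bad_hosts:
            return host
    return None


@log_calls
def on_ssh_255(platform_name, failed_host, bad_hosts):
    """Mark the failed host as bad, then try to select another one."""
    bad_hosts.add(failed_host)
    return select_host(platform_name, bad_hosts)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format="%(message)s")
    bad = set()
    host = select_host("one", bad)
    while host is not None:
        host = on_ssh_255("one", host, bad)
```

Applied to the real methods, the interleaved CALL/RETURN lines should show which call first receives or returns the configured platform rather than the broadcasted one.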

wxtim commented 2 months ago

Checks for similar bugs:

Search for rtconfig\[["']platform["']\]:

data_store_mgr.runtime_from_config - Looks like it's used to initialize fields at startup so should be safe to not check for broadcasts. Checked by looking at TUI.
subprocpool.SubProcPool.run_command_exit - Functionally safe because it's only used for logging. It might conceivably produce strange log output, but even this shouldn't happen if the callback is given sensible arguments. An apparent bug found in this code during the investigation disappeared once the fix in #6330 was made.
All other lookups are in task_job_mgr.TaskJobManager._prep_submit_task_job, on a function-scoped copy of the rtconfig which has broadcasts applied.
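For anyone repeating this check, the search can be scripted. This is a sketch only: `find_platform_lookups` is a hypothetical helper written for this example, and only the regular expression comes from the comment above.

```python
import re
from pathlib import Path

# The pattern quoted above: rtconfig['platform'] or rtconfig["platform"].
PLATFORM_LOOKUP = re.compile(r"""rtconfig\[["']platform["']\]""")


def find_platform_lookups(root):
    """Yield (path, line number, line) for each matching line under root."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if PLATFORM_LOOKUP.search(line):
                yield path, lineno, line.strip()
```

Pointing `find_platform_lookups` at a checkout of the source tree and printing each tuple gives a list of lookups to review one by one, as was done above.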

oliver-sanders commented 1 month ago

Here's a version of the workflow in the OP that has been adapted to use [remote]host and [job]batch system rather than platform.

This example does not replicate the bug (presumably uses a different code pathway):

[scheduling]
    [[graph]]
        R1 = remote_init_one & remote_init_two & local => remote

[runtime]
    # ensure that the workflow has remote-init'ed on platforms "one" and "two"
    [[remote_init_one]]
        [[[remote]]]
            host = one.login.01
    [[remote_init_two]]
        [[[remote]]]
            host = two.login.01

    # change the host of "remote" via broadcast
    [[local]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" -n remote -p "${CYLC_TASK_CYCLE_POINT}" -s '[remote]host=one.login.01'
            sleep 10
        """

    [[remote]]
        [[[remote]]]
            host = localhost
        [[[job]]]
            batch system = pbs

Posting this here as I'm using this to test the fix to ensure it still works as intended.

oliver-sanders commented 1 month ago

Closed by #6330

cylc / cylc-flow

platforms: broadcasted platform ignored after ssh failure #6320
