Looks like it's remote initing on the same host?
[[remote_init_one]]
    platform = one-bg
[[remote_init_two]]
    platform = one-bg
Typo, corrected in OP
Replicated it with a local site installation. Now working out how to replicate it in a more debuggable way.
I think this example should be enough to debug with. Here's my stab in the dark at a debugging strategy, if it helps...
I would start by identifying the bits of the code where a host is selected and logging each of them. This should allow you to pinpoint the particular branch / method the incorrect host comes from. Given the convoluted nature of the call/callback code, the same method can be called multiple times, so this might not actually be that much help. If so, I would then try to log the relevant function calls (likely the prep/submit methods and their 255 callbacks) so you can map out the call chain. After that, no idea!
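To make that concrete, here's a minimal sketch of the kind of temporary tracing I have in mind; `trace_call` is a hypothetical helper, not part of cylc-flow, and you would wrap whichever prep/submit methods and callbacks you suspect.

```python
import functools
import logging

LOG = logging.getLogger(__name__)


def trace_call(func):
    """Log every call to (and return from) a suspect method so the
    call chain and the host/platform arguments can be mapped out."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        LOG.debug('CALL %s args=%r kwargs=%r', func.__qualname__, args, kwargs)
        result = func(*args, **kwargs)
        LOG.debug('RETURN %s -> %r', func.__qualname__, result)
        return result
    return wrapper


# Illustrative usage only: wrap the host-selection / submit-prep methods
# you suspect (e.g. TaskJobManager._prep_submit_task_job and its
# callbacks), run the failing example, then grep the debug log for
# CALL/RETURN lines to see which branch produced the unexpected host.
```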
Search for `rtconfig\[["']platform["']\]`:
`data_store_mgr.runtime_from_config`
- Looks like it's used to initialize fields at startup so should be safe to not check for broadcasts. Checked by looking at the TUI.

`subprocpool.SubProcPool.run_command_exit`
- Functionally safe because it's only used for logging. Might conceivably produce strange log output, but even this shouldn't happen if the callback is given sensible arguments - an apparent bug found in this code at this point during the investigation disappeared once the fix in #6330 was made.

`task_job_mgr.TaskJobManager._prep_submit_task_job`
- Operates on a function-scoped copy of the rtconfig which has broadcasts applied.
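For illustration, a minimal sketch of the pattern described above, i.e. overlaying broadcast settings onto a function-scoped copy of the task's runtime config before the platform is read; the function name and dict layout here are assumptions, not the real cylc-flow implementation.

```python
from copy import deepcopy


def with_broadcasts(rtconfig: dict, broadcasts: dict) -> dict:
    """Return a copy of the runtime config with broadcast settings
    overlaid (nested sections merged, scalar settings overwritten)."""
    result = deepcopy(rtconfig)
    for key, value in broadcasts.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = with_broadcasts(result[key], value)
        else:
            result[key] = value
    return result


rtconfig = {'platform': 'localhost', 'script': 'true'}
broadcasts = {'platform': 'one'}

# The submission path should read the broadcast-aware copy...
print(with_broadcasts(rtconfig, broadcasts)['platform'])  # -> one
# ...any code path that re-reads the original rtconfig instead will see
# the pre-broadcast platform, which is the failure mode in this issue.
print(rtconfig['platform'])  # -> localhost
```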
Here's a version of the workflow in the OP that has been adapted to use `[remote]host` and `[job]batch system` rather than `platform`.
This example does not replicate the bug (presumably uses a different code pathway):
[scheduling]
    [[graph]]
        R1 = remote_init_one & remote_init_two & local => remote

[runtime]
    # ensure that the workflow has remote-init'ed on platforms "one" and "two"
    [[remote_init_one]]
        [[[remote]]]
            host = one.login.01
    [[remote_init_two]]
        [[[remote]]]
            host = two.login.01

    # change the platform of "remote" via broadcast
    [[local]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" -n remote -p "${CYLC_TASK_CYCLE_POINT}" -s '[remote]host=one.login.01'
            sleep 10
        """

    [[remote]]
        [[[remote]]]
            host = localhost
        [[[job]]]
            batch system = pbs
Posting this here as I'm using this example to test the fix, to ensure it still works as intended.
Closed by #6330
We can use broadcasts to change the platform a task submits to.
Under normal circumstances this works fine; however, when hosts go down and the submission is retried, the broadcast seems to be forgotten and the new submission uses the configured platform.
This could lead to jobs being submitted to the wrong platform.
Reproducible example:
1. Run the following workflow.
2. Once the "remote_init_one" and "remote_init_two" tasks have submitted, break your SSH config to force subsequent calls to fail.

The "remote" task should attempt to submit to each of the hosts in the "one" platform. All SSH connections will fail, so the task will run out of hosts and become submit-failed.
However, that's not what happens! Running this command reveals that after running out of hosts, the task then attempted to submit to localhost (the platform defined before the broadcast):
Note: This erroneous submission appears to happen after all the hosts of the broadcasted platform have been exhausted, which may help pin down the offending code pathway.
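Purely as a guess at the shape of the offending pathway (none of these names are cylc-flow code), here's a sketch of how a host-exhaustion fallback could end up re-reading the pre-broadcast platform:

```python
def select_host(platform: dict, bad_hosts: set):
    """Return the first host of the platform that has not already failed."""
    for host in platform['hosts']:
        if host not in bad_hosts:
            return host
    return None  # all hosts of this platform have been exhausted


def attempt_submit(task: dict, platforms: dict, bad_hosts: set):
    # first attempt: the broadcast-aware config selects platform "one"
    platform = platforms[task['broadcast_rtconfig']['platform']]
    host = select_host(platform, bad_hosts)
    if host is None:
        # hypothetical fallback: if this re-reads the original
        # (pre-broadcast) rtconfig, the configured platform leaks back in
        platform = platforms[task['rtconfig']['platform']]
        host = select_host(platform, bad_hosts)
    return platform['name'], host


platforms = {
    'one': {'name': 'one', 'hosts': ['one.login.01', 'one.login.02']},
    'localhost': {'name': 'localhost', 'hosts': ['localhost']},
}
task = {
    'rtconfig': {'platform': 'localhost'},       # as configured
    'broadcast_rtconfig': {'platform': 'one'},   # after the broadcast
}
bad_hosts = {'one.login.01', 'one.login.02'}     # SSH is broken
print(attempt_submit(task, platforms, bad_hosts))  # -> ('localhost', 'localhost')
```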
Interestingly, when I try this, the attempted submission to `localhost` actually fails due to the `qsub` command not being in `$PATH`. In my case platform `one` uses PBS, so this suggests that it is attempting to submit to `localhost`, but with the configuration of `one`?!