cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Error message when trying to reload on a remote VM when it can't find `cylc` wrapper #6292

Open ColemanTom opened 3 months ago

ColemanTom commented 3 months ago

I'm not sure this is a bug, so I've not labelled it as one. I'm wondering if the error message can be improved at the end of the below output.

WARNING - $ ssh -oBatchMode=yes -oConnectTimeout=10 a-different-host env CYLC_VERSION=8.3.3 CYLC_ENV_NAME=cylc-8.3.3 PROJECT=my_project bash --login -c 'exec "$0" "$@"'
    /some/path/cylc psutil  # returned 127

    /some/path/cylc: /some/path/cylc: No such file or directory

ERROR - Cannot determine whether workflow is running on a-different-host.
    /home/miniconda3/envs/cylc-8.3.3/bin/python /home//miniconda3/envs/cylc-8.3.3/bin/cylc play workflow_name --host=localhost --color=always
CRITICAL - Cannot tell if the workflow is running
    Note, Cylc 8 cannot restart Cylc 7 workflows.
CylcError: Cannot determine whether workflow is running on a-different-host.
/home/miniconda3/envs/cylc-8.3.3/bin/python /home/miniconda3/envs/cylc-8.3.3/bin/cylc play workflow_name --host=localhost --color=always

What happened was

  1. Cylc tried to log in to a different VM to run cylc psutil whilst I was reloading a workflow
  2. It couldn't find cylc (don't ask)
  3. It then output the error messages

It might help if the message explained that the command shown is the one used to start the workflow, rather than just printing what looks like a fairly random command. It would also be nice if it weren't output twice.
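As a rough illustration of the kind of message being asked for, the sketch below turns the return code of the remote command into a short explanation. It is not Cylc code: the function `explain_remote_failure` and its `purpose` parameter are hypothetical names invented for this example. It relies only on two general conventions: a shell returns 127 when it cannot find the command, and `ssh` returns 255 when the connection itself fails.

```python
def explain_remote_failure(host: str, returncode: int, purpose: str) -> str:
    """Build a one-line explanation for a failed remote command.

    Hypothetical helper for illustration only; not part of Cylc.
    """
    if returncode == 127:
        # Shell convention: the command could not be found.
        reason = "the command was not found on the remote host"
    elif returncode == 255:
        # ssh convention: ssh itself failed (e.g. could not connect).
        reason = "ssh could not connect or failed before running the command"
    else:
        reason = f"the remote command exited with status {returncode}"
    return f"Could not {purpose} on {host}: {reason}."


print(explain_remote_failure(
    "a-different-host", 127, "check whether the workflow is running"
))
```

With the return code from the output above (127), this would point at the missing `cylc` wrapper rather than leaving the reader to infer it from the `ssh` command line.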

oliver-sanders commented 3 months ago

This message is a pretty reasonable explanation of the situation:

Cannot determine whether workflow is running on a-different-host.

What's happening here is that a Cylc client attempted to contact the scheduler, but failed. This can be caused by:

As a safety check, the client will check that the scheduler process is still running by performing a process listing on the box where the scheduler was running. If it finds the same PID with the same CMD, it knows that the scheduler is running, but is not contactable for some reason. If it does not find the process, then it knows that the scheduler has crashed and removes the contact file so that other Cylc interfaces can see that the workflow is not running (preventing other clients attempting to contact it).

Under normal circumstances the check should not fail. If it does, it's likely a setup / installation / network problem.
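The decision made by the safety check can be sketched roughly as follows. This is a simplified illustration, not the actual Cylc implementation: `SchedulerState`, `check_scheduler` and the `process_listing` mapping are hypothetical names, and the listing is assumed to have already been fetched from the scheduler host (`None` stands for "the listing could not be obtained", as in the output above).

```python
from enum import Enum
from typing import Mapping, Optional


class SchedulerState(Enum):
    RUNNING = "process found: scheduler is running but not contactable"
    STOPPED = "process not found: scheduler has stopped, contact file can be removed"
    UNKNOWN = "process listing failed: cannot determine whether workflow is running"


def check_scheduler(
    contact_pid: int,
    contact_cmd: str,
    process_listing: Optional[Mapping[int, str]],
) -> SchedulerState:
    """Compare the PID and CMD from the contact file with a process listing.

    process_listing maps PID -> command line on the scheduler host,
    or is None if the listing could not be obtained (hypothetical convention).
    """
    if process_listing is None:
        # e.g. the remote command failed, as with the missing cylc wrapper.
        return SchedulerState.UNKNOWN
    if process_listing.get(contact_pid) == contact_cmd:
        # Same PID with the same CMD: the scheduler process is still there.
        return SchedulerState.RUNNING
    # PID missing, or reused by an unrelated command.
    return SchedulerState.STOPPED


listing = {4242: "cylc play workflow_name"}
print(check_scheduler(4242, "cylc play workflow_name", listing).name)
print(check_scheduler(4242, "cylc play workflow_name", None).name)
```

The third state is the one hit in this issue: the listing itself could not be run, so the client can neither confirm the scheduler is alive nor safely remove the contact file.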