cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

`cylc scan --ping` not removing contact files #5051

Open ScottWales opened 2 years ago

ScottWales commented 2 years ago

Describe the bug

According to its documentation, `cylc scan --ping` should remove contact files for suites it is not able to connect to:

  --ping                Test the connection to the flow. Scan normally just
                        reads flow contact files, but --ping forces a
                        connection to the scheduler and removes the contact
                        file if it is not found to be running (this can happen
                        if the scheduler gets killed and can't clean up after
                        itself).

However, if the server node cannot be contacted, only a warning is printed and the contact file remains:

# Server node cannot be contacted - session has ended
$ cylc scan --ping --verbose
2022-08-10T09:53:53+10:00 DEBUG - zmq:send {'command': 'graphql', 'args': {'request_string': 'query { workflows(ids: ["u-cp519/run10"]) { \nstatus\n } }',
    'variables': {}}, 'meta': {'prog': 'scan', 'host': 'ood-vn26', 'comms_method': 'zmq'}}
2022-08-10T09:53:58+10:00 DEBUG - $ ssh -oBatchMode=yes -oConnectTimeout=10 ood-vn3 env CYLC_VERSION=8.0.0 bash --login -c 'exec "$0" "$@"'
    /g/data/access/ngm/miniconda3/envs/cylc-8.0/ubin/cylc psutil  # returned 255
    Access denied by pam_slurm_adopt: you have no active jobs on this node
    Connection closed by 10.0.128.131 port 22

2022-08-10T09:53:58+10:00 WARNING - Cannot determine whether workflow is running on ood-vn3.
    /g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/bin/python /g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/bin/cylc play u-cp519

# Expect this file to be removed after `cylc scan --ping` failed to connect
$ ls ~/cylc-run/u-cp519/run10/.service/contact
/home/562/saw562/cylc-run/u-cp519/run10/.service/contact

Release version(s) and/or repository branch(es) affected?

$ cylc --version
8.0.0

Steps to reproduce the bug

  1. Have two Cylc servers on different computers sharing the same filesystem
  2. Start a Suite on server 1, then terminate server 1 while the suite is still running so the contact file remains
  3. On server 2, run `cylc scan --ping` to remove the contact file. Currently this does not remove the contact file.
  4. On server 2, run `cylc clean` to remove files from the failed run. Currently this does not remove the files, because the contact file is still present and server 1 cannot be contacted.

Expected behavior

The contact file should be removed after Cylc fails to connect to a running server.

Additional context

My main goal is to be able to run `cylc clean` to clean up the failed run. The clean fails with an error if the contact file remains, so I'm trying to use `cylc scan --ping` to remove it.

We are running Cylc servers on a system that assigns you a specific node for your session, and doesn't allow you to connect to nodes you don't have a running session on. We know that if the ssh connection fails then there is not an active session and hence no cylc server can be running.

Pull requests welcome! This is an Open Source project - please consider contributing a bug fix yourself (please read CONTRIBUTING.md before starting any work though).

dpmatthews commented 2 years ago

The current behaviour is deliberate - ssh could fail for other reasons so we don't remove the contact file unless we can connect to the server to check whether the workflow is still running.
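The reasoning can be sketched as a three-way outcome. This is a minimal illustration, not Cylc's actual implementation; `HostCheck`, `classify_check` and `should_remove_contact_file` are hypothetical names. It relies on `ssh` exiting with status 255 when the connection itself fails (as in the log in the report), in which case nothing is known about the remote process.

```python
from enum import Enum


class HostCheck(Enum):
    """Possible outcomes of checking for a scheduler process over SSH."""
    RUNNING = "running"
    NOT_RUNNING = "not running"
    UNKNOWN = "unknown"


def classify_check(ssh_returncode: int, process_found: bool) -> HostCheck:
    """Classify the result of a remote process check (hypothetical helper).

    ssh exits with 255 when the connection itself fails, so in that case
    nothing is known about the remote process.
    """
    if ssh_returncode == 255:
        return HostCheck.UNKNOWN
    if ssh_returncode != 0:
        # The remote command ran but failed; still no reliable answer.
        return HostCheck.UNKNOWN
    return HostCheck.RUNNING if process_found else HostCheck.NOT_RUNNING


def should_remove_contact_file(result: HostCheck) -> bool:
    """Only remove the contact file when the scheduler is known to be gone."""
    return result is HostCheck.NOT_RUNNING
```

Under this sketch, the failed connection in the report maps to `UNKNOWN`, so the contact file is kept.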

We need to think about the best way to address your requirement.

hjoliver commented 2 years ago

> We are running Cylc servers on a system that assigns you a specific node for your session, and doesn't allow you to connect to nodes you don't have a running session on. We know that if the ssh connection fails then there is not an active session and hence no cylc server can be running.

Ha, I've run into the same problem for NeSI HPC users in NZ who come in via JupyterHub to an interactive Slurm session. Once the session ends, you don't have access to the node that it ran on.

ScottWales commented 2 years ago

Would a potential solution be to add a site config option that treats a session as not running if it's not contactable by SSH? That could then apply to all tools, allowing cylc clean to be used directly without needing to remove the contact files.
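To make the idea concrete, here is a rough sketch of how such an option might change the decision. Everything in it is hypothetical: `assume_stopped_if_unreachable` is an invented name, not an existing Cylc setting, and the function is not part of Cylc. The only fact assumed is that `ssh` exits with status 255 when it cannot connect.

```python
SSH_CONNECTION_FAILED = 255  # exit status ssh uses for its own errors


def workflow_considered_stopped(
    ssh_returncode: int,
    process_found: bool,
    assume_stopped_if_unreachable: bool = False,
) -> bool:
    """Decide whether a workflow may be treated as stopped (illustrative).

    assume_stopped_if_unreachable stands in for a hypothetical site
    config option; it defaults to False to keep the cautious behaviour.
    """
    if ssh_returncode == SSH_CONNECTION_FAILED:
        # Host unreachable: only sites that opt in treat this as stopped.
        return assume_stopped_if_unreachable
    if ssh_returncode != 0:
        # The remote check itself failed, so nothing can be concluded.
        return False
    return not process_found
```

With the option off, an unreachable host leaves the workflow marked as possibly running; with it on, tools such as `cylc clean` could proceed.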

hjoliver commented 2 years ago

Yeah, that could be a solution. I'm not sure we could do anything else really, given inability to access the original host.

I've added this to the agenda for tonight's project meeting (8pm NZ time, I can forward Teams invite if you'd like to attend - but no pressure!)

hjoliver commented 2 years ago

Actually @ScottWales - on my system, if this happens we get an error message saying something like "access denied because you don't have any processes running on this node". Do you get anything similar? In the unlikely event that this is a standard response, we could presumably parse it and infer that nothing is running there, and so delete the contact file.
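As a rough sketch of the parsing idea, the check below matches the `pam_slurm_adopt` message that appears in the log in the original report. The function name `denied_for_no_active_jobs` is hypothetical, and the message wording is assumed to be stable, which may not hold across sites or Slurm versions.

```python
import re

# Message text copied from the ssh output in the original report;
# other sites or Slurm versions may word it differently (assumption).
NO_JOBS_PATTERN = re.compile(
    r"Access denied by pam_slurm_adopt: you have no active jobs on this node"
)


def denied_for_no_active_jobs(ssh_stderr: str) -> bool:
    """Return True if ssh was refused because the user has no jobs on the node."""
    return NO_JOBS_PATTERN.search(ssh_stderr) is not None


sample_stderr = (
    "Access denied by pam_slurm_adopt: you have no active jobs on this node\n"
    "Connection closed by 10.0.128.131 port 22\n"
)
print(denied_for_no_active_jobs(sample_stderr))
```

A generic failure such as a timeout would not match, so the contact file would still be kept in that case.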

ScottWales commented 2 years ago

The ssh command returns

    Access denied by pam_slurm_adopt: you have no active jobs on this node
    Connection closed by 10.0.128.131 port 22

hjoliver commented 2 years ago

Interesting, I'll compare my result later ...

ScottWales commented 2 years ago

And sure, I'd like to come along to the meeting, it would be good to check we've got our installation set up properly. My email's now scott.wales at bom.gov.au

hjoliver commented 2 years ago

Invite forwarded. (Also an invite to the "Cylc General" Element chat room, in case the Teams invite borks for some reason).

oliver-sanders commented 2 years ago

See also #5013