Open ScottWales opened 2 years ago
The current behaviour is deliberate - ssh could fail for other reasons so we don't remove the contact file unless we can connect to the server to check whether the workflow is still running.
We need to think about the best way to address your requirement.
We are running Cylc servers on a system that assigns you a specific node for your session, and doesn't allow you to connect to nodes you don't have a running session on. We know that if the ssh connection fails then there is not an active session and hence no cylc server can be running.
Ha, I've run into the same problem for NeSI HPC users in NZ who come in via JupyterHub to an interactive Slurm session. Once the session ends, you don't have access to the node that it ran on.
Would a potential solution be to add a site config option that treats a session as not running if it's not contactable by SSH? That could then apply to all tools, allowing cylc clean to be used directly without needing to remove the contact files.
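A minimal sketch of how such an opt-in could gate contact file removal. Everything here is hypothetical and only illustrates the idea: the names HostCheck, SitePolicy, unreachable_means_stopped and may_remove_contact_file are invented for this example and are not existing Cylc settings or functions.

```python
from dataclasses import dataclass
from enum import Enum


class HostCheck(Enum):
    """Outcome of trying to check the workflow on its original host."""
    RUNNING = "running"          # connected, workflow process found
    STOPPED = "stopped"          # connected, workflow process not found
    UNREACHABLE = "unreachable"  # could not connect to the host at all


@dataclass(frozen=True)
class SitePolicy:
    # Hypothetical site option (not a real Cylc setting). Defaults to
    # False so the current, deliberate behaviour is unchanged unless a
    # site opts in.
    unreachable_means_stopped: bool = False


def may_remove_contact_file(check: HostCheck, policy: SitePolicy) -> bool:
    """Decide whether it is safe to remove the contact file."""
    if check is HostCheck.STOPPED:
        return True
    if check is HostCheck.UNREACHABLE:
        # Only sites that know "no SSH access" implies "no session"
        # should enable this.
        return policy.unreachable_means_stopped
    return False
```

Keeping the default off preserves the existing behaviour for sites where ssh can fail for unrelated reasons.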
Yeah, that could be a solution. I'm not sure we could do anything else really, given inability to access the original host.
I've added this to the agenda for tonight's project meeting (8pm NZ time, I can forward Teams invite if you'd like to attend - but no pressure!)
Actually @ScottWales - on my system, if this happens we get an error message saying something like "access denied because you don't have any processes running on this node". Do you get anything similar? In the unlikely event that that is a standard response, we could presumably parse it and infer that nothing is running there, and so delete the contact file.
The ssh command returns
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.0.128.131 port 22
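For illustration, a rough sketch of the parsing idea, assuming the message text quoted above. The marker string is taken from that output and may well differ between sites and Slurm versions; the function names are made up for this example and are not part of Cylc.

```python
import subprocess

# Assumption: this text comes from the ssh output quoted in this thread.
# Other sites may emit something different.
NO_JOBS_MARKER = "pam_slurm_adopt: you have no active jobs on this node"


def denied_for_no_active_jobs(stderr: str) -> bool:
    """Return True if ssh stderr shows the pam_slurm_adopt refusal."""
    return NO_JOBS_MARKER in stderr


def host_has_no_session(host: str, timeout: float = 10.0) -> bool:
    """Try a no-op ssh command and report whether access was refused
    because the user has no active jobs on the node.

    Returns False for any other failure (timeout, DNS error, ssh not
    installed), since those do not tell us whether a server is running.
    """
    try:
        result = subprocess.run(
            # BatchMode=yes stops ssh from prompting for a password.
            ["ssh", "-o", "BatchMode=yes", host, "true"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode != 0 and denied_for_no_active_jobs(result.stderr)
```

Matching on a specific message rather than on any ssh failure keeps the cautious default: only the one refusal that implies "no session here" would lead to removing the contact file.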
Interesting, I'll compare my result later ...
And sure, I'd like to come along to the meeting, it would be good to check we've got our installation set up properly. My email's now scott.wales at bom.gov.au
Invite forwarded. (Also an invite to the "Cylc General" Element chat room, in case the Teams invite borks for some reason).
See also #5013
Describe the bug
According to its documentation, cylc scan --ping should remove contact files for suites it's not able to connect to. If the server node cannot be contacted, however, only a warning is printed and the contact file remains.
Release version(s) and/or repository branch(es) affected?
Steps to reproduce the bug
Run cylc scan --poll to remove the contact file; currently this does not remove the contact file.
Run cylc clean to remove files from the failed run; currently this does not remove files, as the contact file is still present and server 1 cannot be contacted.
Expected behavior
The contact file should be removed after Cylc fails to connect to a running server
Additional context
My main goal is to be able to run cylc clean to clean up the failed run. The clean fails with an error if the contact file remains, so I'm trying to use cylc scan --ping to remove it.
We are running Cylc servers on a system that assigns you a specific node for your session, and doesn't allow you to connect to nodes you don't have a running session on. We know that if the ssh connection fails then there is not an active session and hence no cylc server can be running.
Pull requests welcome! This is an Open Source project - please consider contributing a bug fix yourself (please read CONTRIBUTING.md before starting any work though).