cylc / cylc-doc

Documentation (User Guide, Cheat Sheets, etc.) for the Cylc Workflow Engine.
https://cylc.github.io/cylc-doc/
GNU General Public License v3.0

ssh stricthostkeychecking #512

Open hjoliver opened 2 years ago

hjoliver commented 2 years ago

This is a problem many new users run into, in my experience. It's not a "Cylc problem" as such, but Cylc depends critically on ssh, so we should at least document the issue and how to handle it.

When you ssh to another host, that host's key must already appear in a central or user known_hosts file. Otherwise the connection fails with "Host key verification failed" (BatchMode=yes), or you have to respond to an interactive prompt to add the host to your user .ssh/known_hosts file (BatchMode=no).

I think clusters typically (always?) have a central known_hosts file that avoids this issue for cluster-internal ssh connections. But users will still run into the problem, with Cylc, if they have to ssh to hosts that the central file does not cover.

(At NIWA, this applies between user desktop machines and HPC platforms; and between several somewhat-distinct HPC clusters).

# Example: ssh from a host on one HPC platform (mahuika) to a host on another (maui)

oliverh@mahuika02:~$ ssh w-cylc01 hostname
The authenticity of host 'w-cylc01 (xxx.xxx.xxx.xxx)' can't be established.
ECDSA key fingerprint is SHA256:xxx.
ECDSA key fingerprint is MD5:xxx.
Are you sure you want to continue connecting (yes/no)? no
Host key verification failed.

oliverh@mahuika02:~$ ssh -o BatchMode=yes w-cylc01 hostname
Host key verification failed.

oliverh@mahuika02:~$ ssh -o BatchMode=yes -o stricthostkeychecking=no w-cylc01 hostname
Warning: Permanently added 'w-cylc01,xxx.xxx.xxx.xxx' (ECDSA) to the list of known hosts.
w-cylc01.maui.niwa.co.nz

oliverh@mahuika02:~$ ssh -o BatchMode=yes w-cylc01 hostname
w-cylc01.maui.niwa.co.nz
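
An alternative to turning strict checking off is to record the host key by hand after checking its fingerprint. The following is a minimal sketch, assuming OpenSSH's `ssh-keyscan` and `ssh-keygen` are installed; `fetch_host_keys` and `trust_host_keys` are hypothetical helper names, not Cylc commands.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Cylc) for adding a host key by hand.

# Fetch a host's public keys into a scratch file and print their
# fingerprints, so they can be compared with values obtained out of band
# (e.g. from the site administrators).
fetch_host_keys() {
    local host="$1" keyfile="$2"
    ssh-keyscan -T 10 "$host" > "$keyfile" 2>/dev/null
    # An empty file means no keys were retrieved.
    [ -s "$keyfile" ] || return 1
    ssh-keygen -l -f "$keyfile"
}

# Append already-verified keys to a known_hosts file
# (defaults to the user's ~/.ssh/known_hosts).
trust_host_keys() {
    local keyfile="$1" known_hosts="${2:-$HOME/.ssh/known_hosts}"
    mkdir -p "$(dirname "$known_hosts")"
    cat "$keyfile" >> "$known_hosts"
}

# Example (host name taken from the session above):
#   fetch_host_keys w-cylc01 /tmp/w-cylc01.keys
#   trust_host_keys /tmp/w-cylc01.keys
```

On OpenSSH 7.6 or later, `-o StrictHostKeyChecking=accept-new` is another option: it adds unknown hosts automatically but still refuses hosts whose key has changed.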

Cylc reports the error directly at scheduler start-up (although many users still have no idea what it means):

oliverh@mahuika02:~/cylc-src/bug$ cylc play -n bug
Host key verification failed.

For job platforms the error is much harder to find; debugging it would be a big ask for a new user.

Scheduler log:

$ cylc play --host=localhost -n bug
...
2022-07-05T01:40:46Z INFO - platform: cylc-jobs - remote init (on w-cylc02)
2022-07-05T01:40:47Z WARNING - platform: cylc-jobs - Could not connect to w-cylc02.
    * w-cylc02 has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
...
2022-07-05T01:40:49Z ERROR - platform: cylc-jobs - initialisation did not complete
    Unable to find valid host for cylc-jobs
2022-07-05T01:40:50Z INFO - [1/foo preparing job:01 flows:1] host=w-cylc02
2022-07-05T01:40:50Z ERROR - [jobs-submit cmd] (init w-cylc02)
    [jobs-submit ret_code] 1
    [jobs-submit err] REMOTE INIT FAILED
2022-07-05T01:40:50Z ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1

Job activity file:

oliverh@mahuika02:~/cylc-src/bug$ cylc cat-log bug -f a bug//1/foo
[jobs-submit cmd] (init w-cylc02)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED
[jobs-submit cmd] (remote init)
[jobs-submit ret_code] 1

And finally, the actual error appears in the job err file (this location is a bit surprising, since the job was never submitted):

oliverh@mahuika02:~/cylc-src/bug$ cylc cat-log bug -f e bug//1/foo
Host key verification failed.
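
One way to shortcut this log trawl is to test each platform host directly with a non-interactive ssh connection, mirroring the BatchMode=yes case shown earlier. This is only a sketch; `check_ssh_hosts` is a hypothetical helper, not a Cylc command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cylc): report which hosts fail a
# non-interactive ssh connection, and why.
check_ssh_hosts() {
    local host err status=0
    for host in "$@"; do
        # BatchMode=yes disables prompts, so an unknown host key fails
        # immediately; -n stops ssh from reading stdin.
        if err=$(ssh -n -o BatchMode=yes -o ConnectTimeout=10 "$host" true 2>&1); then
            echo "OK   $host"
        else
            echo "FAIL $host: $err"
            status=1
        fi
    done
    return "$status"
}

# Example (host names taken from the logs above):
#   check_ssh_hosts w-cylc01 w-cylc02
```

The function returns non-zero if any host fails, so it can also be used as a pre-flight check in a script.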

hjoliver commented 2 years ago

QUESTIONS

hjoliver commented 2 years ago

(Adding to 8.x for now, but it would be good to sort this out quickly as it can affect new user uptake).

oliver-sanders commented 2 years ago

> should Cylc automatically use ssh -o stricthostkeychecking=no?

Cylc does not set stricthostkeychecking explicitly, so it relies on the user's SSH config. The SSH command is configurable, so sites can set this in the Cylc config.
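
As an illustration, a site could override the platform's `ssh command` in `global.cylc`. This is a sketch under assumptions: the platform and host names are taken from the scheduler log above, the base command shown is assumed to match Cylc's default, and `accept-new` needs OpenSSH 7.6 or later. Whether to relax host key checking at all is a site security decision.

```
# global.cylc (illustrative fragment, not a recommended site default)
[platforms]
    [[cylc-jobs]]
        hosts = w-cylc02
        # Assumed default: ssh -oBatchMode=yes -oConnectTimeout=10
        # The StrictHostKeyChecking option is added for illustration.
        ssh command = ssh -oBatchMode=yes -oConnectTimeout=10 -oStrictHostKeyChecking=accept-new
```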

oliver-sanders commented 2 years ago

IMO the current behaviour is correct (stricthostkeychecking by default, can be configured otherwise).

If agreed, we should turn this into a documentation issue and make sure this is properly covered in the platform configuration section.

hjoliver commented 2 years ago

Agreed, moving to cylc-doc...

hjoliver commented 2 years ago

The same issue can occur when cylc play puts a scheduler on a run host.

We need to document that the ssh command used for that comes from the localhost platform settings.
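
A sketch of what that might look like in `global.cylc`, assuming the localhost platform accepts the same `ssh command` setting as any other platform (the option values are illustrative only):

```
# global.cylc (illustrative fragment)
[platforms]
    [[localhost]]
        # Used when cylc play starts the scheduler on a run host.
        ssh command = ssh -oBatchMode=yes -oConnectTimeout=10 -oStrictHostKeyChecking=accept-new
```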