cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

gethostbyname_ex error when running Cylc 8 server behind SSH bastion #4981

Closed ScottWales closed 1 month ago

ScottWales commented 2 years ago

Describe the bug

We are testing a new environment for our Cylc servers, in which the servers run on a small cluster (OOD) separate from the main HPC. A user requests a node on the OOD system, e.g. ood-vn17, then connects to that node by SSH via a bastion server in order to run Cylc. Individual OOD nodes are not externally accessible except through the bastion server.

I have set up Cylc to use communication method = ssh for the HPC platform, and have configured SSH on the HPC so that ssh ood-vn17 automatically tunnels through the bastion server using ProxyJump.

When running a workflow, however, communications from the HPC to the Cylc server fail. Setting debug=true in a job suite gives the error message:

Traceback (most recent call last):
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/task_message.py", line 107, in send_messages
    pclient = get_client(workflow)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client_factory.py", line 55, in get_client
    return get_runtime_client(get_comms_method(), workflow, timeout=timeout)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client_factory.py", line 49, in get_runtime_client
    return WorkflowRuntimeClient(workflow, timeout=timeout)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/ssh_client.py", line 52, in __init__
    self.host, _, _ = get_location(workflow)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/__init__.py", line 83, in get_location
    host = get_fqdn_by_host(host)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/hostuserutil.py", line 265, in get_fqdn_by_host
    return HostUtil.get_inst().get_fqdn_by_host(target)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/hostuserutil.py", line 171, in get_fqdn_by_host
    return self._get_host_info(target)[0]
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/hostuserutil.py", line 135, in _get_host_info
    self._host_exs[target] = socket.gethostbyname_ex(target)
socket.gaierror: [Errno -2] Name or service not known: 'ood-vn17.z48'

gethostbyname_ex is expected to fail here, as ood-vn17 is not network-accessible from the HPC.

Release version(s) and/or repository branch(es) affected?

$ cylc --version
8.0rc3

Steps to reproduce the bug

Configure the network so that the Cylc server can't be seen from the platform, e.g. using an SSH bastion, and set the platform communication method to ssh.

Expected behavior

If I submit a Cylc 8 task from my OOD node with communication method = ssh and an appropriate SSH configuration for ProxyJump, I expect communications to work by tunnelling through the bastion server, without raising errors from other network connection types.


oliver-sanders commented 2 years ago

This isn't a use case we currently support but it should be simple to get it working.

If you are able to patch your Cylc installation I think this should permit your use case:

diff --git a/cylc/flow/network/__init__.py b/cylc/flow/network/__init__.py
index e456de3fb..3f4e91e70 100644
--- a/cylc/flow/network/__init__.py
+++ b/cylc/flow/network/__init__.py
@@ -16,8 +16,10 @@
 """Package for network interfaces to Cylc scheduler objects."""

 import asyncio
+from contextlib import suppress
 import getpass
 import json
+import socket

 import zmq
 import zmq.asyncio
@@ -78,7 +80,8 @@ def get_location(workflow: str):
         raise WorkflowStopped(workflow)

     host = contact[ContactFileFields.HOST]
-    host = get_fqdn_by_host(host)
+    with suppress(socket.gaierror):
+        host = get_fqdn_by_host(host)
     port = int(contact[ContactFileFields.PORT])
     if ContactFileFields.PUBLISH_PORT in contact:
         pub_port = int(contact[ContactFileFields.PUBLISH_PORT])
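
The idea behind the patch can be tried outside Cylc with a small standalone sketch. The names `fqdn_via_dns` and `resolve_or_keep` below are hypothetical and are not part of Cylc; the sketch only assumes, as the traceback shows, that the lookup goes through `socket.gethostbyname_ex` and raises `socket.gaierror` when the name cannot be resolved.

```python
from contextlib import suppress
import socket
from typing import Callable


def fqdn_via_dns(host: str) -> str:
    """Return the canonical name reported by the resolver.

    Raises socket.gaierror if the name cannot be resolved.
    """
    return socket.gethostbyname_ex(host)[0]


def resolve_or_keep(
    host: str,
    resolver: Callable[[str], str] = fqdn_via_dns,
) -> str:
    """Try to canonicalise `host`; keep the original name on failure.

    Hypothetical helper mirroring the patched lines: a failed lookup is
    ignored so that SSH (e.g. via ProxyJump) can still be given the
    name recorded in the contact file.
    """
    with suppress(socket.gaierror):
        host = resolver(host)
    return host
```

With this pattern an unresolvable name such as `ood-vn17.z48` is passed through unchanged instead of aborting the client, leaving the SSH configuration to decide how to reach the host.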

ScottWales commented 2 years ago

Thanks, this has worked well. I also had to modify the host self-identification so that it uses the hostname rather than the FQDN. If no issues come up in our testing, I'll make a pull request with both changes.
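
The self-identification change itself is not shown in this thread. As a rough illustration only, the difference between advertising a short hostname and an FQDN could look like the sketch below. The helpers `short_hostname` and `self_identify` are hypothetical and do not correspond to Cylc's actual code or configuration.

```python
import socket


def short_hostname(name: str) -> str:
    """Strip any domain suffix, e.g. 'ood-vn17.z48' -> 'ood-vn17'."""
    return name.split(".", 1)[0]


def self_identify(use_fqdn: bool = False) -> str:
    """Return the name this host would advertise to jobs.

    Hypothetical helper: with use_fqdn=False the short hostname is
    returned, which is presumably the name the SSH configuration on
    the HPC matches (e.g. 'ood-vn17').
    """
    if use_fqdn:
        return socket.getfqdn()
    return short_hostname(socket.gethostname())
```

Advertising the short name presumably matters here because `ssh ood-vn17` is the form the HPC's SSH configuration is set up to tunnel, whereas the failing lookup in the traceback was for `ood-vn17.z48`.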

oliver-sanders commented 1 year ago

@ScottWales how did it work out?

oliver-sanders commented 1 month ago

@ScottWales, I'm going to close this issue, though we're still happy to work towards it as needed.

There is a new umbrella issue covering host identification in general which I will link this into (#3768).