Closed ScottWales closed 1 month ago
This isn't a use case we currently support, but it should be simple to get working.
If you are able to patch your Cylc installation, I think this should permit your use case:
diff --git a/cylc/flow/network/__init__.py b/cylc/flow/network/__init__.py
index e456de3fb..3f4e91e70 100644
--- a/cylc/flow/network/__init__.py
+++ b/cylc/flow/network/__init__.py
@@ -16,8 +16,10 @@
 """Package for network interfaces to Cylc scheduler objects."""
 
 import asyncio
+from contextlib import suppress
 import getpass
 import json
+import socket
 
 import zmq
 import zmq.asyncio
@@ -78,7 +80,8 @@ def get_location(workflow: str):
         raise WorkflowStopped(workflow)
 
     host = contact[ContactFileFields.HOST]
-    host = get_fqdn_by_host(host)
+    with suppress(socket.gaierror):
+        host = get_fqdn_by_host(host)
     port = int(contact[ContactFileFields.PORT])
     if ContactFileFields.PUBLISH_PORT in contact:
         pub_port = int(contact[ContactFileFields.PUBLISH_PORT])
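For anyone who wants to try the idea outside of Cylc, the fallback in the patch can be sketched on its own roughly as follows. This is only an illustration: `lookup_fqdn` and `resolve_or_keep` are hypothetical names, and `lookup_fqdn` is an assumed stand-in for Cylc's `get_fqdn_by_host`, not its actual implementation.

```python
import socket
from contextlib import suppress


def lookup_fqdn(host):
    """Hypothetical stand-in for Cylc's get_fqdn_by_host (assumption)."""
    # gethostbyname_ex returns (canonical_name, aliases, ip_addresses)
    return socket.gethostbyname_ex(host)[0]


def resolve_or_keep(host, resolver=lookup_fqdn):
    """Return the resolved name, or the name as given if the lookup fails."""
    # If the resolver raises socket.gaierror the assignment never happens,
    # so `host` keeps the value read from the contact file.
    with suppress(socket.gaierror):
        host = resolver(host)
    return host
```

With this shape, a host such as `ood-vn17` that cannot be resolved from the HPC is passed through unchanged, which leaves the SSH configuration to work out how to reach it.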
Thanks, this has worked well. I also had to modify the host self-identification so that it uses the hostname rather than the FQDN. If no issues come up in our testing, I'll make a pull request with both changes.
@ScottWales how did it work out?
@ScottWales, I'm going to close this PR, though we're still happy to work towards it as needed.
There is a new umbrella issue covering host identification in general which I will link this into (#3768).
Describe the bug
We are testing a new environment for our Cylc servers, where the Cylc servers run on a small cluster, OOD, separate from the main HPC. A user requests a node on the OOD system, e.g. ood-vn17, then connects to that node by ssh via a bastion server in order to run Cylc. Individual OOD nodes are not externally accessible without going through the bastion server.
I have set up Cylc to use communication method = ssh for the HPC platform, and have SSH configured on the HPC so that ssh ood-vn17 will automatically tunnel through the bastion server using ProxyJump.
When running a workflow, however, communications fail from the HPC to the Cylc server. Setting debug=true in a job suite gives an error message. gethostbyname_ex is expected to fail here, as ood-vn17 is not network accessible from the HPC.
Release version(s) and/or repository branch(es) affected?
Steps to reproduce the bug
Configure the network so that the Cylc server can't be seen from the platform, e.g. using an SSH bastion, and set the platform communication method to ssh.
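To confirm that a test environment really reproduces the condition, a small check like the one below could be run on the platform. It is a sketch only: `can_resolve` is a hypothetical helper, and it tests name resolution alone, not whether SSH can reach the host through the bastion.

```python
import socket


def can_resolve(host, resolver=socket.gethostbyname_ex):
    """Return True if `host` resolves on this machine, False otherwise."""
    try:
        resolver(host)
    except socket.gaierror:
        # Same exception type that the patch above suppresses.
        return False
    return True
```

If `can_resolve("ood-vn17")` returns False on the HPC while `ssh ood-vn17` still connects via ProxyJump, the setup matches the one described in this report.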
Expected behavior
If I submit a Cylc 8 task from my OOD node with communication method = ssh and an appropriate SSH configuration for ProxyJump, I expect communications to work by tunnelling through the bastion server, without errors from other network connection types.
Screenshots
Additional context
Pull requests welcome! This is an Open Source project - please consider contributing a bug fix yourself (please read CONTRIBUTING.md before starting any work though).