cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0

Host identification #3768

Open hjoliver opened 4 years ago

hjoliver commented 4 years ago

Description below pasted verbatim from Element chat room (@oliver-sanders please edit if desired).

See also comments on #3766 and #3595 which this issue supersedes.

See also:

TLDR;

hold fire on FQDN/localhost changes for the moment and wait until the dust has settled after the platforms work. Have a tidy up of the hostuserutil and other remote logic to see what's left and find out what niche requirements we might still have, hopefully not many, then work out how best to implement whatever checks we still require.

Full post:

Just had a chat with Dave about FQDN, localhost self-identification, etc relating to #3766 #3595:

At present we rely on the premise that for each host there exists a unique global identifier, its FQDN, and that this identifier can be obtained from anywhere on the network. This system is nice and universal, so we can use it for all purposes, e.g. comparing localhost to remote hosts, which saves us from using separate logic for different purposes.

Unfortunately the assumption that an FQDN is a unique global identifier for every host is flawed no matter what method we use to retrieve the FQDN. FQDN and DNS issues have been a consistent source of pain for a long time (I have to apply a patch just to get Cylc to run on my box) in need of a solid solution.

Off the top of my head, we use this FQDN logic for things like:

  • Filtering out duplicate hosts.
  • Reducing SSH'es by batching them together by hostname.
  • Determining whether X is an identifier for localhost (e.g. is this host in the list of condemned hosts or working out whether we need to SSH or not)
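To make the three uses above concrete, here is a minimal Python sketch. It is not the hostuserutil code: the function names, the `resolve` callable and the toy alias table are all hypothetical, introduced only for illustration. Each helper takes the resolver as an argument, so the same logic works whether identifiers come from DNS, configuration, or anything else.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical alias table standing in for whatever maps a host name
# to its canonical identifier (DNS, site configuration, ...).
TOY_ALIASES = {
    "node1": "node1.example.com",
    "node1.example.com": "node1.example.com",
    "node2": "node2.example.com",
}


def toy_resolve(host: str) -> str:
    """Return the canonical identifier for host (toy lookup, not DNS)."""
    return TOY_ALIASES.get(host, host)


def dedupe_hosts(hosts: Iterable[str], resolve: Callable[[str], str]) -> List[str]:
    """Keep the first name seen for each canonical identifier, in order."""
    seen = set()
    unique = []
    for host in hosts:
        ident = resolve(host)
        if ident not in seen:
            seen.add(ident)
            unique.append(host)
    return unique


def batch_by_host(
    jobs: Iterable[Tuple[str, str]], resolve: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group (host, job) pairs by identifier so each host needs one SSH call."""
    batches = defaultdict(list)
    for host, job in jobs:
        batches[resolve(host)].append(job)
    return dict(batches)


def is_localhost(host: str, resolve: Callable[[str], str], local_ident: str) -> bool:
    """True if host resolves to the identifier this machine uses for itself."""
    return resolve(host) == local_ident
```

All three helpers are only as reliable as `resolve`: if two machines disagree about a host's identifier, every one of them gives the wrong answer, which is the failure mode described in this issue.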

Once the platforms work is merged we will have "configured away" the need to compare remote host FQDNs, hopefully completely. The matter of filtering out duplicate hosts from a list is something we could do away with, since it is a configuration error, not a Cylc problem.

I think (perhaps with a bit of fiddling) we might be left with only the problem of localhost self-identification (third bullet point above), which may enable us to ditch FQDN logic completely in favour of a more reliable method.

tldr; So, my suggestion would be to hold fire on FQDN/localhost changes for the moment and wait until the dust has settled after the platforms work. Have a tidy up of the hostuserutil and other remote logic to see what's left and find out what niche requirements we might still have, hopefully not many, then work out how best to implement whatever checks we still require.

oliver-sanders commented 3 years ago

[update] 2021-08

Platforms work done, there is still the requirement for two interfaces:

So there is scope for simplification. I started looking into this; however, it gets messy and I chickened out in order to get higher priority work done.

Here's how I think the user-facing interfaces could look:

https://github.com/oliver-sanders/cylc-flow/blob/db7e7eeab9e792675cafea39b198c57bad618a4f/cylc/flow/cfgspec/globalcfg.py#L235-L291

And here's how the cylc.flow.hostuserutil module could be re-written:

https://github.com/oliver-sanders/cylc-flow/blob/dns/cylc/flow/network/hostname.py

Propose bumping to 8.x and addressing when the time/demand allows.

hjoliver commented 3 years ago

Bumped to 8.x

oliver-sanders commented 8 months ago

This has recently been flagged again in https://github.com/cylc/cylc-flow/issues/6005, https://github.com/cylc/cylc-flow/issues/6004

By default, Cylc uses server FQDNs to identify servers, and we rely on these FQDNs being 100% consistent across the network.

I.e. if a host self-identifies as abc.def, it should also be identified as abc.def from any other host on the network. Whilst this might be a reasonable assumption, and true at most of the major Cylc sites, sadly it is not always the case. HPC networking can be a tad eccentric and those using the HPC platform might have no control over its setup.

For examples of how inconsistent DNS setups can be, even on simple platforms, see this issue: https://github.com/cylc/cylc-flow/issues/3595

This can be exacerbated by the Python socket interfaces potentially changing behaviour between builds, which is part of why hostname -f may differ from socket.getfqdn on different platforms.
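One way to see whether a given machine is affected is to ask both sources and compare. The sketch below is illustrative only; the function names are hypothetical and not part of Cylc. It uses the standard-library `socket.getfqdn()` and shells out to `hostname -f`, returning `None` for the latter where the command is missing or fails.

```python
import shutil
import socket
import subprocess
from typing import Dict, Optional, Union


def fqdn_from_python() -> str:
    """FQDN as reported by the Python standard library."""
    return socket.getfqdn()


def fqdn_from_command() -> Optional[str]:
    """FQDN as reported by `hostname -f`, or None if unavailable."""
    if shutil.which("hostname") is None:
        return None
    try:
        proc = subprocess.run(
            ["hostname", "-f"],
            capture_output=True,
            text=True,
            timeout=5,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    out = proc.stdout.strip()
    return out if proc.returncode == 0 and out else None


def compare_fqdn_sources() -> Dict[str, Union[str, bool, None]]:
    """Report both answers and whether they agree on this machine."""
    py_name = fqdn_from_python()
    cmd_name = fqdn_from_command()
    return {
        "socket.getfqdn": py_name,
        "hostname -f": cmd_name,
        "agree": cmd_name is not None and py_name == cmd_name,
    }


if __name__ == "__main__":
    for key, value in compare_fqdn_sources().items():
        print(f"{key}: {value}")
```

Running this on each host involved (scheduler host, job hosts, login nodes) shows where the two answers diverge. It does not check what other hosts on the network call this machine, which is the other half of the consistency requirement.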

We have made multiple attempts to come up with an approach that works for everyone but, sadly, we have failed.

I suggest that we should re-write the hostuserutil module that provides Cylc's DNS functionality so that the base methods that are used to identify servers are user configurable. We should also take the opportunity to review the use of FQDN host names across Cylc to see if there is anything we can do to loosen the requirement for fully consistent DNS.

In theory, there's no reason why we couldn't provide a solution that uses hostname -f to determine the FQDN (but caches the result to avoid repeat calls, of course).
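As a rough illustration of that idea, the sketch below selects an FQDN method by name and caches its result. Everything here is an assumption for illustration: the `METHODS` table, its keys and `make_cached_fqdn` are hypothetical and are not the interface on the dns branch or any existing Cylc configuration.

```python
import socket
import subprocess
from functools import lru_cache
from typing import Callable, Dict


def _hostname_f() -> str:
    """Ask the system `hostname -f` command for the FQDN."""
    proc = subprocess.run(
        ["hostname", "-f"],
        capture_output=True,
        text=True,
        timeout=5,
        check=True,
    )
    return proc.stdout.strip()


# Hypothetical registry: a site would pick one of these keys in its config.
METHODS: Dict[str, Callable[[], str]] = {
    "hostname -f": _hostname_f,
    "socket.getfqdn": socket.getfqdn,
}


def make_cached_fqdn(method: Callable[[], str]) -> Callable[[], str]:
    """Wrap method so it runs at most once; later calls reuse the result."""

    @lru_cache(maxsize=None)
    def get_fqdn() -> str:
        return method()

    return get_fqdn


if __name__ == "__main__":
    # Assumes `hostname -f` is available on this machine.
    get_fqdn = make_cached_fqdn(METHODS["hostname -f"])
    print(get_fqdn())  # runs the subprocess
    print(get_fqdn())  # served from the cache
```

Caching per process assumes the host's name does not change while the process is running; a long-lived scheduler that must survive a rename would need some way to clear the cache (`get_fqdn.cache_clear()` here).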

oliver-sanders commented 8 months ago

Frustratingly, I actually did have a branch that did this once, but bailed on it as being too high a risk for Cylc 8.0.0 as the new behaviour will not be exactly the same as the old due to the re-jigging of interfaces.

~I'll see what I can dig out.~ - https://github.com/oliver-sanders/cylc-flow/commits/dns/