cylc / cylc-flow

Cylc: a workflow engine for cycling systems.
https://cylc.github.io
GNU General Public License v3.0
335 stars 94 forks source link

comms timeout #4112

Open oliver-sanders opened 3 years ago

oliver-sanders commented 3 years ago

Edit (2020-11-19):

Cylc currently uses a PT5S "round journey" timeout for all requests. This can be overridden by the somewhat confusingly named but still functional --comms-timeout option. There is no immediate pressure to develop this timeout system further or to make it configurable at the moment. Keeping this issue open for now in case that changes.

See https://github.com/cylc/cylc-flow/issues/4112#issuecomment-973292574

The --comms-timeout option from Cylc7 is still present in Cylc8, however, it doesn't do quite what it says on the tin.

It is now a global timeout, it is implemented in the Python layer, however:

# this
$ cylc <cmd> --comms-timeout=<time>
# is roughly equivalent to this
$ timeout <time> cylc <cmd>

Current implementation:

https://github.com/cylc/cylc-flow/blob/07c7f5c9c785f097066ffdf8d4d5630b5914d734/cylc/flow/network/client.py#L171-L188

Questions:

1) Reconsider the case for timeouts and determine how they should look in the future

hjoliver commented 3 years ago

I guess we need a timeout for cylc commands executed by task jobs, to avoid hanging jobs unnecessarily. For user-run commands it doesn't matter so much.

hjoliver commented 3 years ago

The issue is, it's not actually a "comms timeout" - should we rename it, to just "timeout"?

Better check that there is a default timeout esp. for cylc message (doc says: for messaging see global config ...)

hjoliver commented 3 years ago

But ... it does apply only to commands that contact a scheduler.

Decision: get rid of the option, replace it with a network timeout configured in global.cylc. PT30S default? Current default for messaging - PT30S.

oliver-sanders commented 3 years ago

~Current default for messaging - PT30S.~

Current default for messaging - PT5S

(note this is a round-trip timeout, i.e. the time between sending the request and receiving the response)

https://github.com/cylc/cylc-flow/blob/efdb002194f90b7dee41f5dd623fd727eb896de9/cylc/flow/network/client.py#L112

oliver-sanders commented 3 years ago

Decision:

dpmatthews commented 3 years ago

Note that the Cylc 7 [task messaging]connection timeout setting was removed in #3402.

oliver-sanders commented 3 years ago

Slight problem with the above decision...

The new comms timeout global configuration will be loaded from the global config on the platform from which the request was made. This means that in order to set this configuration properly it must be set on all platforms.

The alternative being to add a field to the contact file to allow the value to be passed through to remote platforms.

Options:

  1. Require the comms timeoout to be set on all platforms.
  2. Copy the comms timeout into the contact file.
  3. Just do away with it or hardcode a sensible default.

BTW: We are currently doing (3) with a hardcoded default of PT5S. Haven't seen any issues yet.

hjoliver commented 3 years ago

IMO a hard coded timeout is fine, as well as simpler, so long as it is long enough to handle reasonable latencies.

datamel commented 3 years ago

Discussed in VC 18/11/21. Decision remove this from rc1 and track use cases before proceeding with this issue.

hjoliver commented 3 years ago

(Also from the VC: if we do come back to this, having to configure it on remote platforms is not unreasonable given the functionatlity it affects).