flux-framework / flux-coral2

Plugins and services for Flux on CORAL2 systems
GNU Lesser General Public License v3.0

Configure coral2-dws's k8s connections to tolerate more lost connections #159

Open jameshcorbett opened 1 month ago

jameshcorbett commented 1 month ago

Problem: the coral2-dws service on elcap sometimes loses its connection to the k8s server and logs a warning like the following:

WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /apis/dataworkflowservices.github.io/v1alpha2/namespaces/default/workflows?resourceVersion=0&timeoutSeconds=1&watch=True

Sometimes this can cause workflows to become stuck.

The service should become more resilient to dropped connections and keep retrying with a backoff.
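As a rough illustration of "keep retrying with a backoff", here is a minimal standard-library sketch. It is not the coral2-dws implementation: `retry_with_backoff` and all of its parameters are hypothetical names, and the choice of the built-in `ConnectionError` as the retriable exception is an assumption (the k8s client may surface the failure as a different exception type, which would need to be passed in via `retriable`).

```python
import random
import time


def retry_with_backoff(func, retriable=(ConnectionError,), max_attempts=None,
                       base_delay=1.0, max_delay=60.0, jitter=True,
                       sleep=time.sleep):
    """Call func() until it succeeds, backing off exponentially between failures.

    Hypothetical helper for illustration only. With max_attempts=None it
    retries forever; otherwise the last exception is re-raised once
    max_attempts calls have failed.
    """
    attempt = 0
    while True:
        try:
            return func()
        except retriable:
            attempt += 1
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Delay doubles after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            if jitter:
                # Randomize the wait so many clients do not reconnect in lockstep.
                delay = random.uniform(0, delay)
            sleep(delay)
```

A caller would wrap whatever function opens the watch, for example `retry_with_backoff(open_watch)`, where `open_watch` is a placeholder for the service's own connection code. Capping the delay keeps the service from waiting arbitrarily long once the server is reachable again.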

roehrich-hpe commented 1 month ago

We have a fix for the dangling finalizer, which prevents the workflow from being deleted, in https://github.com/flux-framework/flux-coral2/issues/165

Also, note that the warning above shows Retry(total=2), which in urllib3 is the number of retries still remaining, so the request was being retried. I'm guessing it finally succeeded, because your notes don't say the warning was followed by an error indicating that it failed. We've seen this same warning when we restart the haproxy on the control plane node that has the VIP.