Open jameshcorbett opened 1 month ago
We have a fix for the dangling finalizer, which prevents the workflow from being deleted, in https://github.com/flux-framework/flux-coral2/issues/165
Also, note that the warning above says that it did 2 retries, after which I'm guessing it finally succeeded, because your notes don't say it was followed by an error indicating that it failed. We've seen this same warning when we restart the haproxy on the control plane node that has the VIP.
Problem: the coral2-dws service on elcap sometimes loses connection to the k8s server, logging
Sometimes this can cause workflows to become stuck.
Somehow the service should become more resilient, and keep retrying with a backoff.
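
For illustration, retrying with a backoff could look roughly like the sketch below. This is not the coral2-dws code: `retry_with_backoff` and its parameters are hypothetical names, and `ConnectionError` stands in for whatever exception the k8s client actually raises when the connection drops. It doubles the delay ceiling after each failure, caps it, and sleeps a random amount up to that ceiling so that restarts don't hammer the API server in lockstep.

```python
import random
import time


def retry_with_backoff(func, retriable=(ConnectionError,), base_delay=1.0,
                       max_delay=60.0, max_attempts=None, sleep=time.sleep):
    """Call func() until it succeeds, backing off exponentially on failure.

    All names here are hypothetical. `retriable` is an assumption: replace it
    with the exception types the real client raises on a lost connection.
    With max_attempts=None the call is retried forever.
    """
    attempt = 0
    while True:
        try:
            return func()
        except retriable:
            attempt += 1
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up and let the caller see the last error
            # Ceiling grows 1x, 2x, 4x, ... of base_delay, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

A long-running service would probably leave `max_attempts` unset so it keeps retrying until the server comes back, and log each failure so that a stuck connection is visible.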