DiamondLightSource / blueapi


Failed RabbitMQ Connection on Kubernetes not Handled #459

Open · callumforrester opened this issue 2 months ago

callumforrester commented 2 months ago

Issue #285 has made dev easier but broken prod!

Blueapi is capable of starting without a message bus for an easy dev experience, but on Kubernetes, if the message bus connection fails, we want it to fall over so Kubernetes restarts it.

Kubernetes cannot be instructed to start its pods in a particular order; it relies on eventual consistency. It brings up all pods simultaneously, and if some depend on others they may fall over, in which case Kubernetes restarts them. It repeats this process, with exponential backoff, until all pods are happy.

Proposed Solution

We could go back to falling over if there is config for a message bus but the connection fails, while continuing to run without a message bus if there is no config for one.
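
For illustration, a minimal sketch of that startup behaviour (the `StompConfig`, `connect_to_bus` and `setup_messaging` names are hypothetical, not blueapi's actual API):

    # Hypothetical sketch of the proposed startup behaviour; not blueapi's real API.
    import logging
    import sys
    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class StompConfig:
        """Illustrative stand-in for the real STOMP configuration."""
        host: str
        port: int


    def connect_to_bus(stomp: StompConfig) -> None:
        """Placeholder for the real connection attempt; raises on failure."""
        raise ConnectionError(f"Cannot reach {stomp.host}:{stomp.port}")


    def setup_messaging(stomp: Optional[StompConfig]) -> None:
        if stomp is None:
            # No bus configured: keep the easy dev experience, run without one.
            logging.info("No STOMP config supplied, running without a message bus")
            return
        try:
            connect_to_bus(stomp)
        except ConnectionError:
            # Bus configured but unreachable: exit so Kubernetes restarts the pod
            # and applies its usual backoff.
            logging.exception("Failed to connect to configured message bus")
            sys.exit(1)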

Thoughts @joeshannon @DiamondJoseph @ZohebShaikh

ZohebShaikh commented 2 months ago

We could use environment variables in production to change the behaviour of the message bus (i.e. make it fail if the message bus is not working). That way we keep both the production behaviour and the developer experience good.

DiamondJoseph commented 2 months ago

I think if stomp has been defined in the config, and it can't connect to the defined bus, it should fall over regardless. In dev if you don't need a message bus simply don't define one.

stan-dot commented 1 month ago

@DiamondJoseph why should it fail? It could still provide other functionality like 'get devices' and 'get plans' from the server, and just refuse to run a plan. That way you have more fine-grained information; shutting down everything is coarse.

stan-dot commented 1 month ago

That issue did not have a specific user story; maybe that was the source of the confusion: https://github.com/DiamondLightSource/blueapi/issues/285

callumforrester commented 1 month ago

@stan-dot because if it doesn't fail then its state is ambiguous from the point of view of a tool like Kubernetes.

If we know it will fail if it doesn't connect, and we know the kubernetes pod has successfully started, we can assume that it is connected to the message bus.

If the process keeps working after failure, the pod will stay up and we can no longer make that assumption.

stan-dot commented 1 month ago

The relation of 'we need to know the state unambiguously' to 'the pod must fail' is that of problem to solution.

Different solutions might be possible; a form of polling or an initContainer might be used:

    useInitContainer: true

    initContainers:
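
For illustration, here is a hypothetical script such an init container could run (the names and the `rabbitmq` host are assumptions): it polls the broker's STOMP port until it accepts TCP connections, so the main container only starts once RabbitMQ is reachable.

    # wait_for_rabbitmq.py: hypothetical init-container script, illustrative only.
    import socket
    import sys
    import time

    HOST = "rabbitmq"  # assumed service name for the broker
    PORT = 61613       # default STOMP port
    TIMEOUT_S = 120


    def main() -> int:
        deadline = time.monotonic() + TIMEOUT_S
        while time.monotonic() < deadline:
            try:
                # Succeeds as soon as the broker accepts TCP connections.
                with socket.create_connection((HOST, PORT), timeout=2):
                    print(f"{HOST}:{PORT} is reachable")
                    return 0
            except OSError:
                time.sleep(2)
        print(f"gave up waiting for {HOST}:{PORT}", file=sys.stderr)
        return 1


    if __name__ == "__main__":
        sys.exit(main())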

For me, if it fails I might still not know the reason why. Maybe some config is broken, maybe someone broke ophyd-async. The message bus is just one of the cases, and I don't know the baseline frequency of it failing.

If blueapi keeps running, I can check devices, check plans, and only see the error for the message bus when I try to run plans.

callumforrester commented 1 month ago

we need to know the state unambiguously

*Kubernetes needs to know the state unambiguously

Your initContainer approach would still fail if, say, a temporary network issue occurred after the initContainer had finished but before the main container started up. Kubernetes expects things to fail fast if they're not happy; it's the best approach to ensuring robustness.

That said, another alternative could be a liveness probe that would deem the pod unhealthy if it wasn't connected to the bus.

stan-dot commented 1 month ago

yeah liveness probe would be the best.

we could do https://github.com/epics-containers/ioc-adsimdetector/blob/main/ioc/liveness.sh

as documented here https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

callumforrester commented 1 month ago

yeah liveness probe would be the best.

Any particular reason you prefer that solution to the other one? I'm personally more in favour of fail-fast because it is less ambiguous.

stan-dot commented 1 month ago

It's more ambiguous as to what the reason for the failure is, right? There is a set of conditions for any image to load, and then there are application-specific errors.

The main scenario I have in mind is: 'something fails, I know this, and how narrow and specific is the guidance that the first error indicator I see can give me?'

Maybe you are thinking more about the scenario where the main bottleneck is earlier: noticing that something is wrong in the first place. Fail-fast also clogs the alarm dashboard: https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/

stan-dot commented 1 month ago

That's what I heard:

The solution you proposed—making the pod fail if no message bus is detected at startup—can be effective in certain scenarios, especially where the message bus is absolutely critical for the application to function properly. However, from a systems design perspective, there are several factors to consider:

Evaluation of the Proposed Solution

  1. Visibility and Diagnostics: Failing the pod outright might make it immediately apparent that something is wrong, which is good for quickly noticing problems. However, it can make diagnostics more difficult if the pods are constantly crashing, as it may obscure the underlying issue.
  2. Resilience: Automatically failing a pod can create a fragile system, especially if the message bus has intermittent issues or minor downtimes. This could lead to a cascade of failures and potentially a full system outage.
  3. Recovery: If the pod fails on startup due to the absence of a message bus, the system's ability to recover automatically when the message bus becomes available again might be compromised unless there is a robust mechanism (like Kubernetes liveness probes or an operator) to manage pod restarts efficiently.

Alternatives

Here are some alternatives that might offer a more resilient approach:

  1. Health Checks and Probes: Implement readiness and liveness probes in your Kubernetes setup. The readiness probe can check for the availability of the message bus before allowing traffic to the pod. This way, the pod is marked as "not ready" but doesn't crash, remaining in the cluster and retrying until the message bus becomes available.
  2. Decoupling and Graceful Degradation: Modify the application to function (perhaps with limited capabilities) even when the message bus is unavailable. This approach improves system resilience and user experience, avoiding complete system failure.
  3. Initialization Containers: Use an init container in your Kubernetes pod configuration that checks for the message bus availability before the main application starts. If the message bus is not available, the init container can delay the startup of the application container or retry until it becomes available.
  4. Event-Driven Architecture: Implement an event listener within the application that triggers certain functions only when specific messages are received. This makes the system more flexible and capable of handling periods when the message bus is down.
  5. Monitoring and Alerts: Rather than making the pod fail, a better approach might involve robust monitoring of the message bus health and configuring alerts to notify the operations team if it becomes unavailable. This allows for proactive management of the issue without impacting the service availability.

callumforrester commented 1 month ago

In either case, the pod dies if it can't connect to the message bus. If we use a liveness probe, it takes longer to do so. @DiamondJoseph suggests that we actually need both. The startup failure will kill the pod quickly if there is no message bus. The liveness probe will catch the case of the connection failing during runtime.

stan-dot commented 1 month ago

@gilesknap what is the behavior in this situation for the controls IOCs?

gilesknap commented 1 month ago

Hi @stan-dot. We have discussed this issue a few times and our current thinking is that when an IOC cannot connect to its device we exit the process.

This means Kubernetes knows there is an issue and can use restart policy and backoff parameters to manage reconnection attempts and alerting.

We also have a liveness probe that monitors a known PV; thus, if the IOC stops serving its PVs but does not exit, Kubernetes can take over and restart it.

gilesknap commented 1 month ago

I do agree with the point that this causes problems with diagnosing what caused the failure.

stan-dot commented 1 month ago

Thank you for the comprehensive feedback, @gilesknap.

For blueapi it looks like we'd like both: to fail early and to do a periodic liveness check to 'fail it manually'. We could implement this by looking at how it is done for the IOCs.

where is the liveness probe code located, @gilesknap ?

gilesknap commented 1 month ago

where is the liveness probe code located, @gilesknap ?

Hehe! It's not rocket science, it's bash! https://github.com/epics-containers/ioc-template/blob/main/template/ioc/liveness.sh

But also I guess you need to include it in your manifest: https://github.com/epics-containers/ec-helm-charts/blob/4328ce6e933a7cd60e97747085f18e998991384d/Charts/ioc-instance/templates/deployment.yaml#L71-L77

DiamondJoseph commented 1 month ago

blueapi will be a bit simpler: we can define a FastAPI endpoint that checks that the message bus is still connected (or general health?) and configure the standard Kubernetes livenessProbe:

    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 3
      periodSeconds: 3

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request
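
As a rough sketch of what such an endpoint might look like (the `bus_is_connected` check is a placeholder, not an existing blueapi function):

    # Sketch of a FastAPI health endpoint; bus_is_connected() is a placeholder.
    from fastapi import FastAPI, Response, status

    app = FastAPI()


    def bus_is_connected() -> bool:
        """Placeholder: a real implementation would check the STOMP connection."""
        return True


    @app.get("/health")
    def health(response: Response) -> dict[str, str]:
        # Return 503 when the bus is down so the livenessProbe fails and
        # Kubernetes restarts the pod.
        if not bus_is_connected():
            response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
            return {"status": "disconnected"}
        return {"status": "ok"}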

gilesknap commented 1 month ago

Nice. I wish there was a caGet livenessProbe!

callumforrester commented 1 month ago

So the original premise of this ticket was valid: the pod should still fall over if it doesn't connect and the configuration says it should, but we also need a new ticket for a liveness probe and health check?

stan-dot commented 1 month ago

Yes, that rings true. I'd change the title of this ticket, though: 'make pod fail if not connecting to rmq'.

callumforrester commented 1 month ago

I have just realised that we don't need the liveness probe. If the connection is lost, blueapi's current behaviour is to retry the connection indefinitely: https://github.com/DiamondLightSource/blueapi/blob/54b04966ef3ee48f6774122954f1c01b08ff7200/src/blueapi/messaging/stomptemplate.py#L221

We should instead change it so it exits the application after a number of attempts.
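
Roughly along these lines, as a sketch only (the names are illustrative, not the actual stomptemplate API):

    # Sketch of bounded reconnection followed by process exit; illustrative names.
    import sys
    import time
    from typing import Callable


    def connect_or_exit(connect: Callable[[], None],
                        max_attempts: int = 5,
                        delay_s: float = 2.0) -> None:
        """Try to (re)connect a bounded number of times, then exit the process."""
        for attempt in range(1, max_attempts + 1):
            try:
                connect()
                return
            except ConnectionError:
                print(f"connection attempt {attempt}/{max_attempts} failed", file=sys.stderr)
                time.sleep(delay_s)
        # Give up: let Kubernetes restart the pod and apply its own backoff.
        sys.exit(1)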

I'll make two new issues; this one has become very cluttered.

callumforrester commented 1 month ago

...in fact, we don't even need to do that: Kubernetes would just mimic the existing behaviour, i.e. spawn a new pod that would attempt to reconnect, just as the old pod was doing.

We only need one issue. Closing this; continued in #496.

DiamondJoseph commented 1 month ago

What about if we lose a connection during a long running scan?

callumforrester commented 1 month ago

Good point; we also saw blueapi keep running without a connection yesterday...

callumforrester commented 1 month ago

@DiamondJoseph Currently, if the connection were lost during a scan, an exception would be raised and the scan would be aborted. I think this is correct: the argument goes that the connection is essential to preventing data loss, so without it there is no point in continuing the scan.

I'm going to throw something else out there based on a recent discussion with @tpoliaw

We generally risk data loss with the current pub/sub architecture, and in exchange we don't get much decoupling. Perhaps the long-term solution is another service to which blueapi talks directly and synchronously. That way, we know each document will be persisted. If blueapi fails to persist a document via this service, it will abort the scan.