epics-containers / ioc-template

A copier template for creating generic IOC repositories

IOCs need a readiness probe #22

Open gilesknap opened 1 month ago

gilesknap commented 1 month ago

When an IOC is failing we get many, many messages from K8S.

That is because the IOC takes long enough between starting and crashing that K8S defaults to considering it READY.

We should add a readiness probe along the same lines as our liveness probe, except that it loops until the PV is available and exits with failure after 30 seconds or so if it does not become available.

Warning: if the IOC crashes after iocInit then the status PVs may appear briefly, so we'd need to cope with that - perhaps make sure the PV stays available for some count of seconds before returning success.
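
As a minimal sketch, a readiness probe along those lines could look something like the stanza below. The caget command assumes epics-base is on the PATH inside the IOC container, and STATUS_PV is a hypothetical placeholder for whichever PV the IOC is expected to serve - both are assumptions for illustration, not what the template currently does.

readinessProbe:
  exec:
    command:
      - sh
      - -c
      # STATUS_PV is a hypothetical placeholder for a PV this IOC should be serving
      - caget -w 2 "${STATUS_PV}"
  periodSeconds: 5       # kubelet re-runs the check, so the command itself need not loop
  timeoutSeconds: 3
  successThreshold: 3    # PV must be gettable on 3 consecutive checks before Ready
  failureThreshold: 6    # roughly 30s of failed checks before the pod is marked NotReady

Because the kubelet re-runs the command every periodSeconds, the "loop until available, give up after 30 secs or so" behaviour is expressed through failureThreshold rather than inside the script, and successThreshold greater than 1 (which Kubernetes allows for readiness probes, though not for liveness) covers the concern above about the PV appearing only briefly after iocInit.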

marcelldls commented 1 month ago

@gilesknap The case we are considering is a failing IOC. I do the following to get a failing ioc-instance:

In the config ioc.yaml I add an entity of type "goodluck" that ibek will not recognise:

entities:
  - type: epics.EpicsCaMaxArrayBytes
    max_bytes: 6000000

  - type: devIocStats.iocAdminSoft
    IOC: bl01c-di-dcam-01

  - type: ADAravis.aravisCamera
    ADDR: 0
    CLASS: AVT_Mako_G234B
    ID: bl01c-di-dcam-01
    P: BL01C-DI-DCAM-01
    PORT: D1.CAM
    R: ":CAM:"
    TIMEOUT: 1

  - type: goodluck   ##### Here #####
    ADDR: 0
    NDARRAY_ADDR: 0
    NDARRAY_PORT: D1.CAM
    P: BL01C-DI-DCAM-01
    PORT: D1.roi
    QUEUE: 16
    R: ":ROI:"
    TIMEOUT: 1

I deploy this to my personal namespace in argus. This results in these logs:

ValidationError: 3 validation errors for NewIOC
entities.3
  Input tag 'goodluck' found using 'type' does not match any of the expected 
tags: 'ADAravis.aravisCamera', 'ADAravis.aravisSettings', 'ADCore.NDFileNexus', 
'ADCore.NDFFT', 'ADCore.NDPosPlugin', 'ADCore.NDOverlay', 
'ADCore.NDColorConvert', 'ADCore.NDFileHDF5', 'ADCore.NDFileNull', 
'ADCore.NDStdArrays', 'ADCore._NDCircularBuff', 'ADCore.NDFileMagick', 
'ADCore.NDCircularBuff', 'ADCore.NDAttrPlot', 'ADCore.NDCodec', 
'ADCore.NDGather', 'ADCore.NDROI', 'ADCore.NDAttribute', 'ADCore.NDStats', 
'ADCore.NDTimeSeries', 'ADCore.NDAttributes', 'ADCore.NDProcess', 
'ADCore.NDFileTIFF', 'ADCore.NDGather8', 'ADCore.NDROIStat', 
'ADCore.NDFileNetCDF', 'ADCore.NDPvaPlugin', 'ADCore.NDTransform', 
'ADCore.NDFileJPEG', 'ADCore.NDScatter', 'asyn.AsynIP', 'asyn.AsynIPServer', 
'asyn.Vxi11', 'asyn.AsynSerial', 'autosave.Autosave', 
'epics.EpicsCaMaxArrayBytes', 'epics.EpicsTsMinWest', 'epics.dbpf', 
'epics.EpicsEnvSet', 'epics.StartupCommand', 'epics.PostStartupCommand', 
'epics.InterruptVectorVME', 'devIocStats.devIocStatsHelper', 
'devIocStats.iocAdminVxWorks', 'devIocStats.iocAdminScanMon', 
'devIocStats.iocGui', 'devIocStats.iocAdminSoft' [type=union_tag_invalid, 
input_value={'type': 'goodluck', 'ADD...: ':ROI:', 'TIMEOUT': 1}, 
input_type=dict]
    For further information visit 
https://errors.pydantic.dev/2.8/v/union_tag_invalid
entities.4.`ADCore.NDStats`
  Value error, object D1.roi not found in ['D1.CAM', 'D1.stat'] 
[type=value_error, input_value={'type': 'ADCore.NDStats'...ZE': 1292, 'YSIZE': 
964}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error
entities.5.`ADCore.NDStdArrays`
  Value error, object D1.roi not found in ['D1.CAM', 'D1.stat', 'D1.arr'] 
[type=value_error, input_value={'type': 'ADCore.NDStdArr...OUT': 1, 'TYPE': 
'Int8'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error

After 3 minutes I get an email:

description = Pod esq51579/bl01c-di-dcam-01-0 (bl01c-di-dcam-01) is in waiting state (reason: "CrashLoopBackOff").

At this point there had been 5 restarts.

After 15 minutes I get another email:

Annotations
description = StatefulSet esq51579/bl01c-di-dcam-01 has not matched the expected number of replicas for longer than 15 minutes.

After another 15 minutes I get the same, after another 30 minutes another, and so on.

Discussion:

  1. Personally I would have wanted to know sooner if it were crash looping. It seems that if there were only a couple of restarts, I wouldn't even be aware my application had a failure.
  2. Every 15 minutes forever is annoying, but if it's not annoying there's no reason to fix it. I can see it would be useful to have rules/subscriptions for the emails when you are not active in that cluster - but I'm not sure that we should change the behaviour of the pod to reduce notifications.
  3. You have suggested adding a readiness probe. Apparently, "if the Readiness probe fails, there will be no restart". I was not aware IOCs are self healing? My understanding is that if they fail we need to restart them?
  4. My understanding is that the readiness state is a gate to allow traffic from a service. Since we don't use services for our IOCs, I'm not sure this is useful? Would it not just make extra error messages for ready not reached? I don't believe this would stop crash looping, and as per 3, I'm not sure why we would want it to.
  5. Perhaps in ec we should be looking at Running rather than Ready for IOC health.
  6. I have been reminded of the "startup probe", which might be useful. Slow starting containers: startup probes should be used when the application in your container could take a significant amount of time to reach its normal operating state. Applications that would crash or throw an error if they handled a liveness or readiness probe during startup need to be protected by a startup probe. This might be useful so we can have a more sensitive readiness probe, as I believe that PVs aren't immediately gettable. If the startup probe fails, the kubelet kills the container, and the container is subjected to its restartPolicy. For the startup probe I picture a slow loop doing the readiness check that has a couple of attempts before failing over (a sketch follows below). Note this doesn't address the issue here.
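
As a rough sketch of item 6, assuming the same hypothetical caget check as the readiness sketch above, a startup probe might look like this; the generous failureThreshold times periodSeconds window (here up to about 5 minutes) is what buys a slow-starting IOC time before the other probes take over:

startupProbe:
  exec:
    command:
      - sh
      - -c
      - caget -w 2 "${STATUS_PV}"   # same hypothetical placeholder PV as above
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 30   # up to ~5 minutes of failed checks before the kubelet restarts the container

While the startup probe is failing, the liveness and readiness probes are not run, which is what lets the readiness probe itself stay relatively sensitive.
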
marcelldls commented 1 month ago

Our liveness check definition does have "initialDelaySeconds":120
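
For reference, a sketch of how that setting sits in a liveness probe stanza; the exec command here is illustrative only, not the template's actual check:

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - caget -w 2 "${STATUS_PV}"   # illustrative placeholder, not the real liveness command
  initialDelaySeconds: 120   # the 120 second delay mentioned above
  periodSeconds: 10
  failureThreshold: 3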

gilesknap commented 2 weeks ago

Including @tghartland, as my assertion that we need a readiness probe is based on a conversation in a helpdesk issue. (Unfortunately I asked Thomas to resolve that issue and now cannot find it.)

I don't think 15 minute emails are acceptable because they go to lots of people and therefore waste lots of people's time. So I think we want to fix this.

Thomas, please can you remind me of your reasoning for why you believed we needed a readiness probe to resolve this?

tghartland commented 2 weeks ago

The ticket Giles mentions was for one IOC pod in p45 crash looping because one of the other devices it connects to was powered off. The crash looping part was handled normally, but as the pod was becoming Ready each time before crashing, it was causing a flood of emails due to the overall readiness of the statefulset flipping back and forth.

I'll copy my analysis from that ticket:

Yes, it looks like the statefulset alert is the one causing the issues. Looking at the query used to calculate that alert, it was disappearing (which for this purpose means resolving) fairly regularly:

[Screenshot from 2024-09-23 12-18-17]

There are three metrics that go into this query; the one that was causing it to resolve was kube_statefulset_status_replicas_ready, which was changing to the expected value of 1 ready replica:

[Screenshot from 2024-09-23 12-19-37]

At the same time in the logs, it looks like the pod was starting up:

[Screenshot from 2024-09-23 12-24-47]

but hitting the connection error you can see in the logs there. That IP address is bl45p-mo-panda-01.diamond.ac.uk, which I assume was powered off on the beamline.

The "kubernetes native" fix here would be to implement a readiness probe (I see there is a liveness probe already but no readiness probe) which does not return true until the pod has verified that its dependencies are available and has successfully started. Then if this happened again, the pod would still crash loop and you'd get the same trickle of emails for those alerts, but it would be crashing before marking itself as Ready and therefore not resolving the alert on the statefulset every loop.

In the same way that a standard webserver readiness probe makes a GET request to 127.0.0.1/health, ensuring that the webserver is up and responding, does it make sense to have one that makes a local channel access request inside the pod to determine readiness?
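
To make that analogy concrete, the webserver pattern is the standard httpGet form below; for an IOC the httpGet stanza would be swapped for an exec channel access check like the caget sketch earlier in this thread (the path and port here are just the generic webserver convention, not anything defined by this template):

readinessProbe:
  httpGet:
    path: /health   # conventional webserver health endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 3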

@marcelldls when you say IOCs are not self healing and need manual input to restart, does that include this situation above where the IOC is failing to start at all until another dependency comes up? Or would this be a case where the try-until-success loop works and is desirable?

Even though the main functional purpose of Readiness is to indicate being ready to receive requests (from Kubernetes services), I think this state appears in enough interfaces (get pods, dashboards) and the monitoring/alerting stack to be worth keeping meaningful-ish. If you could treat it as "ready to handle channel access" I think that would help keep the mental model consistent, even if it doesn't make any functional difference.

gilesknap commented 2 weeks ago

Regarding self healing: the example Marcell used was to deploy a broken IOC, and that would clearly not self heal. But once a working IOC has been deployed, the most likely cause of boot loops is that its device is unavailable. In that case we would want it to keep trying until its device became available.

But even for broken IOCs, reducing the number of alerts is still desirable. And it seems the readiness probe can allow that.

an aside re IOC failure behaviour