Open gilesknap opened 1 month ago
@gilesknap The case we are considering is a failing IOC. I do the following to get a failing ioc-instance:
In the config ioc.yaml I add an entity of type "goodluck" that ibek will not recognise:
entities:
  - type: epics.EpicsCaMaxArrayBytes
    max_bytes: 6000000
  - type: devIocStats.iocAdminSoft
    IOC: bl01c-di-dcam-01
  - type: ADAravis.aravisCamera
    ADDR: 0
    CLASS: AVT_Mako_G234B
    ID: bl01c-di-dcam-01
    P: BL01C-DI-DCAM-01
    PORT: D1.CAM
    R: ":CAM:"
    TIMEOUT: 1
  - type: goodluck ##### Here #####
    ADDR: 0
    NDARRAY_ADDR: 0
    NDARRAY_PORT: D1.CAM
    P: BL01C-DI-DCAM-01
    PORT: D1.roi
    QUEUE: 16
    R: ":ROI:"
    TIMEOUT: 1
I deploy this to my personal namespace in argus. This results in these logs:
ValidationError: 3 validation errors for NewIOC
entities.3
Input tag 'goodluck' found using 'type' does not match any of the expected
tags: 'ADAravis.aravisCamera', 'ADAravis.aravisSettings', 'ADCore.NDFileNexus',
'ADCore.NDFFT', 'ADCore.NDPosPlugin', 'ADCore.NDOverlay',
'ADCore.NDColorConvert', 'ADCore.NDFileHDF5', 'ADCore.NDFileNull',
'ADCore.NDStdArrays', 'ADCore._NDCircularBuff', 'ADCore.NDFileMagick',
'ADCore.NDCircularBuff', 'ADCore.NDAttrPlot', 'ADCore.NDCodec',
'ADCore.NDGather', 'ADCore.NDROI', 'ADCore.NDAttribute', 'ADCore.NDStats',
'ADCore.NDTimeSeries', 'ADCore.NDAttributes', 'ADCore.NDProcess',
'ADCore.NDFileTIFF', 'ADCore.NDGather8', 'ADCore.NDROIStat',
'ADCore.NDFileNetCDF', 'ADCore.NDPvaPlugin', 'ADCore.NDTransform',
'ADCore.NDFileJPEG', 'ADCore.NDScatter', 'asyn.AsynIP', 'asyn.AsynIPServer',
'asyn.Vxi11', 'asyn.AsynSerial', 'autosave.Autosave',
'epics.EpicsCaMaxArrayBytes', 'epics.EpicsTsMinWest', 'epics.dbpf',
'epics.EpicsEnvSet', 'epics.StartupCommand', 'epics.PostStartupCommand',
'epics.InterruptVectorVME', 'devIocStats.devIocStatsHelper',
'devIocStats.iocAdminVxWorks', 'devIocStats.iocAdminScanMon',
'devIocStats.iocGui', 'devIocStats.iocAdminSoft' [type=union_tag_invalid,
input_value={'type': 'goodluck', 'ADD...: ':ROI:', 'TIMEOUT': 1},
input_type=dict]
For further information visit
https://errors.pydantic.dev/2.8/v/union_tag_invalid
entities.4.`ADCore.NDStats`
Value error, object D1.roi not found in ['D1.CAM', 'D1.stat']
[type=value_error, input_value={'type': 'ADCore.NDStats'...ZE': 1292, 'YSIZE':
964}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.8/v/value_error
entities.5.`ADCore.NDStdArrays`
Value error, object D1.roi not found in ['D1.CAM', 'D1.stat', 'D1.arr']
[type=value_error, input_value={'type': 'ADCore.NDStdArr...OUT': 1, 'TYPE':
'Int8'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.8/v/value_error
After 3 minutes I get an email:
description = Pod esq51579/bl01c-di-dcam-01-0 (bl01c-di-dcam-01) is in waiting state (reason: "CrashLoopBackOff").
At this point there had been 5 restarts.
After 15 minutes I get an email:
Annotations
description = StatefulSet esq51579/bl01c-di-dcam-01 has not matched the expected number of replicas for longer than 15 minutes.
15 minutes after that I get the same, after 30 minutes another, and so on.
Discussion:

readiness: this state is a gate to allow traffic from a service. Since we don't use services for our IOCs, I'm not sure this is useful? Would it not just make extra error messages for "ready" never being reached? I don't believe this would stop crash looping, and as per 3. I'm not sure why we would want it to.

Slow Starting Containers: "Startup probes should be used when the application in your container could take a significant amount of time to reach its normal operating state. Applications that would crash or throw an error if they handled a liveness or readiness probe during startup need to be protected by a startup probe." This might be useful so we can have a more sensitive readiness probe, as I believe that PVs aren't immediately gettable. "If the startup probe fails, the kubelet kills the container, and the container is subjected to its restartPolicy." For the startup probe I picture a slow loop doing the readiness probe that has a couple of attempts before failing over (see the sketch below). Note this doesn't address the issue here. Our liveness check definition does have "initialDelaySeconds": 120.
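For illustration, here is a minimal sketch of what such a startup probe might look like in the IOC container spec, assuming a caget of a hypothetical status PV as the check and mirroring the existing 120 second budget (the PV name and all of the numbers are placeholders, not values from our actual chart):

# Sketch only: a startup probe that slowly retries the same check the
# readiness probe would use, holding off liveness/readiness (and restarts)
# while the IOC boots.
startupProbe:
  exec:
    # hypothetical check: read a status PV served by the IOC itself
    command: ["caget", "-w", "2", "BL01C-DI-DCAM-01:STATUS"]
  periodSeconds: 10     # try every 10 s ...
  failureThreshold: 12  # ... for up to ~120 s, matching initialDelaySeconds: 120
  timeoutSeconds: 5
# once the startup probe has succeeded, the liveness probe's
# initialDelaySeconds would no longer be needed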
Including @tghartland, as my assertion that we need a readiness probe is based on a conversation in a helpdesk issue. (Unfortunately I asked Thomas to resolve that issue and now cannot find it.)
I don't think 15-minute emails are acceptable because they go to lots of people and therefore waste lots of people's time. So I think we want to fix this.
Thomas, please can you remind me of your reasoning for why you believed we needed a readiness probe to resolve this?
The ticket Giles mentions was for one IOC pod in p45 crash looping because one of the other devices it connects to was powered off. The crash looping part was handled normally, but as the pod was becoming Ready each time before crashing, it was causing a flood of emails due to the overall readiness of the statefulset flipping back and forth.
I'll copy my analysis from that ticket:
Yes it looks like the statefulset alert is the one causing the issues. Looking at the query used to calculate that alert, it was disappearing (which for this purpose means resolving) fairly regularly. There are three metrics that go into this query; the one that was causing it to resolve was kube_statefulset_status_replicas_ready, which was changing to the expected value of 1 ready replica.

At the same time in the logs, it looks like the pod was starting up, but hitting the connection error you can see in the logs there. That IP address is bl45p-mo-panda-01.diamond.ac.uk, which I assume was powered off on the beamline.

The "kubernetes native" fix here would be to implement a readiness probe (I see there is a liveness probe already but not a readiness probe) which does not return true until the pod has verified that its dependencies are available and has successfully started. Then if this happened again, the pod would still crash loop and you'd get the same trickle of emails for those alerts, but it would be crashing before marking itself as Ready and therefore not resolving the alert on the statefulset every loop.
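For context, that alert appears to correspond to the standard kubernetes-mixin KubeStatefulSetReplicasMismatch rule, which looks roughly like the following (reproduced from memory as a sketch; the exact rule deployed for argus may differ). It shows why a pod that briefly reports Ready makes the alert resolve and then re-fire:

- alert: KubeStatefulSetReplicasMismatch
  # fires while ready replicas differ from the reported replica count and
  # no rollout progress has been made in the last 10 minutes; each time the
  # crashing pod briefly becomes Ready the expression stops matching and
  # the alert resolves, generating another round of notifications
  expr: |
    (
      kube_statefulset_status_replicas_ready{job="kube-state-metrics"}
        !=
      kube_statefulset_status_replicas{job="kube-state-metrics"}
    )
    and
    (
      changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[10m])
        ==
      0
    )
  for: 15m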
In the same way as a standard webserver readiness probe is to have it make a GET request to 127.0.0.1/health, ensuring that the webserver is up and responding, does it make sense to have one that makes a local channel access request inside the pod to determine readiness?
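As a concrete (hypothetical) version of that idea, the readiness probe could be an exec probe that does a channel access get against one of the IOC's own PVs from inside the pod; the PV name below is just a placeholder:

readinessProbe:
  exec:
    # the channel access analogue of GET /health: succeeds (exit 0) only
    # if the IOC answers a caget for one of its own PVs
    command: ["caget", "-w", "2", "BL01C-DI-DCAM-01:STATUS"]
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3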
@marcelldls when you say IOCs are not self healing and need manual input to restart, does that include this situation above where the IOC is failing to start at all until another dependency comes up? Or would this be a case where the try-until-success loop works and is desirable?
Even though the main functional purpose of Readiness is to indicate being ready to receive requests (from kubernetes services), I think this state is surfaced in enough interfaces (get pods, dashboards) and enough of the monitoring/alerting stack to be worth keeping as meaningful-ish. If you could work with it as if it meant "ready to handle channel access", I think that would help keep the mental model consistent, even if it doesn't make any functional difference.
Regarding self-healing: the example Marcell used was to deploy a broken IOC, and that would clearly not self-heal. But once a working IOC has been deployed, the most likely cause of boot loops is that its device is unavailable. We would want it to keep trying until its device became available in that instance.
But even for broken IOCs, reducing the amount of alerts is still desirable. And it seems the readiness probe can allow that.
When an IOC is failing we get many many messages from K8S.
That is because the IOC takes long enough to start and crash that K8S, having no readiness probe to consult, defaults to considering it READY on each loop.
We should add a readiness probe along the same lines as our liveness probe, except that it loops until the PV is available and fails after 30 secs or so if the PV does not become available.
Warning: if the IOC crashes after ioc_init then the status PVs may appear briefly, so we'd need to cope with that - perhaps make sure the PV stays available for some count of seconds before returning success.
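A rough sketch of that proposal as an exec readiness probe, with the ~30 second budget and a "stays available for a few seconds" requirement. The PV name, the availability of bash and caget in the image, and all of the numbers are assumptions for illustration:

readinessProbe:
  exec:
    command:
      - bash
      - -c
      - |
        # loop for up to ~30 s waiting for the status PV, then require a
        # few consecutive successful reads so that PVs which appear only
        # briefly after ioc_init (before a crash) do not count as Ready
        deadline=$((SECONDS + 30))
        stable=0
        while [ "$SECONDS" -lt "$deadline" ]; do
          if caget -w 1 "BL01C-DI-DCAM-01:STATUS" > /dev/null 2>&1; then
            stable=$((stable + 1))
            [ "$stable" -ge 5 ] && exit 0   # ~5 s of continuous availability
          else
            stable=0
          fi
          sleep 1
        done
        exit 1
  periodSeconds: 30
  timeoutSeconds: 40   # must exceed the 30 s loop or the kubelet kills the probe
  failureThreshold: 1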