equinor / radix-api

MIT License

Extend replicaStatus object with information about evicted pods caused by exceeding emptyDir sizeLimit #621

Open nilsgstrabo opened 6 months ago

nilsgstrabo commented 6 months ago

Related to use ReadOnlyFileSystem

If usage of an emptyDir volume in a container exceeds its sizeLimit, Kubernetes forcefully kills the container and sets the pod phase to Failed. Kubernetes then creates a new Pod to run the container instead of reusing the existing Pod.

We should include info about these events (stored in pod.status) in replicaList returned by radix-api.

Another issue is that these failed pods interfere with the calculation of the component status. A pod in the Failed phase due to an emptyDir violation causes the component status to be Reconciling. Not sure what the status should be. The user should be able to easily see that there are issues, but I feel that Reconciling is wrong. A component can be in one of the following statuses: "Stopped", "Consistent", "Reconciling", "Restarting", "Outdated". Not sure if any of them fit this situation.
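One possible direction, sketched here only to make the problem concrete: separate evicted pods from the rest before the status is calculated, so they can be reported without pushing the component into Reconciling. The `Pod` type, the `splitEvicted` helper and the pod names are hypothetical and do not reflect the existing radix-api code; which status to show is still an open question.

```go
package main

import "fmt"

// Pod is a local stand-in holding only the pod.status fields used here.
type Pod struct {
	Name   string
	Phase  string // pod.status.phase
	Reason string // pod.status.reason
}

// isEvicted reports whether the pod failed because it was evicted.
func isEvicted(p Pod) bool {
	return p.Phase == "Failed" && p.Reason == "Evicted"
}

// splitEvicted (hypothetical helper) separates evicted pods from the pods
// that would still count toward the component status.
func splitEvicted(pods []Pod) (active, evicted []Pod) {
	for _, p := range pods {
		if isEvicted(p) {
			evicted = append(evicted, p)
		} else {
			active = append(active, p)
		}
	}
	return active, evicted
}

func main() {
	// Pod names are made up for illustration.
	pods := []Pod{
		{Name: "simple-old", Phase: "Failed", Reason: "Evicted"},
		{Name: "simple-new", Phase: "Running"},
	}
	active, evicted := splitEvicted(pods)
	fmt.Printf("pods for status calculation: %d, evicted pods to report: %d\n",
		len(active), len(evicted))
}
```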

Also, it would be useful for the user to be able to clean up (delete?) Pods in a failed state.
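A rough sketch of what such a cleanup could look like, assuming deletion is hidden behind a small interface. `podDeleter`, `cleanupFailedPods`, `recordingDeleter` and the pod names are all hypothetical; a real implementation would call the Kubernetes API instead of the fake used here.

```go
package main

import "fmt"

// Pod is a local stand-in holding only the fields used here.
type Pod struct {
	Name  string
	Phase string // pod.status.phase
}

// podDeleter is a hypothetical abstraction over the client that deletes pods.
type podDeleter interface {
	DeletePod(name string) error
}

// cleanupFailedPods deletes every pod in the Failed phase and returns the
// names of the pods it deleted. It stops at the first deletion error.
func cleanupFailedPods(pods []Pod, d podDeleter) ([]string, error) {
	var deleted []string
	for _, p := range pods {
		if p.Phase != "Failed" {
			continue
		}
		if err := d.DeletePod(p.Name); err != nil {
			return deleted, fmt.Errorf("delete pod %s: %w", p.Name, err)
		}
		deleted = append(deleted, p.Name)
	}
	return deleted, nil
}

// recordingDeleter is a fake deleter that only records names; example use only.
type recordingDeleter struct{ names []string }

func (r *recordingDeleter) DeletePod(name string) error {
	r.names = append(r.names, name)
	return nil
}

func main() {
	// Pod names are made up for illustration.
	pods := []Pod{
		{Name: "simple-old", Phase: "Failed"},
		{Name: "simple-new", Phase: "Running"},
	}
	deleted, err := cleanupFailedPods(pods, &recordingDeleter{})
	if err != nil {
		fmt.Println("cleanup failed:", err)
		return
	}
	fmt.Println("deleted pods:", deleted)
}
```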

An example of a failed Pod:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T11:26:40Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T11:28:42Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T11:28:42Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-04-23T11:26:40Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://ae1a03c55355b5f5bc29b6c6990727ee87a6d8b845e2d8a795487d78bc193712
    image: radixdev.azurecr.io/oauth-demo-dev-simple:e2pmj
    imageID: radixdev.azurecr.io/oauth-demo-dev-simple@sha256:ce0827dd93e2dc2d96ac7a941cc2bbf9687ec5e93a8c8218a1d90784c8484859
    lastState: {}
    name: simple
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://ae1a03c55355b5f5bc29b6c6990727ee87a6d8b845e2d8a795487d78bc193712
        exitCode: 137
        finishedAt: "2024-04-23T11:28:42Z"
        reason: Error
        startedAt: "2024-04-23T11:26:41Z"
  hostIP: 10.5.3.108
  message: 'Usage of EmptyDir volume "radix-vm-tmp" exceeds the limit "5M". '
  phase: Failed
  podIP: 10.5.3.130
  podIPs:
  - ip: 10.5.3.130
  qosClass: Burstable
  reason: Evicted
  startTime: "2024-04-23T11:26:40Z"
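
As a minimal sketch of how the eviction info in the example above could be picked out of pod.status for replicaStatus: `PodStatus` below is a local stand-in that mirrors only the phase, reason and message fields, and `EvictionInfo` and `evictionInfo` are hypothetical names, not part of the current radix-api model.

```go
package main

import (
	"fmt"
	"strings"
)

// PodStatus mirrors only the pod.status fields used here.
type PodStatus struct {
	Phase   string
	Reason  string
	Message string
}

// EvictionInfo is a hypothetical addition to replicaStatus.
type EvictionInfo struct {
	Reason  string
	Message string
}

// evictionInfo returns eviction details for a pod whose phase is Failed and
// whose reason is Evicted, or nil for any other pod.
func evictionInfo(s PodStatus) *EvictionInfo {
	if s.Phase != "Failed" || s.Reason != "Evicted" {
		return nil
	}
	return &EvictionInfo{
		Reason:  s.Reason,
		Message: strings.TrimSpace(s.Message), // the example message ends with a space
	}
}

func main() {
	// Values copied from the failed Pod example above.
	status := PodStatus{
		Phase:   "Failed",
		Reason:  "Evicted",
		Message: `Usage of EmptyDir volume "radix-vm-tmp" exceeds the limit "5M". `,
	}
	if info := evictionInfo(status); info != nil {
		fmt.Printf("%s: %s\n", info.Reason, info.Message)
	}
}
```

The reason Evicted is also set for other kubelet evictions, such as node-pressure evictions, so the sketch passes the message through as-is rather than trying to parse it.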

emirgens commented 2 months ago

Investigate whether the Pod retention period can be used: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/