canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Kubeflow does not handle evicted Pods correctly #533

Open Barteus opened 1 year ago

Barteus commented 1 year ago

When Charmed Kubeflow (CKF) is left running for some time, Pod eviction happens. The evicted Pods stay in the system until garbage collection (GC) of evicted Pods is invoked. With the default value in vanilla Kubernetes, that happens only once 12,500 evicted Pods are in the system.

On a running CKF deployment, when eviction happens, the evicted Pods are left in Juju, which means that leadership is not transferred from the evicted Pod to the newly created one.

Reproduce:

  1. Deploy Charmed Kubeflow on Kubernetes with 2 Nodes (make sure that Kubeflow Pods are placed on both Nodes)
  2. Drain one of the Nodes
  3. Wait for Pods to be moved to another Node.
  4. Check juju status

Workaround: Manually remove all Evicted Pods
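
The manual cleanup could be scripted roughly as follows. This is only a sketch, not a tested fix for the Juju side: it assumes the leftover Pods are reported with `status.phase` set to `Failed` and `status.reason` set to `Evicted`, and that the input is the parsed output of `kubectl get pods --all-namespaces -o json`. The helper names `find_evicted_pods` and `delete_commands` are made up for illustration.

```python
def find_evicted_pods(pod_list):
    """Return (namespace, name) pairs for Pods that look evicted.

    `pod_list` is assumed to be the dict parsed from
    `kubectl get pods --all-namespaces -o json`.
    """
    evicted = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # Assumption: evicted Pods show phase "Failed" with reason "Evicted".
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = pod.get("metadata", {})
            evicted.append((meta.get("namespace", "default"), meta["name"]))
    return evicted


def delete_commands(evicted):
    """Build one `kubectl delete pod` argument list per evicted Pod."""
    return [
        ["kubectl", "delete", "pod", name, "--namespace", namespace]
        for namespace, name in evicted
    ]
```

Each returned argument list can be reviewed first and then passed to `subprocess.run`. Building the commands separately from running them keeps the destructive step explicit.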

ca-scribner commented 1 year ago

@Barteus you're talking about eviction of the charm operator pods, right? Not the underlying kubeflow workload pods (like the actual kfp-api workload, etc)?

I wonder whether we're missing some logic in our charms, or if Juju is mishandling something

i-chvets commented 1 year ago

Needs investigation.

DnPlas commented 1 year ago

This issue requires us to go through the charms and see which ones are affected.