canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

PodSpec charms are stuck during upgrade from 1.8 to 1.9 #985

Open orfeas-k opened 1 month ago

orfeas-k commented 1 month ago

EDIT: Tracker issue for https://bugs.launchpad.net/juju/+bug/2073529

Bug Description

While trying out the upgrade path from 1.8 to latest/edge, it looks like PodSpec charms get stuck in the following state:

envoy                      res:oci-image@cc06b3e    unknown    0/1  envoy                    latest/edge   253  10.0.3.103    no       
katib-controller           res:oci-image@31ccd70    unknown    0/1  katib-controller         latest/edge   700  10.0.16.155   no       
kubeflow-volumes           res:oci-image@2261827    unknown    0/1  kubeflow-volumes         latest/edge   326  10.0.14.97    no       

with no units up. Looking at the pods, the PodSpec operator pods (with 1 container each) are still there. These pods are expected to die during the juju refresh command. All 3 pods show the following log line, which reports a failure to download the charm:

2024-07-12 10:12:13 ERROR juju.worker.dependency engine.go:695 "operator" manifold worker returned unexpected error: error downloading updated charm ch:amd64/focal/envoy-194: failed to download charm "ch:amd64/focal/envoy-253" from API server: Get https://controller-service.controller-aks-controller.svc.cluster.local:17070/model/9f0ad959-e4b1-409c-8ab5-91f4c6eed235/charms?file=%2A&url=ch%3Aamd64%2Ffocal%2Fenvoy-253: cannot retrieve charm: ch:amd64/focal/envoy-253

However, this error is followed by a message saying the download was completed and verified.
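The error line above mentions two different envoy revisions. As a minimal sketch (not part of the original report), the following Python snippet URL-decodes the log line and collects the charm revisions it refers to. The regular expression, the function name `charm_revisions`, and the assumed `ch:<arch>/<series>/<name>-<revision>` shape are illustrative assumptions, not anything Juju provides.

```python
import re
from urllib.parse import unquote

# Error line copied from the operator pod log quoted in this issue.
LOG_LINE = (
    '2024-07-12 10:12:13 ERROR juju.worker.dependency engine.go:695 "operator" '
    'manifold worker returned unexpected error: error downloading updated charm '
    'ch:amd64/focal/envoy-194: failed to download charm "ch:amd64/focal/envoy-253" '
    'from API server: Get https://controller-service.controller-aks-controller'
    '.svc.cluster.local:17070/model/9f0ad959-e4b1-409c-8ab5-91f4c6eed235/charms'
    '?file=%2A&url=ch%3Aamd64%2Ffocal%2Fenvoy-253: cannot retrieve charm: '
    'ch:amd64/focal/envoy-253'
)

# Assumed shape of a charm URL: ch:<arch>/<series>/<name>-<revision>.
CHARM_URL = re.compile(r"ch:(\w+)/(\w+)/([a-z0-9-]+)-(\d+)")


def charm_revisions(line: str) -> dict[str, list[int]]:
    """Map each charm name in the line to the sorted revisions mentioned."""
    decoded = unquote(line)  # turns %3A and %2F back into ':' and '/'
    found: dict[str, set[int]] = {}
    for _arch, _series, name, rev in CHARM_URL.findall(decoded):
        found.setdefault(name, set()).add(int(rev))
    return {name: sorted(revs) for name, revs in found.items()}


print(charm_revisions(LOG_LINE))
```

For this log line the snippet reports revisions 194 and 253 for envoy, which makes the mismatch inside the single error message easier to see.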

This leaves the charms in a stuck state where running juju scale-application to 0 or 1 doesn't change anything. At the same time, I cannot juju refresh back to the previous version (in order to retry the refresh to the newer one) due to ERROR cannot downgrade from v2 charm format to v1.
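To pick out applications in this state from a longer status listing, a small filter along the following lines could help. This is only a sketch: it assumes each row has the nine whitespace-separated columns seen in the rows quoted earlier (App, Version, Status, Scale, Charm, Channel, Rev, Address, Exposed), and the names `AppRow`, `parse_row`, and `stuck_apps` are hypothetical helpers, not part of Juju.

```python
from typing import NamedTuple

# Rows copied from the status output quoted in this issue.
STATUS_ROWS = """\
envoy                      res:oci-image@cc06b3e    unknown    0/1  envoy                    latest/edge   253  10.0.3.103    no
katib-controller           res:oci-image@31ccd70    unknown    0/1  katib-controller         latest/edge   700  10.0.16.155   no
kubeflow-volumes           res:oci-image@2261827    unknown    0/1  kubeflow-volumes         latest/edge   326  10.0.14.97    no
"""


class AppRow(NamedTuple):
    app: str
    version: str
    status: str
    scale: str
    charm: str
    channel: str
    rev: int
    address: str
    exposed: str


def parse_row(line: str) -> AppRow:
    """Parse one application row; assumes all nine columns are populated."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
    app, version, status, scale, charm, channel, rev, address, exposed = fields
    return AppRow(app, version, status, scale, charm, channel, int(rev), address, exposed)


def stuck_apps(text: str) -> list[str]:
    """Return apps whose status is 'unknown' and that have zero units up."""
    rows = [parse_row(line) for line in text.splitlines() if line.strip()]
    return [r.app for r in rows if r.status == "unknown" and r.scale.startswith("0/")]


print(stuck_apps(STATUS_ROWS))
```

Rows with an empty Version column would have fewer fields and raise a `ValueError` here, so a real status dump would need more careful column handling.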

Here's also the final juju status (from the whole upgrade)

To Reproduce

  1. Create AKS cluster https://charmed-kubeflow.io/docs/create-aks-cluster-for-mlops
  2. Deploy kubeflow 1.8/stable https://charmed-kubeflow.io/docs/deploy-charmed-kubeflow-to-aks#heading--set-up-juju
  3. Try out upgrade path according to doc https://docs.google.com/document/d/1Wg32O5PF8RMy7ng7hY9gX37lHnwmszyBt4D2lI_MSjQ/edit#heading=h.5f400uypxd67. For this bug, only the part of "Rest of PodSpec charms" is required.

Environment

AKS 1.29, Juju 3.4.4

Relevant Log Output

see above

Additional Context

No response

syncronize-issues-to-jira[bot] commented 1 month ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6003.

This message was autogenerated

orfeas-k commented 1 month ago

Update

I also tried removing the charms completely and redeploying from latest/edge (twice, on different clusters), and they all ended up in the same stuck state, unknown 0/1 (with the -operator pods still there). The second time, I noticed that for katib-controller and envoy the -operator pods had been deleted before redeploying, but somehow they reappeared when I tried to deploy. I didn't notice whether the StatefulSet was there all along.

I even deleted the previous StatefulSets and then redeployed the charms, but that didn't unblock them.

orfeas-k commented 1 month ago

Restart the controller

Restarted the controller by deleting its pod, but it didn't help unblock the charms.

orfeas-k commented 1 month ago

Juju bug

This is now essentially a tracker issue for https://bugs.launchpad.net/juju/+bug/2073529