Closed Sponge-Bas closed 2 years ago
Sorry @Basdbruijne, it took a while to get back to you on this. Is this still happening now that we've released 1.6/stable?
If yes, I'm curious about the output of kubectl describe pod katib-ui-bf5875974-5pmv7 (or whatever the katib-ui workload is named in the new env). Specifically, I'm wondering if it is stuck pulling the image or if something else happened. That has gotten me in the past, though never with katib-ui (I don't think the image is very large) and usually not repeatedly between deployments.
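If it helps to collect this automatically in the next test run, here is a minimal sketch that reads the pod's JSON and reports why any init container has not finished (for example, an image pull problem). It assumes kubectl is on the PATH; the function names and the namespace argument are hypothetical, and only the standard Pod status fields (status.initContainerStatuses[].state) are used.

```python
import json
import subprocess
from typing import List


def init_container_problems(pod: dict) -> List[str]:
    """Return one line per init container that has not completed successfully."""
    problems = []
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        state = status.get("state", {})
        if "waiting" in state:
            # e.g. reason "ImagePullBackOff" or "ErrImagePull" when the image pull fails
            reason = state["waiting"].get("reason", "unknown")
            problems.append(f"{status['name']}: waiting ({reason})")
        elif "running" in state:
            problems.append(f"{status['name']}: still running")
        elif state.get("terminated", {}).get("exitCode", 0) != 0:
            code = state["terminated"]["exitCode"]
            problems.append(f"{status['name']}: exited with code {code}")
    return problems


def fetch_pod(name: str, namespace: str) -> dict:
    """Fetch a pod as JSON via kubectl (assumes kubectl is configured for the cluster)."""
    out = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling init_container_problems(fetch_pod("katib-ui-bf5875974-5pmv7", namespace)) while the pod shows Init:0/1 should indicate whether it is waiting on an image or on something else; the namespace is whatever the Kubeflow model uses in that environment.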
Either way, am I right that you're saying that because katib-ui is stuck, it has prevented istio-pilot from supporting the rest of your deployment (e.g. there's some other page not accessible)? I think that's what I see here, specifically because istio-gateway looks non-functional, but I'm trying to make sure.
Yes, that sounds right to me. I will schedule some deployments with 1.6/stable to see if the problem is fixed.
Hello @Basdbruijne any updates on this?
Hi @DomFleischmann, we have not seen this problem since switching to 1.6/stable, so I think we can close this bug.
In this testrun: https://solutions.qa.canonical.com/testruns/testRun/091e5168-80ac-4ed5-886c-476f47ce8b84, which is kubeflow 1.6/beta on baremetal charmed k8s 1.22, the deployment dies with the following status:
There are several problems, but I want to focus on katib-ui because the problems with this charm are consistent across re-deployments. Firstly, katib-ui takes very long to get a pod. This test run stopped after 5 minutes due to the kfp-api error, but the previous deployment had the katib-ui pod stuck on Init:0/1 for about 50 min before it came up. I'm not sure what the root cause of this is. Secondly, when katib-ui comes up it gets stuck on status 'executing' with message '(leader-elected)':
This holds up the istio charms which then hold up the oidc-gatekeeper charm. I think the problem here is that the charm is missing a refresh action for this specific message.
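To make this easier to spot in automated runs, here is a rough sketch that lists units whose agent is in the 'executing' state from the output of juju status --format=json. It assumes the JSON layout applications -> units -> "juju-status" with "current" and "message" keys; the function names are hypothetical, and a single snapshot cannot tell a stuck hook from one that is merely in progress, so it would need to be polled over time.

```python
import json
import subprocess
from typing import List, Tuple


def executing_units(status: dict, agent_state: str = "executing") -> List[Tuple[str, str]]:
    """Return (unit name, agent message) for every unit whose agent is in agent_state."""
    found = []
    for app in status.get("applications", {}).values():
        # "units" can be absent for applications with no units
        for unit_name, unit in (app.get("units") or {}).items():
            agent = unit.get("juju-status", {})
            if agent.get("current") == agent_state:
                found.append((unit_name, agent.get("message", "")))
    return sorted(found)


def fetch_status() -> dict:
    """Read the current model status as JSON (assumes the juju CLI is on the PATH)."""
    out = subprocess.run(
        ["juju", "status", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

If executing_units(fetch_status()) keeps returning the katib-ui unit with the same message across several polls, that would match the behaviour described above.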
The logs for this testrun can be found here: https://oil-jenkins.canonical.com/artifacts/091e5168-80ac-4ed5-886c-476f47ce8b84/index.html
I think the files of interest are:
This is an automated test run and the environment was torn down when the error was encountered. If more information is needed, please let me know what exactly we need and I can collect this information in the next test run.