nobuto-m opened this issue 2 years ago
I tried to reproduce your issue but in my case the trials get completed without patching the katib deployment:
The run fails due to kfserving being unavailable, but there seems to be no issue with katib.
Could you share your notebook configuration (e.g. image, cpu, ram)?
> I tried to reproduce your issue but in my case the trials get completed without patching the katib deployment:

Hmm, interesting. I've hit the timeout 100% of the time so far.

> The run fails due to kfserving being unavailable, but there seems to be no issue with katib.

Yup, the kfserving failure is expected.

> Could you share your notebook configuration (e.g. image, cpu, ram)?

I can't think of anything off the top of my head. I just used the default values, and it shouldn't affect the pipeline execution since the notebook only creates the pipeline rather than running it, if I'm not mistaken.
Just to confirm, have you used microk8s?
> Just to confirm, have you used microk8s?
Yes, I used microk8s 1.21/stable.
Hmm, `trial-resources=Workflow.v1alpha1.argoproj.io` might be a red herring. The issue is still reproducible in my environment, but after adding `trial-resources=Workflow.v1alpha1.argoproj.io` and then removing it again, the trial jobs complete.

So the key to working around the issue might be restarting/recreating the katib-controller pod. It's puzzling why it's reproducible on my testbed but not in the other environment, though.
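For reference, a minimal sketch of one way to add (and later remove) that argument from the CLI. The exact method used in this thread isn't shown, so this is illustrative only; it assumes the controller runs in the `kubeflow` namespace as a single-container deployment, and a Juju charm upgrade may revert the change.

```bash
# Append Workflow as a supported trial resource to the katib-controller args
# (container index 0 and the kubeflow namespace are assumptions about this deployment).
microk8s kubectl -n kubeflow patch deployment katib-controller --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--trial-resources=Workflow.v1alpha1.argoproj.io"}
]'

# To remove it again, delete that args entry (e.g. via kubectl edit);
# either change triggers a rollout that recreates the katib-controller pod.
```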
> So the key to working around the issue might be restarting/recreating the katib-controller pod. It's puzzling why it's reproducible on my testbed but not in the other environment, though.

`microk8s kubectl -n kubeflow rollout restart deployment/katib-controller` somehow does the trick to unstick the trials that never complete.

The controller was restarted around 15:31:

2022-06-07T15:31:48.597347405Z stderr F 2022-06-07 15:31:48 WARNING juju.worker.caasoperator caasoperator.go:554 stopping uniter for dead unit "katib-controller/0": worker "katib-controller/0" not found
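For comparing the two environments, a couple of commands (assuming the same `kubeflow` namespace) to confirm the pod was actually recreated and to see what the controller logged around the restart:

```bash
# The pod name/AGE should change after the rollout restart
microk8s kubectl -n kubeflow get pods | grep katib-controller

# Controller logs since the restart; reconcile errors for trials would show up here
microk8s kubectl -n kubeflow logs deployment/katib-controller --since=1h
```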
I've reproduced it successfully in a clean environment. The steps are almost identical to the ones in the description. Hope it helps you reproduce it on your end, @natalian98.

1. Launch an AWS instance with the following config.
2. Create the microk8s group in advance:

        sudo addgroup --system microk8s
        sudo adduser $USER microk8s

   then log out and log back in.
3. Run the script:

        git clone https://github.com/nobuto-m/quick-kubeflow.git
        cd quick-kubeflow
        git checkout 376c501
        time ./redeploy-microk8s-kubeflow.sh
        ## -> 32 min

4. Connect from a local laptop/desktop:

        sshuttle -r ubuntu@PUBLIC_IP_OF_AWS_INSTANCE 10.64.140.43

   then open http://10.64.140.43.nip.io/
5. Create a Jupyter notebook instance.
6. Import and run the notebook https://raw.githubusercontent.com/kubeflow/katib/fe2ae99d5b8c58a0f56221bb9a58afc131bfafc4/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb (a headless way to run it is sketched below).
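If clicking through Jupyter is inconvenient, a possible headless variant of step 6 (not what was done in this thread, just a sketch) is to fetch the notebook and execute it with nbconvert from a terminal inside the notebook server, assuming `jupyter`/`nbconvert` and the notebook's Python dependencies are available in the image:

```bash
# Fetch the pinned revision of the example notebook
wget -O kubeflow-e2e-mnist.ipynb \
  https://raw.githubusercontent.com/kubeflow/katib/fe2ae99d5b8c58a0f56221bb9a58afc131bfafc4/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb

# Execute all cells in place; the executed copy keeps the outputs for later inspection
jupyter nbconvert --to notebook --execute --inplace kubeflow-e2e-mnist.ipynb
```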
Thanks for providing the detailed steps @nobuto-m.
I deployed kubeflow once more using your instructions and the trials still succeed, provided that the notebook's CPU and memory values are increased.
When using the default CPU==0.5 and memory==1Gi, some garbage collection errors can be observed, which may be the reason why the trials get stuck. Katib-controller wasn't restarted.
After creating a notebook with 2 CPUs and 4Gi memory, the trials succeed.
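For completeness, a minimal sketch of requesting those values via a Notebook resource instead of the UI form, assuming the `kubeflow.org/v1` Notebook CRD; the name, namespace and image below are placeholders, not values used in this thread:

```bash
# Placeholder namespace, name and image; only the resources block is the point here.
microk8s kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: e2e-test-notebook
  namespace: admin
spec:
  template:
    spec:
      containers:
      - name: e2e-test-notebook
        image: kubeflownotebookswg/jupyter-scipy:latest  # placeholder image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
EOF
```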
> I deployed kubeflow once more using your instructions and the trials still succeed, provided that the notebook's CPU and memory values are increased.
> When using the default CPU==0.5 and memory==1Gi, some garbage collection errors can be observed, which may be the reason why the trials get stuck. Katib-controller wasn't restarted.
> After creating a notebook with 2 CPUs and 4Gi memory, the trials succeed.
Thanks for testing. I bumped the notebook to 4 CPUs and 8Gi memory, but it still doesn't work for me.

Also, I'm confused because I thought the notebook instance was irrelevant to the pipeline run. The pipeline is defined from the notebook, but the notebook instance can be deleted before re-running the pipeline, if I'm not mistaken. So I'm wondering how the spec of the notebook instance affects the pipeline. Am I missing something?

> some garbage collection errors can be observed

Which log did you see this in? It might not be the notebook instance, but if there was an error from any component, we can dig in.
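For reference, a sketch of the usual places such garbage-collection or eviction messages would show up, assuming the `kubeflow` namespace and that on this microk8s version the kubelet runs inside the kubelite daemon:

```bash
# Katib side: controller logs and recent cluster events
microk8s kubectl -n kubeflow logs deployment/katib-controller | grep -iE 'garbage|evict|error' | tail -n 50
microk8s kubectl -n kubeflow get events --sort-by=.lastTimestamp | tail -n 30

# Node side: kubelet image/container garbage collection and eviction messages
journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep -iE 'garbage|evict'
```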
I am experiencing a similar failure. katib-controller deploys successfully, but the unit goes into CrashLoopBackOff after submitting an AutoML experiment with all the default configurations (pod logs provided below). Patching the katib-controller deployment with trial resources seems to resolve that error. Restarting the deployment doesn't work for me.
How to reproduce (based on the quickstart doc):

- Create a notebook server with "Allow access to Kubeflow Pipelines" selected (the `access-ml-pipeline: "true"` PodDefault) in the Kubeflow UI.

Expected: All trials are complete.
Solution/workaround:

Once `Workflow.v1alpha1.argoproj.io` is added to the trial-resources of katib-controller, all trials complete: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/argo/README.md#katib-controller

Since both katib-controller and argo are managed by Juju charms, it would be good to see some improvements in this user scenario. For the record, Katib Metrics Collector sidecar injection is enabled out of the box.

OR

Simply restart katib-controller without changing anything. Adding `Workflow.v1alpha1.argoproj.io` might be a red herring since it actually recreates the pod.

[out of the box]

[patched]
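To tell which of the two states above ("out of the box" vs "patched") a given environment is actually in, the running controller's arguments can be inspected directly; the namespace and container index are assumptions, as elsewhere in this thread:

```bash
# Print the katib-controller args; look for --trial-resources entries,
# in particular Workflow.v1alpha1.argoproj.io in the patched case.
microk8s kubectl -n kubeflow get deployment katib-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'; echo
```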