nobuto-m opened this issue 2 years ago
I tried to reproduce your issue but in my case the trials get completed without patching the katib deployment:
The run fails due to kfserving being unavailable, but there seems to be no issue with katib.
Could you share your notebook configuration (e.g. image, cpu, ram)?
> I tried to reproduce your issue but in my case the trials get completed without patching the katib deployment:

Hmm, interesting. I've hit the timeout 100% of the time so far.

> The run fails due to kfserving being unavailable, but there seems to be no issue with katib.

Yup, the kfserving failure is expected.

> Could you share your notebook configuration (e.g. image, cpu, ram)?

I can't think of anything off the top of my head. I just used the default values, and it shouldn't affect the pipeline execution since the notebook only creates the pipeline rather than running it, if I'm not mistaken.
Just to confirm, have you used microk8s?
> Just to confirm, have you used microk8s?
Yes, I used microk8s 1.21/stable.
Hmm, `trial-resources=Workflow.v1alpha1.argoproj.io` might be a red herring. The issue is still reproducible in my environment, but after adding `trial-resources=Workflow.v1alpha1.argoproj.io` and then removing it again, the trial jobs complete.

So the key to working around the issue might be restarting/recreating the katib-controller pod. It's puzzling why it's reproducible on my testbed but not in the other environment, though.
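For reference, a minimal sketch of one way to add (and later remove) that argument from the CLI. The exact method used in this thread isn't shown, so this is illustrative only; it assumes the controller runs in the `kubeflow` namespace as a single-container deployment, and a Juju charm upgrade may revert the change.

```bash
# Append Workflow as a supported trial resource to the katib-controller args
# (container index 0 and the kubeflow namespace are assumptions about this deployment).
microk8s kubectl -n kubeflow patch deployment katib-controller --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/args/-",
   "value": "--trial-resources=Workflow.v1alpha1.argoproj.io"}
]'

# To remove it again, delete that args entry (e.g. via kubectl edit);
# either change triggers a rollout that recreates the katib-controller pod.
```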
> So the key to working around the issue might be restarting/recreating the katib-controller pod. It's puzzling why it's reproducible on my testbed but not in the other environment, though.

`microk8s kubectl -n kubeflow rollout restart deployment/katib-controller` somehow does the trick to unstick the trials that never complete.

The controller was restarted around 15:31:

2022-06-07T15:31:48.597347405Z stderr F 2022-06-07 15:31:48 WARNING juju.worker.caasoperator caasoperator.go:554 stopping uniter for dead unit "katib-controller/0": worker "katib-controller/0" not found
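For comparing the two environments, a couple of commands (assuming the same `kubeflow` namespace) to confirm the pod was actually recreated and to see what the controller logged around the restart:

```bash
# The pod name/AGE should change after the rollout restart
microk8s kubectl -n kubeflow get pods | grep katib-controller

# Controller logs since the restart; reconcile errors for trials would show up here
microk8s kubectl -n kubeflow logs deployment/katib-controller --since=1h
```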
I've reproduced it successfully in a clean environment. The steps are almost identical to the ones in the description. Hope it helps you reproduce it on your end, @natalian98.

1. Launch an AWS instance with the following config.
2. Create the microk8s group in advance:

        sudo addgroup --system microk8s
        sudo adduser $USER microk8s

   then log out and log back in.
3. Run the script:

        git clone https://github.com/nobuto-m/quick-kubeflow.git
        cd quick-kubeflow
        git checkout 376c501
        time ./redeploy-microk8s-kubeflow.sh
        ## -> 32 min

4. Connect from a local laptop/desktop:

        sshuttle -r ubuntu@PUBLIC_IP_OF_AWS_INSTANCE 10.64.140.43

   then open http://10.64.140.43.nip.io/
5. Create a Jupyter notebook instance.
6. Import and run the notebook https://raw.githubusercontent.com/kubeflow/katib/fe2ae99d5b8c58a0f56221bb9a58afc131bfafc4/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb (a headless way to run it is sketched below).
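If clicking through Jupyter is inconvenient, a possible headless variant of step 6 (not what was done in this thread, just a sketch) is to fetch the notebook and execute it with nbconvert from a terminal inside the notebook server, assuming `jupyter`/`nbconvert` and the notebook's Python dependencies are available in the image:

```bash
# Fetch the pinned revision of the example notebook
wget -O kubeflow-e2e-mnist.ipynb \
  https://raw.githubusercontent.com/kubeflow/katib/fe2ae99d5b8c58a0f56221bb9a58afc131bfafc4/examples/v1beta1/kubeflow-pipelines/kubeflow-e2e-mnist.ipynb

# Execute all cells in place; the executed copy keeps the outputs for later inspection
jupyter nbconvert --to notebook --execute --inplace kubeflow-e2e-mnist.ipynb
```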
Thanks for providing the detailed steps @nobuto-m.
I deployed kubeflow once more using your instructions and the trials still succeed, provided that the notebook's CPU and memory values are increased.
When using the default CPU==0.5 and memory==1Gi, some garbage collection errors can be observed, which may be the reason why the trials get stuck. Katib-controller wasn't restarted.
After creating a notebook with 2 CPUs and 4Gi memory, the trials succeed.
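For completeness, a minimal sketch of requesting those values via a Notebook resource instead of the UI form, assuming the `kubeflow.org/v1` Notebook CRD; the name, namespace and image below are placeholders, not values used in this thread:

```bash
# Placeholder namespace, name and image; only the resources block is the point here.
microk8s kubectl apply -f - <<'EOF'
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: e2e-test-notebook
  namespace: admin
spec:
  template:
    spec:
      containers:
      - name: e2e-test-notebook
        image: kubeflownotebookswg/jupyter-scipy:latest  # placeholder image
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
EOF
```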
> I deployed kubeflow once more using your instructions and the trials still succeed, provided that the notebook's CPU and memory values are increased.
> When using the default CPU==0.5 and memory==1Gi, some garbage collection errors can be observed, which may be the reason why the trials get stuck. Katib-controller wasn't restarted.
> After creating a notebook with 2 CPUs and 4Gi memory, the trials succeed.
Thanks for testing. I bumped the notebook to 4 CPUs and 8Gi memory, but it still doesn't work for me.

Also, I'm confused because I thought the notebook instance was irrelevant to the pipeline run. The pipeline is defined from the notebook, but the notebook instance can be deleted before re-running the pipeline, if I'm not mistaken. So I'm wondering how the spec of the notebook instance affects the pipeline. Am I missing something?

> some garbage collection errors can be observed

Which log did you see this in? It might not be the notebook instance, but if there was an error from any component, we can dig in.
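For reference, a sketch of the usual places such garbage-collection or eviction messages would show up, assuming the `kubeflow` namespace and that on this microk8s version the kubelet runs inside the kubelite daemon:

```bash
# Katib side: controller logs and recent cluster events
microk8s kubectl -n kubeflow logs deployment/katib-controller | grep -iE 'garbage|evict|error' | tail -n 50
microk8s kubectl -n kubeflow get events --sort-by=.lastTimestamp | tail -n 30

# Node side: kubelet image/container garbage collection and eviction messages
journalctl -u snap.microk8s.daemon-kubelite --since "1 hour ago" | grep -iE 'garbage|evict'
```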
I am experiencing a similar failure. katib-controller deploys successfully, but the unit goes into CrashLoopBackOff after submitting an AutoML experiment with all the default configurations (pod logs provided below). Patching the katib-controller deployment with trial resources seems to resolve that error. Restarting the deployment doesn't work for me.
How to reproduce (based on the quickstart doc):

- Create a notebook server with "Allow access to Kubeflow Pipelines" selected (the `access-ml-pipeline: "true"` PodDefault) in the Kubeflow UI.

Expected: All trials are complete.
Solution/workaround:

Once `Workflow.v1alpha1.argoproj.io` is added to the trial-resources of katib-controller, all trials complete: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/argo/README.md#katib-controller

Since both katib-controller and argo are managed by Juju charms, it would be good to see some improvements in this user scenario. For the record, Katib Metrics Collector sidecar injection is enabled out of the box.

OR

Simply restart katib-controller without changing anything. Adding `Workflow.v1alpha1.argoproj.io` might be a red herring since it actually recreates the pod.

[out of the box]

[patched]
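To tell which of the two states above ("out of the box" vs "patched") a given environment is actually in, the running controller's arguments can be inspected directly; the namespace and container index are assumptions, as elsewhere in this thread:

```bash
# Print the katib-controller args; look for --trial-resources entries,
# in particular Workflow.v1alpha1.argoproj.io in the patched case.
microk8s kubectl -n kubeflow get deployment katib-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'; echo
```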