canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

UATs are failing with 401 in one-click installation #951

Open kimwnasptd opened 2 days ago

kimwnasptd commented 2 days ago

Bug Description

After installing CKF on Azure with the one-click deployment, some of the UATs fail with 401 errors.

Specifically, the failing tests are the ones that try to trigger a Pipeline, so the problem is most likely with the ServiceAccount that gets injected into the Notebooks (maybe the PodDefault is never created?).
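
One way to test the PodDefault hypothesis is to list the PodDefaults in the user's namespace and look for the one that grants pipeline access. This is only a sketch: the helper `has_poddefault` is hypothetical, and both the PodDefault name `access-ml-pipeline` and the namespace in the usage comment are assumptions that have not been confirmed for this deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the output of
#   kubectl get poddefaults -n <namespace> -o name
# on stdin and succeeds if a PodDefault with the given name is listed.
has_poddefault() {
  local name="$1"
  # `-o name` prints one "<resource>/<name>" entry per line,
  # so match on the part after the slash.
  grep -q -- "/${name}\$"
}

# Example usage against a live cluster (namespace and name are assumptions):
#   kubectl get poddefaults -n test-kubeflow -o name \
#     | has_poddefault access-ml-pipeline \
#     && echo "PodDefault present" || echo "PodDefault missing"
```

If the PodDefault is missing, Notebooks in that namespace would presumably start without the pipeline credentials, which would be consistent with the 401 responses.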

To Reproduce

  1. Deploy CKF on Azure with one-click
  2. Run the UATs


Environment

CKF 1.8 on Azure

Relevant Log Output

Additional Context

No response

syncronize-issues-to-jira[bot] commented 2 days ago

Thank you for reporting your feedback to us!

The internal ticket has been created:

This message was autogenerated

misohu commented 19 hours ago

Environment: Juju 3.4.4, AKS v1.28.9, Kubeflow 1.8/stable

I created the one-click deployment and made sure that all components are active:

Model     Controller  Cloud/Region       Version  SLA          Timestamp
kubeflow  manual      k8s-cloud/uksouth  3.4.4    unsupported  08:38:56Z

SAAS        Status  Store  URL
grafana     active  local  admin/cos.grafana
prometheus  active  local  admin/cos.prometheus

App                        Version                  Status  Scale  Charm                    Channel          Rev  Address       Exposed  Message
admission-webhook                                   active      1  admission-webhook        1.8/stable       301                no
argo-controller                                     active      1  argo-controller          3.3.10/stable    424                no
dex-auth                                            active      1  dex-auth                 2.36/stable      422                no
envoy                      res:oci-image@cc06b3e    active      1  envoy                    2.0/stable       194                no
grafana-agent-k8s          0.40.4                   active      1  grafana-agent-k8s        latest/edge       80                no       logging-consumer: off
istio-ingressgateway                                active      1  istio-gateway            1.17/stable     1000                no
istio-pilot                                         active      1  istio-pilot              1.17/stable     1011                no
jupyter-controller                                  active      1  jupyter-controller       1.8/stable       849                no
jupyter-ui                                          active      1  jupyter-ui               1.8/stable       858                no
katib-controller           res:oci-image@31ccd70    active      1  katib-controller         0.16/stable      576                no
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153                no
katib-db-manager                                    active      1  katib-db-manager         0.16/stable      539                no
katib-ui                                            active      1  katib-ui                 0.16/stable      422                no
kfp-api                                             active      1  kfp-api                  2.0/stable      1283                no
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153                no
kfp-metadata-writer                                 active      1  kfp-metadata-writer      2.0/stable       334                no
kfp-persistence                                     active      1  kfp-persistence          2.0/stable      1291                no
kfp-profile-controller                              active      1  kfp-profile-controller   2.0/stable      1315                no
kfp-schedwf                                         active      1  kfp-schedwf              2.0/stable      1466                no
kfp-ui                                              active      1  kfp-ui                   2.0/stable      1285                no
kfp-viewer                                          active      1  kfp-viewer               2.0/stable      1317                no
kfp-viz                                             active      1  kfp-viz                  2.0/stable      1235                no
knative-eventing                                    active      1  knative-eventing         1.10/stable      353                no
knative-operator                                    active      1  knative-operator         1.10/stable      328                no
knative-serving                                     active      1  knative-serving          1.10/stable      409                no
kserve-controller                                   active      1  kserve-controller        0.11/stable      523                no
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.8/stable       582                no
kubeflow-profiles                                   active      1  kubeflow-profiles        1.8/stable       355                no
kubeflow-roles                                      active      1  kubeflow-roles           1.8/stable       187                no
kubeflow-volumes           res:oci-image@2261827    active      1  kubeflow-volumes         1.8/stable       260                no
metacontroller-operator                             active      1  metacontroller-operator  3.0/stable       252                no
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.8/stable   278                no
mlflow-mysql               8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       153                no
mlflow-server                                       active      1  mlflow-server            2.1/stable       466                no
mlmd                       res:oci-image@44abc5d    active      1  mlmd                     1.14/stable      127                no
oidc-gatekeeper                                     active      1  oidc-gatekeeper          ckf-1.8/stable   350                no
pvcviewer-operator                                  active      1  pvcviewer-operator       1.8/stable        30                no
resource-dispatcher                                 active      1  resource-dispatcher      1.0/stable        93                no
seldon-controller-manager                           active      1  seldon-core              1.17/stable      664                no
tensorboard-controller                              active      1  tensorboard-controller   1.8/stable       257                no
tensorboards-web-app                                active      1  tensorboards-web-app     1.8/stable       245                no
training-operator                                   active      1  training-operator        1.7/stable       347                no

Unit                          Workload  Agent  Address      Ports          Message
admission-webhook/0*          active    idle
argo-controller/0*            active    idle
dex-auth/0*                   active    idle
envoy/0*                      active    idle                9090,9901/TCP
grafana-agent-k8s/0*          active    idle                               logging-consumer: off
istio-ingressgateway/0*       active    idle
istio-pilot/0*                active    idle
jupyter-controller/0*         active    idle
jupyter-ui/0*                 active    idle
katib-controller/0*           active    idle                443,8080/TCP
katib-db-manager/0*           active    idle
katib-db/0*                   active    idle                               Primary
katib-ui/0*                   active    idle
kfp-api/0*                    active    idle
kfp-db/0*                     active    idle                               Primary
kfp-metadata-writer/0*        active    idle
kfp-persistence/0*            active    idle
kfp-profile-controller/0*     active    idle
kfp-schedwf/0*                active    idle
kfp-ui/0*                     active    idle
kfp-viewer/0*                 active    idle
kfp-viz/0*                    active    idle
knative-eventing/0*           active    idle
knative-operator/0*           active    idle
knative-serving/0*            active    idle
kserve-controller/0*          active    idle
kubeflow-dashboard/0*         active    idle
kubeflow-profiles/0*          active    idle
kubeflow-roles/0*             active    idle
kubeflow-volumes/2*           active    idle                5000/TCP
metacontroller-operator/0*    active    idle
minio/0*                      active    idle                9000-9001/TCP
mlflow-mysql/0*               active    idle                               Primary
mlflow-server/0*              active    idle
mlmd/0*                       active    idle                8080/TCP
oidc-gatekeeper/0*            active    idle
pvcviewer-operator/0*         active    idle
resource-dispatcher/0*        active    idle
seldon-controller-manager/0*  active    idle
tensorboard-controller/0*     active    idle
tensorboards-web-app/0*       active    idle
training-operator/0*          active    idle
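
Rather than reading the status table by eye, the "everything is active" check can be scripted. The sketch below is not part of the original investigation: `non_idle_units` is a hypothetical helper, and it assumes the plain-text `juju status` layout shown above, where unit rows are the only rows whose first column contains a `/`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads plain-text `juju status` output on stdin and
# prints every unit whose workload is not "active" or whose agent is not "idle".
non_idle_units() {
  # Assumption: only unit rows have a "/" in the first column (e.g. kfp-api/0*).
  awk '$1 ~ /\// && ($2 != "active" || $3 != "idle") { print $1, $2, $3 }'
}

# Example usage (model name taken from the status output above):
#   juju status -m kubeflow | non_idle_units
```

An empty result would mean that every unit reports active/idle, as in the output above.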

Then I set up Python 3.8 with tox:

sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update -y
sudo apt install python3.8 python3.8-distutils python3.8-venv -y
python3.8 -m pip install tox
export PATH=$PATH:/home/ubuntu/.local/bin
tox --version

Then I cloned the UATs repo and ran the UATs from the main branch:

git clone
cd charmed-kubeflow-uats/
git checkout main 
tox -e kubeflow-remote

The tests passed:

platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.8
cachedir: .pytest_cache
rootdir: /tests/.worktrees/fe86b4e255c4c695376f70061a6a645301350d5a/tests
configfile: pytest.ini
plugins: anyio-3.6.2
collecting ... collected 9 items / 4 deselected / 5 selected
[katib-integration]
-------------------------------- live log call ---------------------------------
INFO Running katib-integration.ipynb...
PASSED                                                                   [ 20%]
[kfp-v1-integration]
-------------------------------- live log call ---------------------------------
INFO Running kfp-v1-integration.ipynb...
PASSED                                                                   [ 40%]
[kfp-v2-integration]
-------------------------------- live log call ---------------------------------
INFO Running kfp-v2-integration.ipynb...
PASSED                                                                   [ 60%]
[kserve-integration]
-------------------------------- live log call ---------------------------------
INFO Running kserve-integration.ipynb...
PASSED                                                                   [ 80%]
[training-integration]
-------------------------------- live log call ---------------------------------
INFO Running training-integration.ipynb...
PASSED                                                                   [100%]

================= 5 passed, 4 deselected in 1087.81s (0:18:07) =================
--------------------------------------------------------------------------------------------- live log teardown ----------------------------------------------------------------------------------------------
INFO Deleting Profile test-kubeflow...
INFO HTTP Request: DELETE "HTTP/1.1 200 OK"
INFO Deleting Job test-kubeflow/test-kubeflow...
INFO HTTP Request: DELETE "HTTP/1.1 200 OK"

======================================================================================= 2 passed in 1154.07s (0:19:14) =======================================================================================
  kubeflow-remote: OK (1178.19=setup[22.82]+cmd[1155.36] seconds)
  congratulations :) (1178.33 seconds)

Here is the pod watch output from the test namespace, which shows no problems:

ubuntu@vu34wtsmbwx56BootstrapVm:~$ kubectl get po -n test-kubeflow --watch
NAME                                              READY   STATUS            RESTARTS   AGE
ml-pipeline-ui-artifact-6b89ccc469-2b72n          2/2     Running           0          48s
ml-pipeline-visualizationserver-955b54775-nkvg8   0/2     PodInitializing   0          48s
test-kubeflow-dx6vv                               2/2     Running           0          49s
cmaes-example-cmaes-75f5b9d5dd-r8m8k              0/1     Pending           0          0s
cmaes-example-cmaes-75f5b9d5dd-r8m8k              0/1     Pending           0          0s
cmaes-example-cmaes-75f5b9d5dd-r8m8k              0/1     ContainerCreating   0          0s
cmaes-example-cmaes-75f5b9d5dd-r8m8k              0/1     Running             0          5s
cmaes-example-cmaes-75f5b9d5dd-r8m8k              1/1     Running             0          21s
cmaes-example-4qnb4dgb-6fzmt                      0/2     Pending             0          0s
cmaes-example-4qnb4dgb-6fzmt                      0/2     Pending             0          0s
cmaes-example-bzqd6kft-f6rvw                      0/2     Pending             0          0s
cmaes-example-bzqd6kft-f6rvw                      0/2     Pending             0          0s
cmaes-example-4qnb4dgb-6fzmt                      0/2     ContainerCreating   0          0s
cmaes-example-bzqd6kft-f6rvw                      0/2     ContainerCreating   0          0s
ml-pipeline-visualizationserver-955b54775-nkvg8   1/2     Running             0          111s
ml-pipeline-visualizationserver-955b54775-nkvg8   2/2     Running             0          111s
cmaes-example-4qnb4dgb-6fzmt                      2/2     Running             0          42s
cmaes-example-bzqd6kft-f6rvw                      2/2     Running             0          45s
cmaes-example-bzqd6kft-f6rvw                      1/2     NotReady            0          90s
cmaes-example-bzqd6kft-f6rvw                      0/2     Completed           0          92s
cmaes-example-bzqd6kft-f6rvw                      0/2     Completed           0          93s
cmaes-example-bzqd6kft-f6rvw                      0/2     Completed           0          94s
cmaes-example-bzqd6kft-f6rvw                      0/2     Completed           0          94s
cmaes-example-bzqd6kft-f6rvw                      0/2     Terminating         0          94s
cmaes-example-bzqd6kft-f6rvw                      0/2     Terminating         0          94s
cmaes-example-xqr2mc8k-6gn27                      0/2     Pending             0          0s
cmaes-example-xqr2mc8k-6gn27                      0/2     Pending             0          0s
cmaes-example-xqr2mc8k-6gn27                      0/2     ContainerCreating   0          0s
cmaes-example-4qnb4dgb-6fzmt                      1/2     NotReady            0          95s
cmaes-example-4qnb4dgb-6fzmt                      0/2     Completed           0          96s
cmaes-example-xqr2mc8k-6gn27                      2/2     Running             0          2s
cmaes-example-4qnb4dgb-6fzmt                      0/2     Completed           0          98s
cmaes-example-4qnb4dgb-6fzmt                      0/2     Completed           0          98s
cmaes-example-4qnb4dgb-6fzmt                      0/2     Completed           0          99s
cmaes-example-4qnb4dgb-6fzmt                      0/2     Terminating         0          99s
cmaes-example-4qnb4dgb-6fzmt                      0/2     Terminating         0          99s
cmaes-example-xqr2mc8k-6gn27                      1/2     NotReady            0          58s
cmaes-example-xqr2mc8k-6gn27                      0/2     Completed           0          60s
cmaes-example-xqr2mc8k-6gn27                      0/2     Completed           0          61s
cmaes-example-xqr2mc8k-6gn27                      0/2     Completed           0          62s
cmaes-example-xqr2mc8k-6gn27                      0/2     Completed           0          62s
cmaes-example-xqr2mc8k-6gn27                      0/2     Terminating         0          62s
cmaes-example-xqr2mc8k-6gn27                      0/2     Terminating         0          62s
cmaes-example-cmaes-75f5b9d5dd-r8m8k              1/1     Terminating         0          2m58s
cmaes-example-cmaes-75f5b9d5dd-r8m8k              0/1     Terminating         0          2m59s
cmaes-example-cmaes-75f5b9d5dd-r8m8k              0/1     Terminating         0          3m
cmaes-example-cmaes-75f5b9d5dd-r8m8k              0/1     Terminating         0          3m
cmaes-example-cmaes-75f5b9d5dd-r8m8k              0/1     Terminating         0          3m
calculation-pipeline-wkrc2-4050137206             0/2     Pending             0          0s
calculation-pipeline-wkrc2-4050137206             0/2     Pending             0          0s
calculation-pipeline-wkrc2-4050137206             0/2     Init:0/1            0          0s
calculation-pipeline-wkrc2-3212673545             0/2     Pending             0          0s
calculation-pipeline-wkrc2-3212673545             0/2     Pending             0          0s
calculation-pipeline-wkrc2-3212673545             0/2     Init:0/1            0          0s
calculation-pipeline-wkrc2-4050137206             0/2     Init:0/1            0          0s
calculation-pipeline-wkrc2-3212673545             0/2     Init:0/1            0          0s
calculation-pipeline-wkrc2-4050137206             0/2     Init:0/1            0          8s
calculation-pipeline-wkrc2-3212673545             0/2     Init:0/1            0          9s
calculation-pipeline-wkrc2-4050137206             0/2     PodInitializing     0          9s
calculation-pipeline-wkrc2-3212673545             0/2     PodInitializing     0          10s
calculation-pipeline-wkrc2-4050137206             2/2     Running             0          29s
calculation-pipeline-wkrc2-3212673545             2/2     Running             0          30s
calculation-pipeline-wkrc2-4050137206             2/2     Running             0          30s
calculation-pipeline-wkrc2-4050137206             2/2     Running             0          31s
calculation-pipeline-wkrc2-3212673545             2/2     Running             0          31s
calculation-pipeline-wkrc2-3212673545             2/2     Running             0          31s
calculation-pipeline-wkrc2-3212673545             0/2     Completed           0          35s
calculation-pipeline-wkrc2-4050137206             0/2     Completed           0          35s
calculation-pipeline-wkrc2-3212673545             0/2     Completed           0          37s
calculation-pipeline-wkrc2-4050137206             0/2     Completed           0          37s
calculation-pipeline-wkrc2-4050137206             0/2     Completed           0          37s
calculation-pipeline-wkrc2-3212673545             0/2     Completed           0          37s
calculation-pipeline-wkrc2-3195895926             0/2     Pending             0          0s
calculation-pipeline-wkrc2-3195895926             0/2     Pending             0          0s
calculation-pipeline-wkrc2-3195895926             0/2     Init:0/1            0          0s
calculation-pipeline-wkrc2-4050137206             0/2     Completed           0          39s
calculation-pipeline-wkrc2-3212673545             0/2     Completed           0          39s
calculation-pipeline-wkrc2-3195895926             0/2     Init:0/1            0          0s
calculation-pipeline-wkrc2-3195895926             0/2     Init:0/1            0          1s
calculation-pipeline-wkrc2-3195895926             0/2     PodInitializing     0          2s
calculation-pipeline-wkrc2-3195895926             2/2     Running             0          3s
calculation-pipeline-wkrc2-3195895926             1/2     NotReady            0          4s
calculation-pipeline-wkrc2-3195895926             1/2     NotReady            0          4s
calculation-pipeline-wkrc2-3195895926             1/2     NotReady            0          5s
calculation-pipeline-wkrc2-3195895926             0/2     Completed           0          5s
calculation-pipeline-wkrc2-3195895926             0/2     Completed           0          6s
calculation-pipeline-wkrc2-3195895926             0/2     Completed           0          7s
calculation-pipeline-wkrc2-3195895926             0/2     Completed           0          10s
condition-v2-x6h2p-502777903                      0/2     Pending             0          0s
condition-v2-x6h2p-502777903                      0/2     Pending             0          0s
condition-v2-x6h2p-502777903                      0/2     Init:0/1            0          0s
condition-v2-x6h2p-502777903                      0/2     PodInitializing     0          1s
condition-v2-x6h2p-502777903                      2/2     Running             0          6s
condition-v2-x6h2p-502777903                      2/2     Running             0          7s
condition-v2-x6h2p-502777903                      0/2     Completed           0          7s
condition-v2-x6h2p-502777903                      0/2     Completed           0          9s
condition-v2-x6h2p-502777903                      0/2     Completed           0          9s
condition-v2-x6h2p-3683981472                     0/2     Pending             0          0s
condition-v2-x6h2p-3683981472                     0/2     Pending             0          0s
condition-v2-x6h2p-3683981472                     0/2     Init:0/1            0          0s
condition-v2-x6h2p-502777903                      0/2     Completed           0          10s
condition-v2-x6h2p-3683981472                     0/2     PodInitializing     0          1s
condition-v2-x6h2p-3683981472                     2/2     Running             0          2s
condition-v2-x6h2p-3683981472                     1/2     NotReady            0          3s
condition-v2-x6h2p-3683981472                     1/2     NotReady            0          4s
condition-v2-x6h2p-3683981472                     0/2     Completed           0          4s
condition-v2-x6h2p-3683981472                     0/2     Completed           0          6s
condition-v2-x6h2p-3683981472                     0/2     Completed           0          6s
condition-v2-x6h2p-135267782                      0/2     Pending             0          0s
condition-v2-x6h2p-135267782                      0/2     Pending             0          0s
condition-v2-x6h2p-135267782                      0/2     Init:0/2            0          0s
condition-v2-x6h2p-3683981472                     0/2     Completed           0          10s
condition-v2-x6h2p-135267782                      0/2     Init:1/2            0          1s
condition-v2-x6h2p-135267782                      0/2     Init:1/2            0          5s
condition-v2-x6h2p-135267782                      0/2     PodInitializing     0          6s
condition-v2-x6h2p-135267782                      2/2     Running             0          7s
condition-v2-x6h2p-135267782                      0/2     Completed           0          14s
condition-v2-x6h2p-135267782                      0/2     Completed           0          16s
condition-v2-x6h2p-135267782                      0/2     Completed           0          16s
condition-v2-x6h2p-884988224                      0/2     Pending             0          0s
condition-v2-x6h2p-884988224                      0/2     Pending             0          0s
condition-v2-x6h2p-884988224                      0/2     Init:0/1            0          0s
condition-v2-x6h2p-756913840                      0/2     Pending             0          0s
condition-v2-x6h2p-756913840                      0/2     Pending             0          0s
condition-v2-x6h2p-756913840                      0/2     Init:0/1            0          0s
condition-v2-x6h2p-135267782                      0/2     Completed           0          25s
condition-v2-x6h2p-884988224                      0/2     PodInitializing     0          2s
condition-v2-x6h2p-884988224                      2/2     Running             0          3s
condition-v2-x6h2p-884988224                      1/2     NotReady            0          4s
condition-v2-x6h2p-884988224                      1/2     NotReady            0          4s
condition-v2-x6h2p-884988224                      0/2     Completed           0          5s
condition-v2-x6h2p-884988224                      0/2     Completed           0          6s
condition-v2-x6h2p-884988224                      0/2     Completed           0          7s
condition-v2-x6h2p-756913840                      0/2     Init:0/1            0          8s
condition-v2-x6h2p-884988224                      0/2     Completed           0          10s
condition-v2-x6h2p-756913840                      0/2     PodInitializing     0          10s
condition-v2-x6h2p-756913840                      2/2     Running             0          14s
condition-v2-x6h2p-756913840                      2/2     Running             0          15s
condition-v2-x6h2p-756913840                      0/2     Completed           0          16s
condition-v2-x6h2p-756913840                      0/2     Completed           0          17s
condition-v2-x6h2p-756913840                      0/2     Completed           0          18s
condition-v2-x6h2p-3477408950                     0/2     Pending             0          0s
condition-v2-x6h2p-3477408950                     0/2     Pending             0          0s
condition-v2-x6h2p-3477408950                     0/2     Init:0/2            0          0s
condition-v2-x6h2p-756913840                      0/2     Completed           0          20s
condition-v2-x6h2p-3477408950                     0/2     Init:1/2            0          2s
condition-v2-x6h2p-3477408950                     0/2     PodInitializing     0          3s
condition-v2-x6h2p-3477408950                     2/2     Running             0          4s
condition-v2-x6h2p-3477408950                     0/2     Completed           0          11s
condition-v2-x6h2p-3477408950                     0/2     Completed           0          12s
condition-v2-x6h2p-3477408950                     0/2     Completed           0          13s
condition-v2-x6h2p-3477408950                     0/2     Completed           0          21s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     Pending             0          0s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     Pending             0          0s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     Init:0/1            0          0s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     Init:0/1            0          15s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     PodInitializing     0          26s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   1/2     Running             0          46s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   2/2     Running             0          60s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   2/2     Terminating         0          62s
mnist-chief-0                                              0/1     Pending             0          0s
mnist-chief-0                                              0/1     Pending             0          0s
mnist-chief-0                                              0/1     ContainerCreating   0          0s
mnist-ps-0                                                 0/1     Pending             0          0s
mnist-ps-0                                                 0/1     Pending             0          0s
mnist-ps-0                                                 0/1     ContainerCreating   0          0s
mnist-worker-0                                             0/1     Pending             0          0s
mnist-worker-0                                             0/1     Pending             0          0s
mnist-worker-0                                             0/1     ContainerCreating   0          0s
mnist-worker-1                                             0/1     Pending             0          0s
mnist-worker-1                                             0/1     Pending             0          0s
mnist-worker-1                                             0/1     ContainerCreating   0          0s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   1/2     Terminating         0          90s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     Terminating         0          94s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     Terminating         0          94s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     Terminating         0          94s
sklearn-iris-predictor-00001-deployment-6f6f6bc97b-4qmp8   0/2     Terminating         0          94s
mnist-chief-0                                              1/1     Running             0          27s
mnist-worker-0                                             1/1     Running             0          27s
mnist-ps-0                                                 1/1     Running             0          28s
mnist-worker-1                                             1/1     Running             0          29s
mnist-worker-0                                             0/1     Completed           0          2m36s
mnist-chief-0                                              0/1     Completed           0          2m37s
mnist-worker-0                                             0/1     Completed           0          2m38s
mnist-chief-0                                              0/1     Completed           0          2m38s
mnist-worker-0                                             0/1     Completed           0          2m38s
mnist-chief-0                                              0/1     Completed           0          2m39s
mnist-worker-0                                             0/1     Terminating         0          3m
mnist-worker-1                                             1/1     Terminating         0          3m
mnist-chief-0                                              0/1     Terminating         0          3m
mnist-chief-0                                              0/1     Terminating         0          3m
mnist-ps-0                                                 1/1     Terminating         0          3m
mnist-worker-0                                             0/1     Terminating         0          3m
pytorch-mnist-gloo-master-0                                0/1     Pending             0          0s
pytorch-mnist-gloo-master-0                                0/1     Pending             0          0s
pytorch-mnist-gloo-master-0                                0/1     ContainerCreating   0          0s
pytorch-mnist-gloo-worker-0                                0/1     Pending             0          0s
pytorch-mnist-gloo-worker-0                                0/1     Pending             0          0s
pytorch-mnist-gloo-worker-0                                0/1     Init:0/1            0          0s
mnist-ps-0                                                 0/1     Terminating         0          3m3s
mnist-ps-0                                                 0/1     Terminating         0          3m3s
mnist-ps-0                                                 0/1     Terminating         0          3m3s
mnist-ps-0                                                 0/1     Terminating         0          3m3s
mnist-worker-1                                             0/1     Terminating         0          3m3s
mnist-worker-1                                             0/1     Terminating         0          3m4s
mnist-worker-1                                             0/1     Terminating         0          3m4s
mnist-worker-1                                             0/1     Terminating         0          3m4s
pytorch-mnist-gloo-master-0                                1/1     Running             0          78s
pytorch-mnist-gloo-worker-0                                0/1     Init:0/1            0          80s
pytorch-mnist-gloo-worker-0                                0/1     PodInitializing     0          85s
pytorch-mnist-gloo-worker-0                                1/1     Running             0          86s
pytorch-mnist-gloo-worker-0                                0/1     Completed           0          4m26s
pytorch-mnist-gloo-master-0                                0/1     Completed           0          4m27s
pytorch-mnist-gloo-worker-0                                0/1     Completed           0          4m27s
pytorch-mnist-gloo-worker-0                                0/1     Completed           0          4m28s
pytorch-mnist-gloo-master-0                                0/1     Completed           0          4m28s
pytorch-mnist-gloo-master-0                                0/1     Completed           0          4m29s
pytorch-mnist-gloo-worker-0                                0/1     Terminating         0          4m30s
pytorch-mnist-gloo-master-0                                0/1     Terminating         0          4m30s
pytorch-mnist-gloo-worker-0                                0/1     Terminating         0          4m30s
pytorch-mnist-gloo-master-0                                0/1     Terminating         0          4m30s
paddle-simple-cpu-worker-0                                 0/1     Pending             0          0s
paddle-simple-cpu-worker-0                                 0/1     Pending             0          0s
paddle-simple-cpu-worker-0                                 0/1     ContainerCreating   0          0s
paddle-simple-cpu-worker-1                                 0/1     Pending             0          0s
paddle-simple-cpu-worker-1                                 0/1     Pending             0          0s
paddle-simple-cpu-worker-1                                 0/1     ContainerCreating   0          0s
paddle-simple-cpu-worker-1                                 1/1     Running             0          78s
paddle-simple-cpu-worker-0                                 1/1     Running             0          79s
paddle-simple-cpu-worker-0                                 0/1     Completed           0          96s
paddle-simple-cpu-worker-1                                 0/1     Completed           0          96s
paddle-simple-cpu-worker-0                                 0/1     Completed           0          98s
paddle-simple-cpu-worker-1                                 0/1     Completed           0          98s
paddle-simple-cpu-worker-1                                 0/1     Completed           0          98s
paddle-simple-cpu-worker-0                                 0/1     Completed           0          98s
paddle-simple-cpu-worker-1                                 0/1     Terminating         0          2m
paddle-simple-cpu-worker-0                                 0/1     Terminating         0          2m
paddle-simple-cpu-worker-1                                 0/1     Terminating         0          2m
paddle-simple-cpu-worker-0                                 0/1     Terminating         0          2m
test-kubeflow-dx6vv                                        1/2     NotReady            0          18m
test-kubeflow-dx6vv                                        0/2     Completed           0          19m
test-kubeflow-dx6vv                                        0/2     Completed           0          19m
test-kubeflow-dx6vv                                        0/2     Completed           0          19m
test-kubeflow-dx6vv                                        0/2     Completed           0          19m
test-kubeflow-dx6vv                                        0/2     Completed           0          19m
calculation-pipeline-wkrc2-3195895926                      0/2     Terminating         0          14m
calculation-pipeline-wkrc2-3195895926                      0/2     Terminating         0          14m
calculation-pipeline-wkrc2-3212673545                      0/2     Terminating         0          14m
calculation-pipeline-wkrc2-3212673545                      0/2     Terminating         0          14m
calculation-pipeline-wkrc2-4050137206                      0/2     Terminating         0          14m
calculation-pipeline-wkrc2-4050137206                      0/2     Terminating         0          14m
condition-v2-x6h2p-135267782                               0/2     Terminating         0          13m
condition-v2-x6h2p-135267782                               0/2     Terminating         0          13m
condition-v2-x6h2p-3477408950                              0/2     Terminating         0          12m
condition-v2-x6h2p-3477408950                              0/2     Terminating         0          12m
condition-v2-x6h2p-3683981472                              0/2     Terminating         0          13m
condition-v2-x6h2p-3683981472                              0/2     Terminating         0          13m
condition-v2-x6h2p-502777903                               0/2     Terminating         0          13m
condition-v2-x6h2p-502777903                               0/2     Terminating         0          13m
condition-v2-x6h2p-756913840                               0/2     Terminating         0          13m
condition-v2-x6h2p-756913840                               0/2     Terminating         0          13m
condition-v2-x6h2p-884988224                               0/2     Terminating         0          13m
condition-v2-x6h2p-884988224                               0/2     Terminating         0          13m
ml-pipeline-ui-artifact-6b89ccc469-2b72n                   2/2     Terminating         0          19m
ml-pipeline-visualizationserver-955b54775-nkvg8            2/2     Terminating         0          19m
test-kubeflow-dx6vv                                        0/2     Terminating         0          19m
test-kubeflow-dx6vv                                        0/2     Terminating         0          19m
ml-pipeline-visualizationserver-955b54775-nkvg8            0/2     Terminating         0          19m
ml-pipeline-ui-artifact-6b89ccc469-2b72n                   0/2     Terminating         0          19m
ml-pipeline-ui-artifact-6b89ccc469-2b72n                   0/2     Terminating         0          19m
ml-pipeline-ui-artifact-6b89ccc469-2b72n                   0/2     Terminating         0          19m
ml-pipeline-ui-artifact-6b89ccc469-2b72n                   0/2     Terminating         0          19m
ml-pipeline-visualizationserver-955b54775-nkvg8            0/2     Terminating         0          19m
ml-pipeline-visualizationserver-955b54775-nkvg8            0/2     Terminating         0          19m
ml-pipeline-visualizationserver-955b54775-nkvg8            0/2     Terminating         0          19m

Here is the experiments log, which also shows no problems:

ubuntu@vu34wtsmbwx56BootstrapVm:~$ kubectl get experiment -n test-kubeflow --watch
NAME            TYPE      STATUS   AGE
cmaes-example   Created   True     19s
cmaes-example   Running   True     22s
cmaes-example   Running   True     22s
cmaes-example   Running   True     22s
cmaes-example   Running   True     116s
cmaes-example   Running   True     116s
cmaes-example   Running   True     116s
cmaes-example   Running   True     2m1s
cmaes-example   Running   True     2m1s
cmaes-example   Succeeded   True     2m58s
cmaes-example   Succeeded   True     3m5s
cmaes-example   Succeeded   True     3m5s

I have run the test twice without problems.

misohu commented 16 hours ago

After further investigation, we found that the error was caused by running the tox environment with Python 3.10. Switching to Python 3.8 resolved the issue. The tests are supposed to be executed on Python 3.8.
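
For anyone hitting the same 401s, a small pre-flight check can catch the interpreter mismatch before the UATs start. The following is only a sketch: the names `REQUIRED`, `is_supported` and `describe` are made up for illustration and are not part of the UAT repository, and the required version is taken from the comment above.

```python
import sys

# Assumption: the UATs expect Python 3.8, as stated in this thread.
REQUIRED = (3, 8)


def is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor version matches the required one."""
    return tuple(version_info[:2]) == tuple(required)


def describe(version_info=sys.version_info, required=REQUIRED):
    """Build a human-readable message about the interpreter version."""
    found = ".".join(str(part) for part in version_info[:2])
    wanted = ".".join(str(part) for part in required)
    if is_supported(version_info, required):
        return f"OK: running Python {found}"
    return f"Mismatch: running Python {found}, UATs expect {wanted}"


if __name__ == "__main__":
    # Print the result for the interpreter that is about to run the tests.
    print(describe())
```

Running this with the same interpreter that tox would use should make it obvious whether the environment is on 3.10 rather than 3.8.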