Closed Barteus closed 2 years ago
Same happens for the [Tutorial] Data passing in python components pipeline.
Just to confirm, you have been able to run these pipelines successfully in other environments?
What k8s version did you use on AKS?
Kubernetes version: 1.21.7. I did not try to use them in other environments.
I don't recognise this error. I have successfully run these sample pipelines with our current 1.4 bundle recently, so they should work.
I'm digging into this now. I think WorkflowSpecManifest is a column in the backing db, so that's the first place I'll look. I wonder whether it isn't working correctly.
I can run my own pipelines, but the problem occurs when running the workflows from UI.
Oh strange. Do you mean from a notebook you can define and execute a pipeline successfully, but trying to run something from the UI is what breaks?
Yes, exactly. This looks like the problem with UI/example.
I believe this is because kfp-api cannot access the minio store (creds issue). It could also be that the minio store didn't get initialized correctly (see the missing dir below), but I'm pretty sure it is the creds.
I see:
- upload_pipeline from SDK does not work
- run_pipeline from SDK does work
I think this happens because running a pipeline directly from the SDK does not try to store the pipeline file in the minio store. During debugging, I would sometimes get an error message that included "The Access Key Id you provided does not exist in our records", which, judging from a Google search, appears to be a minio message. If we look into the minio store, we see the mlpipeline folder is missing:
kubectl run minioclient -i --tty --rm --image minio/mc -n kubeflow --command -- /bin/sh
mc alias set myalias http://minio.kubeflow:9000 ACCESSKEY SECRETKEY
mc ls myalias
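To check for the missing folder without reading the listing by eye, something like the following might help. It is only a sketch: the function name is made up for illustration, and the example target assumes the myalias alias from the commands above plus a bucket called mlpipeline with a pipelines/ prefix, which may differ in your deployment.

```shell
# Hypothetical helper: succeeds only if the given bucket/prefix lists at least one object.
# Assumes `mc` is installed and the alias was already configured with `mc alias set`.
has_objects() {
  target="$1"    # e.g. myalias/mlpipeline/pipelines/
  # Treat a failing `mc ls` (for example a missing bucket or bad creds) as "no objects".
  listing="$(mc ls "$target" 2>/dev/null)" || return 1
  # An empty listing means the prefix holds nothing.
  [ -n "$listing" ]
}

# Usage (inside the minioclient pod):
#   has_objects myalias/mlpipeline/pipelines/ && echo "pipelines present" || echo "pipelines missing"
```

Only the function is defined here; nothing touches the cluster until it is called.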
And similarly, if I look into the kfp-db:
kubectl run dbclient -i --tty --rm --image mariadb -n kubeflow -- /bin/bash
mysql -h kfp-db.kubeflow -u mysql -D mlpipeline -p
# (enter password)
that db looks correct (I at least see the entries that then show in the UI, etc).
So my guess is the cause might be:
The initial deployment looked like this:
The artefacts are not placed in minio after the relation changes. This can be connected to switching the S3 bucket, which requires data migration.
Workaround for this:
[root@minio-0 pipelines]# ls
1753b558-a5c4-4ec8-81c3-67db77377ef1 6aa9dd33-92cd-429b-9881-021c3e74d50e
[root@minio-0 pipelines]# pwd
/data/mlpipeline/pipelines
The /data volume is there so you will not lose your data.
kubectl exec <pod> -- /bin/bash
and copy-paste the content. The same needs to happen when you attach the new minio, because pipelines are stored on object storage, not in the DB. On the other hand, all runs are in the DB, so ids and folder structure really matter.
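As an alternative to copy-pasting by hand, the pipeline definitions could probably be pulled out of the pod with kubectl cp. This is a rough sketch, not something verified on this deployment: the function name is hypothetical, the kubeflow namespace default is an assumption, and kubectl cp needs tar to be available inside the minio container.

```shell
# Hypothetical helper: copy the pipeline definitions out of a minio pod into a local directory.
# Assumes the data lives under /data/mlpipeline/pipelines, as in the listing above,
# and that the container has `tar` installed (required by `kubectl cp`).
backup_pipelines() {
  pod="$1"                      # e.g. minio-0
  dest="$2"                     # local directory to copy into
  namespace="${3:-kubeflow}"    # assumed default namespace
  kubectl cp "${namespace}/${pod}:/data/mlpipeline/pipelines" "$dest"
}

# Usage:
#   backup_pipelines minio-0 ./pipelines-backup
```

Whatever tool is used to push the files into the new store, the folder names (the pipeline ids) have to stay exactly as they were, since the runs in the DB refer to them.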
@Barteus during deployment was there also a password change for minio somewhere along the way? Maybe because on the redeployment of minio w/gateway mode you had to enter the s3 creds rather than the default ones in the charm? If yes, this all makes sense.
To summarise (assuming above was true):
- [...] compositecontroller being created that manages kfp resources in user namespaces (including the mlpipeline-minio-artifact secret) via a webhook (which hits the kfp-profile-controller pod). This would result in the mlpipeline-minio-artifact secret holding the minio creds that existed at this time. Also, as part of the kfp-api being deployed(?), the minio store gets initialized with the pipelines/ dir, etc.
- [...] pipelines/ and the sample pipelines
- [...] kfp-profile-controller pod for the webhook both being redeployed with the updated minio creds, but it does not result in the mlpipeline-minio-artifact secret getting updated with these creds, because the compositecontroller/webhook system does not detect when a secret is stale and thus does not update it. Notably, if at this stage we also deleted the mlpipeline-minio-artifact secret, the compositecontroller would recreate it with the updated credentials, as we desire.

The result of all this is:
- [...] pipelines/ dir in the minio store was also cleared, so that's why we couldn't use those pipeline definitions
- [...] mlpipeline-minio-artifact secret, so their creds are stale. That is why, if we run a pipeline, we see all the steps fail with the message "The Access Key Id you provided does not exist in our records"
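If the stale-secret theory is right, it could be checked, and worked around, with something along these lines. Treat it as a sketch under assumptions: the function names are invented for illustration, and it assumes the secret keeps the access key under the data key accesskey, as in the upstream Kubeflow Pipelines manifests, which may not match this charm.

```shell
# Hypothetical helpers for spotting and refreshing a stale mlpipeline-minio-artifact secret.
# Assumption: the access key is stored under the data key `accesskey`.

# Print the access key currently stored in a user namespace's secret.
secret_access_key() {
  ns="$1"
  kubectl get secret mlpipeline-minio-artifact -n "$ns" \
    -o jsonpath='{.data.accesskey}' | base64 -d
}

# Succeed if the stored access key differs from the one minio currently expects.
secret_is_stale() {
  ns="$1"
  expected="$2"
  [ "$(secret_access_key "$ns")" != "$expected" ]
}

# Delete the secret so that the controller can recreate it with current credentials.
refresh_secret() {
  ns="$1"
  kubectl delete secret mlpipeline-minio-artifact -n "$ns"
}

# Usage (my-profile and CURRENT_ACCESS_KEY are placeholders):
#   secret_is_stale my-profile "$CURRENT_ACCESS_KEY" && refresh_secret my-profile
```

Deleting rather than patching the secret relies on the behaviour described above, where the compositecontroller recreates a missing secret with the updated credentials.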
@Barteus I think we can close this with #56 merged right? Or is there an aspect here that isn't solved yet? I'll close now but if I was wrong please reopen
Closes canonical/minio-operator#32
Deployed CKF+MLFlow on AKS. Minio is configured to use gateway mode.
When trying to run the DSL - Control structures pipeline, I got the error below.
Error:
Expected - Pipeline is Running and finishes execution correctly.