canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

Cannot run DSL - Control structures Example workflow #53

Closed Barteus closed 2 years ago

Barteus commented 2 years ago

Deployed CKF+MLFlow on AKS. Minio is configured to use gateway mode.

When trying to run the [Tutorial] DSL - Control structures pipeline, I got the error below.

Error:

{"error":"Failed to create a new run.: Failed to fetch workflow spec manifest.: ResourceNotFoundError: WorkflowSpecManifest Run of [Tutorial] DSL - Control structures (d640b) not found.","code":5,"message":"Failed to create a new run.: Failed to fetch workflow spec manifest.: ResourceNotFoundError: WorkflowSpecManifest Run of [Tutorial] DSL - Control structures (d640b) not found.","details":[{"@type":"type.googleapis.com/api.Error","error_message":"WorkflowSpecManifest Run of [Tutorial] DSL - Control structures (d640b) not found.","error_details":"Failed to create a new run.: Failed to fetch workflow spec manifest.: ResourceNotFoundError: WorkflowSpecManifest Run of [Tutorial] DSL - Control structures (d640b) not found."}]}

Expected: the pipeline runs and finishes execution correctly.
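For anyone hitting this, a first diagnostic step is to watch the kfp-api logs at the moment the run is created from the UI. A minimal sketch, assuming the default kubeflow namespace and Juju-style pod names such as kfp-api-0 (both assumptions, adjust to your deployment); the run wrapper only echoes the commands, so the sketch is safe to paste as-is:

```shell
#!/bin/sh
# Dry-run wrapper: echoes each command instead of executing it.
# Redefine as run() { "$@"; } to execute for real.
run() { echo "+ $*"; }

# List the pipeline-related pods (pod names are an assumption;
# Juju usually names units <charm>-0).
run kubectl get pods -n kubeflow

# Tail the kfp-api logs while re-submitting the run from the UI;
# the ResourceNotFoundError should show up here with more context.
run kubectl logs -n kubeflow kfp-api-0 --tail=100
```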

Barteus commented 2 years ago

The same happens for the [Tutorial] Data passing in python components pipeline.

DomFleischmann commented 2 years ago

Just to confirm, you have been able to run these pipelines successfully in other environments?

What k8s version did you use on AKS?

Barteus commented 2 years ago

Kubernetes version: 1.21.7. I did not try to use them in other environments.

ca-scribner commented 2 years ago

I don't recognise this error. I have successfully run these sample pipelines with our current 1.4 bundle recently, so they should work.

ca-scribner commented 2 years ago

I'm digging into this now. I think WorkflowSpecManifest is a column in the backing db, so that's my first place to look; I wonder if it isn't being populated correctly.

Barteus commented 2 years ago

I can run my own pipelines, but the problem occurs when running the workflows from the UI.

ca-scribner commented 2 years ago

Oh strange. Do you mean from a notebook you can define and execute a pipeline successfully, but trying to run something from the UI is what breaks?

Barteus commented 2 years ago

Yes, exactly. This looks like a problem with the UI or the example pipelines.

ca-scribner commented 2 years ago

I believe this is because kfp-api cannot access the minio store (a credentials issue). It could also be that the minio store didn't get initialized correctly (see the missing directory below), but I'm pretty sure it is the creds.

If we look into the minio store we see the mlpipeline folder is missing:

kubectl run minioclient -i --tty --rm --image minio/mc -n kubeflow --command -- /bin/sh
# inside the pod: register the minio endpoint (substitute the real credentials)
mc alias set myalias http://minio.kubeflow:9000 ACCESSKEY SECRETKEY
mc ls myalias

And similarly, if I look into the kfp-db:

kubectl run dbclient -i --tty --rm --image mariadb -n kubeflow -- /bin/bash
# inside the pod: connect to the KFP database
mysql -h kfp-db.kubeflow -u mysql -D mlpipeline -p
# (enter password)

That db looks correct; I at least see the entries that then show up in the UI.
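The db check can also be done non-interactively from the same client pod. A sketch, reusing the mysql user and mlpipeline database from the commands above; the column names in the second query are an assumption based on the KFP schema, and the run wrapper only echoes the commands:

```shell
#!/bin/sh
# Dry-run wrapper: echoes each command instead of executing it.
run() { echo "+ $*"; }

# List all tables in the KFP database; pipeline definitions and runs
# live in separate tables, which matters when migrating object storage.
run mysql -h kfp-db.kubeflow -u mysql -p -D mlpipeline -e 'SHOW TABLES;'

# Spot-check that the pipelines shown in the UI are present
# (UUID/Name column names are an assumption about the KFP schema).
run mysql -h kfp-db.kubeflow -u mysql -p -D mlpipeline -e 'SELECT UUID, Name FROM pipelines;'
```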

So my guess is the cause might be:

Barteus commented 2 years ago

The initial deployment looked like this:

  1. Deploy the default Charmed Kubeflow bundle
  2. Changes to bundle:
    • Charm revision to 56 from the edge (support for gateway mode)
    • Changes kfp-profile-controller channel to the edge (support for access from notebook to KFP by default)
  3. Create a pipeline run.

Barteus commented 2 years ago

The artefacts are not placed in minio after the relation changes. This may be connected to switching the S3 bucket, which requires a data migration.

Workaround for this:

  1. Log in to the container and check whether the pipelines you are missing are there. In my case:
    [root@minio-0 pipelines]# ls
    1753b558-a5c4-4ec8-81c3-67db77377ef1  6aa9dd33-92cd-429b-9881-021c3e74d50e
    [root@minio-0 pipelines]# pwd
    /data/mlpipeline/pipelines

    The /data volume persists, so you will not lose your data.

  2. Create a bucket and copy the pipelines to the new storage. There is no tar installed in the container; you can get in using kubectl exec <pod> -- /bin/bash and copy-paste the content.
  3. Then create the bucket mlpipeline with a pipelines folder and put the missing files there.

The same migration is needed when attaching a new minio, because pipeline definitions are stored in object storage, not in the DB. The runs, on the other hand, are in the DB, so the IDs and folder structure really matter.
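The workaround steps above can be sketched with mc, assuming the myalias alias from the earlier comment and that the rescued pipeline files have been copied into a local ./pipelines directory (both assumptions); the run wrapper only echoes the commands:

```shell
#!/bin/sh
# Dry-run wrapper: echoes each command instead of executing it.
run() { echo "+ $*"; }

# 1. Recreate the bucket KFP expects.
run mc mb myalias/mlpipeline

# 2. Copy the rescued pipeline files back, keeping the original
#    UUID-named folders, since the run records in the DB reference them.
run mc cp --recursive ./pipelines/ myalias/mlpipeline/pipelines/

# 3. Verify the layout matches what kfp-api expects.
run mc ls --recursive myalias/mlpipeline/
```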

ca-scribner commented 2 years ago

@Barteus during deployment, was there also a password change for minio somewhere along the way? Perhaps on the redeployment of minio with gateway mode you had to enter the S3 creds rather than the charm's defaults? If yes, this all makes sense.

To summarise (assuming the above is true):

The result of all this is:

ca-scribner commented 2 years ago

@Barteus I think we can close this with #56 merged, right? Or is there an aspect here that isn't solved yet? I'll close now, but if I was wrong please reopen.

ca-scribner commented 2 years ago

Closes canonical/minio-operator#32