canonical / bundle-kubeflow

Charmed Kubeflow

mlflow secret used for seldon is not working (seldon-init-container-secret) #429

Closed Barteus closed 1 year ago

Barteus commented 2 years ago

When seldon-init-container-secret is used to deploy a model from MinIO, the credentials passed to rclone in the init container are wrong, and the Pod goes into Init:CrashLoopBackOff.

Log from init container:

2022/02/09 08:52:58 NOTICE: Config file "/config/rclone/rclone.conf" not found - using defaults
2022/02/09 08:52:58 DEBUG : rclone: Version "v1.55.1" starting with parameters ["rclone" "copy" "-vv" "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model" "/mnt/models"]
2022/02/09 08:52:58 DEBUG : Creating backend with remote "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model"
2022/02/09 08:52:58 Failed to create file system for "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model": didn't find section in config file

Expected log from the init container (irrelevant sections removed):

2022/02/09 08:37:56 NOTICE: Config file "/config/rclone/rclone.conf" not found - using defaults
2022/02/09 08:37:56 DEBUG : rclone: Version "v1.55.1" starting with parameters ["rclone" "copy" "-vv" "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model" "/mnt/models"]
2022/02/09 08:37:56 DEBUG : Creating backend with remote "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model"
2022/02/09 08:37:56 DEBUG : s3: detected overridden config - adding "{jVufS}" suffix to name
2022/02/09 08:37:56 DEBUG : pacer: low level retry 1/10 (error RequestError: send request failed
caused by: Head "http://minio.kubeflow.svc.cluster.local:9000/mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model": EOF)
2022/02/09 08:37:56 DEBUG : pacer: Rate limited, increasing sleep to 10ms
...
2022/02/09 08:37:58 DEBUG : pacer: low level retry 10/10 (error RequestError: send request failed
caused by: Head "http://minio.kubeflow.svc.cluster.local:9000/mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model": EOF)
2022/02/09 08:37:58 DEBUG : fs cache: renaming cache item "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model" to be canonical "s3{jVufS}:mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model"
2022/02/09 08:37:58 DEBUG : Creating backend with remote "/mnt/models"
2022/02/09 08:38:00 DEBUG : pacer: Reducing sleep to 1.5s
2022/02/09 08:38:00 DEBUG : Local file system at /mnt/models: Waiting for checks to finish
2022/02/09 08:38:00 DEBUG : Local file system at /mnt/models: Waiting for transfers to finish
2022/02/09 08:38:02 DEBUG : pacer: Reducing sleep to 1.125s
...
2022/02/09 08:38:06 DEBUG : pacer: Reducing sleep to 355.957031ms
2022/02/09 08:38:06 DEBUG : MLmodel: MD5 = a9bc2a512a382c222925645b10032c1c OK
2022/02/09 08:38:06 INFO  : MLmodel: Copied (new)
2022/02/09 08:38:07 DEBUG : pacer: Reducing sleep to 266.967773ms
2022/02/09 08:38:07 DEBUG : conda.yaml: MD5 = 8ff59fc0b665266cef1c86a60d92a006 OK
2022/02/09 08:38:07 INFO  : conda.yaml: Copied (new)
2022/02/09 08:38:07 DEBUG : pacer: Reducing sleep to 200.225829ms
2022/02/09 08:38:07 DEBUG : model.pkl: MD5 = 4423cef46a1eeb20736d88c980c11f3d OK
2022/02/09 08:38:07 INFO  : model.pkl: Copied (new)
2022/02/09 08:38:07 DEBUG : pacer: Reducing sleep to 150.169371ms
2022/02/09 08:38:07 DEBUG : requirements.txt: MD5 = 725d23405b4e11d989db76029151a90a OK
2022/02/09 08:38:07 INFO  : requirements.txt: Copied (new)
2022/02/09 08:38:07 INFO  : 
Transferred:        1.231k / 1.231 kBytes, 100%, 175 Bytes/s, ETA 0s
Transferred:            4 / 4, 100%
Elapsed time:        11.8s

2022/02/09 08:38:07 DEBUG : 5 go routines active

Working secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: bpk-seldon-init-container-secret
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: minio
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: <key>
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: <secret>
  RCLONE_CONFIG_S3_ENDPOINT: http://minio.kubeflow.svc.cluster.local:9000
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
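
For context, rclone builds remote definitions from environment variables of the form RCLONE_CONFIG_<REMOTE>_<OPTION>, so the variables above define a remote named "s3". Without the RCLONE_CONFIG_ prefix, rclone sees no such remote, which is exactly the "didn't find section in config file" failure above. A minimal sketch of the equivalent standalone config, with <key>/<secret> as placeholders:

# Sketch: the rclone.conf equivalent of the RCLONE_CONFIG_S3_* variables above.
cat > /tmp/rclone.conf <<'EOF'
[s3]
type = s3
provider = minio
access_key_id = <key>
secret_access_key = <secret>
endpoint = http://minio.kubeflow.svc.cluster.local:9000
env_auth = false
EOF
# Listing the bucket should now work (it fails with "didn't find section in
# config file" when the [s3] section is missing):
rclone --config /tmp/rclone.conf lsd s3:mlflow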

Deployment example.yaml:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        # We are setting a high failureThreshold as installing conda dependencies
        # can take a long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: classifier
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model
      envSecretRefName: bpk-seldon-init-container-secret
#      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

For more info about RCLONE auth configuration:

Barteus commented 2 years ago

The workaround is to manually apply a secret with the proper values to the namespace where inference will run.
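
One way to do that (a sketch, assuming jq is available and the user namespace is "admin"):

# Sketch: copy the mlflow-created secret into the namespace where the
# SeldonDeployment runs, stripping server-managed metadata first.
kubectl -n kubeflow get secret seldon-init-container-secret -o json \
  | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.selfLink, .metadata.annotations) | .metadata.namespace = "admin"' \
  | kubectl -n admin apply -f -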

Barteus commented 2 years ago

To replicate the issue:

  1. Run the notebook https://github.com/Barteus/kubeflow-examples/blob/main/seldon-mlflow-minio/mlflow-demo.ipynb and take the Minio path from the last cell's output.
  2. Apply the SeldonDeployment from this issue (but change the modelUri). This should fail when downloading the model from minio.

To fix it, apply the working secret above and your deployment should work.
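
If it still fails, the init container logs show what rclone actually did. A debugging sketch (the pod name is a placeholder; Seldon typically names the init container <container>-model-initializer, so classifier-model-initializer here, but verify with describe):

# Sketch: find the pod stuck in Init:CrashLoopBackOff and read its init logs.
kubectl -n admin get pods
kubectl -n admin describe pod <pod-name>   # lists init containers and events
kubectl -n admin logs <pod-name> -c classifier-model-initializer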

ca-scribner commented 2 years ago

@Barteus looking at this now. I see the issue raised but am not sure what your desired behaviour is from a fix. If someone has juju deploy'd mlflow and seldon, would you want:

  1. that they not need to specify an envSecretRefName at all and it "just works"
  2. that there be a secret like seldon-init-container-secret in each person's namespace that they can point to with envSecretRefName (and that that secret includes the RCLONE_ prefixes in the variable names)
  3. something else?

My guess is that you'd prefer (1) over (2), but (2) would still be good enough?

In the example you ran (sorry, end of day so I'll actually run it myself tomorrow), are you creating SeldonDeployments in kubeflow or in a user's namespace? My guess is in kubeflow, but I don't know if that's the appropriate test case. I'm not 100% sure from the code whether the secret is namespaced to the seldon controller or to the SeldonDeployment, but my guess would be the latter. Have you tested it that way? My expectation is that we need to put a secret with creds in every user's namespace, not just one in kubeflow.

Thoughts?

jardon commented 2 years ago

I did try to replicate this today but was unsuccessful. Having a Juju bundle would help reduce the time needed to replicate and confirm, so that more time can be spent actually fixing the issue.

Here are the steps that I took:

  1. Installed microk8s v1.21
  2. Deployed kubeflow-lite via juju
  3. Added all of the mlflow bits and related them appropriately
  4. Imported and executed the jupyter notebook (also had to update the minio config)
  5. Grabbed the model_uri and added it to the example.yaml
  6. Applied the example.yaml

ca-scribner commented 2 years ago

Summary of my understanding before I make any fixes

Current state

Any SeldonDeployment that uses a model stored in s3 must be provided with s3 credentials (endpoint, access/secret key, etc, as shown above) via a Secret, referenced in spec.predictors[].graph.envSecretRefName. This secret is formatted (as shown in the comments above) with the RCLONE_ prefix on everything (eg: RCLONE_CONFIG_S3_ACCESS_KEY_ID, etc) and must be in the same namespace as the SeldonDeployment.

At time of writing, our mlflow charm creates a secret seldon-init-container-secret with the s3 credentials but without the RCLONE_ prefix, and puts it in the same namespace mlflow is deployed to (likely the kubeflow model/namespace). Because it is in mlflow's namespace, it is not accessible to SeldonDeployments from users (secrets are not accessible across namespace boundaries). Even if it were in the correct namespace (or if a user copied the contents to their own namespace), the environment variables would still be missing the RCLONE_ prefix.
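
The namespace boundary is easy to confirm (a sketch; "admin" is an assumed user namespace):

# Sketch: the secret exists in mlflow's namespace but not in the user's.
kubectl -n kubeflow get secret seldon-init-container-secret   # found
kubectl -n admin get secret seldon-init-container-secret      # Error: not found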

Desired states

Some different desired states (they overlap a little and are easier to write as separate items, though ideally they would all be combined):

  1. Users of our MLOps offering should be able to easily commit a model to s3 (eg: via mlflow) and then deploy it via a SeldonDeployment without needing to know the details of their s3 config, how to format the secrets passed to seldon, etc. Ideally this works out of the box without fussing.
  2. Users should be able to commit/deploy from their own s3 storage (or, more likely, their own bucket within a shared s3 storage) using personal credentials instead of globally shared ones.

Some potentially useful things

Next steps

Near term

Achieving desired state (1) is tricky because we lack a way of putting secrets into all user namespaces. As an interim step, it is proposed that we update the seldon-init-container-secret generated by the mlflow charm to have the RCLONE_ prefixes and provide users with instructions for copying that secret. That at least makes things fairly easy for users.

Medium term

If we get a way to publish secrets to all namespaces (KF-220, upstream, or otherwise), we should:

Between these two things, users would then be able to deploy SeldonDeployments in their own namespaces with little effort.

Long term

Thought is needed to determine how we could implement something like the medium term solution but where every user's secret is populated with their own credentials. This might be similar to discussions about enabling full artifact isolation within kfp (eg: each user's kfp writes to a bucket of their own rather than to a global bucket). The challenges are similar between them.

ca-scribner commented 2 years ago

The Near Term solution is up for review in canonical/mlflow-operator#27, and a gist of how I tested it (following @Barteus's notebook pretty closely) is here. Note that when I tried to use that gist today, there was a package conflict between mlflow and whatever else is needed in the deployment. To get it to work I had to manually edit the requirements.txt and conda.yaml files in the minio store to include itsdangerous==2.0.1.

Update: I think the itsdangerous bug is the one described in canonical/seldon-core-operator#21; it is fixed upstream, but we need to update our seldon charm to use the newest images.

ca-scribner commented 2 years ago

Re the itsdangerous bug, you can use a more recent classifier image that has the fix. For example, you can do:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: seldonio/mlflowserver:1.14.0-dev  # <--whatever version you want.  1.14.0-dev worked for me
...

This overrides the default mlflowserver version.
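
To confirm the override took effect, something like the following should print the running image (a sketch; the seldon-deployment-id label and the namespace are assumptions):

# Sketch: print the image of the classifier container in the running pod.
kubectl -n admin get pods -l seldon-deployment-id=mlflow \
  -o jsonpath='{.items[*].spec.containers[?(@.name=="classifier")].image}'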

i-chvets commented 1 year ago

Jira

i-chvets commented 1 year ago

Deployed Kubeflow 1.6 and MLFlow. Retrieved the MLFlow Seldon secret:

$ microk8s.kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials -o=yaml
apiVersion: v1
data:
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
  RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
  RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
  RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
  RCLONE_CONFIG_S3_TYPE: czM=
kind: Secret
metadata:
  annotations:
    controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
    model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
  creationTimestamp: "2022-11-28T16:50:41Z"
  labels:
    app.kubernetes.io/managed-by: juju
    app.kubernetes.io/name: mlflow-server
  name: mlflow-server-seldon-init-container-s3-credentials
  namespace: kubeflow
  resourceVersion: "31415"
  selfLink: /api/v1/namespaces/kubeflow/secrets/mlflow-server-seldon-init-container-s3-credentials
  uid: 61784723-8d79-41e6-8ab0-e98cca836f16
type: Opaque
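
Decoding the endpoint shows the value that will matter later (a sketch):

# Sketch: decode the endpoint stored in the secret.
kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials \
  -o jsonpath='{.data.RCLONE_CONFIG_S3_ENDPOINT}' | base64 -d
# -> http://minio:9000 (no namespace, so it only resolves inside "kubeflow")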

Created a Seldon secret based on the above and added it to the user's namespace:

$ cat mlflow-server-seldon-init-container-s3-credentials.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-server-seldon-init-container-s3-credentials
  namespace: admin
type: Opaque
data:
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
  RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
  RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
  RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
  RCLONE_CONFIG_S3_TYPE: czM=
$ microk8s.kubectl -n admin apply -f mlflow-server-seldon-init-container-s3-credentials.yaml

Create/deploy a SeldonDeployment with modelUri pointing to a valid model in MLFlow.

$ cat sample-seldon-deployment.yaml 
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        # We are setting a high failureThreshold as installing conda dependencies
        # can take a long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: classifier
          image: seldonio/mlflowserver:1.14.1
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model
      envSecretRefName: mlflow-server-seldon-init-container-s3-credentials
      name: classifier
    name: default
    replicas: 1
$ microk8s.kubectl -n admin apply -f sample-seldon-deployment.yaml

At this point the solution fails. The following errors are observed in the classifier's init container:

2022/11/28 19:24:59 DEBUG : Setting endpoint="http://minio:9000" for "s3" from environment variable RCLONE_CONFIG_S3_ENDPOINT
2022/11/28 19:25:47 DEBUG : fs cache: renaming cache item "s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model" to be canonical "s3{EPcHk}:mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model"
2022/11/28 19:25:47 DEBUG : Creating backend with remote "/mnt/models"
2022/11/28 19:26:36 ERROR : S3 bucket mlflow path 0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model: error reading source root directory: RequestError: send request failed
caused by: Get "http://minio:9000/mlflow?delimiter=%2F&max-keys=1000&prefix=0%2Fada1cac674354dcd91eea9456d0d11b5%2Fartifacts%2Fmodels%2Fdata%2Fmodel%2F": 
dial tcp: lookup minio on 10.152.183.10:53: no such host
2022/11/28 19:26:36 DEBUG : Local file system at /mnt/models: Waiting for checks to finish
2022/11/28 19:26:36 ERROR : Attempt 1/3 failed with 1 errors and: RequestError: send request failed
caused by: Get "http://minio:9000/mlflow?delimiter=%2F&max-keys=1000&prefix=0%2Fada1cac674354dcd91eea9456d0d11b5%2Fartifacts%2Fmodels%2Fdata%2Fmodel%2F": dial tcp: lookup minio on 10.152.183.10:53: no such host

The endpoint is incorrect: http://minio:9000 carries no namespace, so the DNS lookup of minio fails outside the kubeflow namespace (see the "dial tcp: lookup minio ... no such host" error above). The namespace needs to be included in the endpoint URL.
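
A sketch of producing the corrected value for the secret (the namespace-qualified name follows from the DNS error above; the fix in the next comment uses http://minio.kubeflow:9000):

# Sketch: base64-encode a namespace-qualified endpoint for RCLONE_CONFIG_S3_ENDPOINT.
echo -n 'http://minio.kubeflow:9000' | base64
# -> aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=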

i-chvets commented 1 year ago

To verify:

  1. Deploy Kubeflow 1.6 per the quick start guide.

  2. Deploy the MLFlow operator from the branch:

    juju deploy --series=kubernetes ./mlflow-server_ubuntu-20.04-amd64.charm mlflow-server --resource "oci-image=quay.io/helix-ml/mlflow:1.13.1"
    juju relate minio mlflow-server
    juju relate istio-pilot mlflow-server
    juju relate mlflow-db mlflow-server
    juju relate mlflow-server admission-webhook
  3. Verify the secret:

    $ microk8s.kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials -o=yaml
    apiVersion: v1
    data:
      RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
      RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=
      RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
      RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
      RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: MkQyUVo3WkNETkxQOVRSNTFaWVhBVlZWSTFEMEcx
      RCLONE_CONFIG_S3_TYPE: czM=
    kind: Secret
    metadata:
      annotations:
        controller.juju.is/id: f15622b9-f8be-4d11-8469-334259ab7c74
        model.juju.is/id: cb958a34-3242-441e-8b00-024b4955bbe3
      creationTimestamp: "2022-12-01T21:51:55Z"
      labels:
        app.kubernetes.io/managed-by: juju
        app.kubernetes.io/name: mlflow-server
      name: mlflow-server-seldon-init-container-s3-credentials
      namespace: kubeflow
      resourceVersion: "10938"
      selfLink: /api/v1/namespaces/kubeflow/secrets/mlflow-server-seldon-init-container-s3-credentials
      uid: 4c550348-c97c-4f44-97c7-4ef83f231d3f
    type: Opaque

    The base64 value of RCLONE_CONFIG_S3_ENDPOINT (aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=) decodes to http://minio.kubeflow:9000, i.e. the endpoint now includes the namespace.

  4. Verify functionality. Created a Seldon secret based on the above and added it to the user's namespace:

    $ cat mlflow-server-seldon-init-container-s3-credentials.yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: mlflow-server-seldon-init-container-s3-credentials
      namespace: admin
    type: Opaque
    data:
      RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
      RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
      RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
      RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
      RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
      RCLONE_CONFIG_S3_TYPE: czM=
    $ microk8s.kubectl -n admin apply -f mlflow-server-seldon-init-container-s3-credentials.yaml

    Create/deploy a SeldonDeployment with modelUri pointing to a valid model in MLFlow.

    $ cat sample-seldon-deployment.yaml
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: mlflow
    spec:
      name: wines
      predictors:
      - componentSpecs:
        - spec:
            # We are setting a high failureThreshold as installing conda dependencies
            # can take a long time and we want to avoid k8s killing the container prematurely
            containers:
            - name: classifier
              image: seldonio/mlflowserver:1.14.1
              livenessProbe:
                initialDelaySeconds: 80
                failureThreshold: 200
                periodSeconds: 5
                successThreshold: 1
                httpGet:
                  path: /health/ping
                  port: http
                  scheme: HTTP
              readinessProbe:
                initialDelaySeconds: 80
                failureThreshold: 200
                periodSeconds: 5
                successThreshold: 1
                httpGet:
                  path: /health/ping
                  port: http
                  scheme: HTTP
        graph:
          children: []
          implementation: MLFLOW_SERVER
          modelUri: s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model
          envSecretRefName: mlflow-server-seldon-init-container-s3-credentials
          name: classifier
        name: default
        replicas: 1
    $ microk8s.kubectl -n admin apply -f sample-seldon-deployment.yaml

    The init container successfully accessed the S3 bucket using the above secret:

    2022/12/02 20:44:47 DEBUG : Creating backend with remote "s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model"
    2022/12/02 20:44:47 DEBUG : Setting type="s3" for "s3" from environment variable RCLONE_CONFIG_S3_TYPE
    2022/12/02 20:44:47 DEBUG : Setting provider="minio" for "s3" from environment variable RCLONE_CONFIG_S3_PROVIDER
    2022/12/02 20:44:47 DEBUG : Setting env_auth="false" for "s3" from environment variable RCLONE_CONFIG_S3_ENV_AUTH
    2022/12/02 20:44:47 DEBUG : Setting access_key_id="minio" for "s3" from environment variable RCLONE_CONFIG_S3_ACCESS_KEY_ID
    2022/12/02 20:44:47 DEBUG : Setting secret_access_key="X6LGENRTXYS0C3SE3NBUIUQFYYDMCH" for "s3" from environment variable RCLONE_CONFIG_S3_SECRET_ACCESS_KEY
    . . .
    2022/12/02 20:44:47 DEBUG : Creating backend with remote "/mnt/models"
    2022/12/02 20:44:47 DEBUG : Local file system at /mnt/models: Waiting for checks to finish
    2022/12/02 20:44:47 DEBUG : Local file system at /mnt/models: Waiting for transfers to finish
    2022/12/02 20:44:47 INFO  : 
    Transferred:              0 B / 0 B, -, 0 B/s, ETA -
    Elapsed time:         0.0s

    There was no model at that path to transfer, which is why zero bytes were transferred; however, access to S3 was successful.

i-chvets commented 1 year ago

Fix is merged: https://github.com/canonical/mlflow-operator/pull/58

i-chvets commented 1 year ago

This issue can be closed.