canonical / bundle-kubeflow

Charmed Kubeflow

mlflow secret used for seldon is not working (seldon-init-container-secret) #429

Closed Barteus closed 1 year ago

Barteus commented 2 years ago

When seldon-init-container-secret is used to deploy a model from MinIO, the credentials passed to rclone in the init container are wrong, and the Pod goes into Init:CrashLoopBackOff.

Log from init container:

2022/02/09 08:52:58 NOTICE: Config file "/config/rclone/rclone.conf" not found - using defaults
2022/02/09 08:52:58 DEBUG : rclone: Version "v1.55.1" starting with parameters ["rclone" "copy" "-vv" "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model" "/mnt/models"]
2022/02/09 08:52:58 DEBUG : Creating backend with remote "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model"
2022/02/09 08:52:58 Failed to create file system for "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model": didn't find section in config file

Expected log from the init container (irrelevant sections removed):

2022/02/09 08:37:56 NOTICE: Config file "/config/rclone/rclone.conf" not found - using defaults
2022/02/09 08:37:56 DEBUG : rclone: Version "v1.55.1" starting with parameters ["rclone" "copy" "-vv" "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model" "/mnt/models"]
2022/02/09 08:37:56 DEBUG : Creating backend with remote "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model"
2022/02/09 08:37:56 DEBUG : s3: detected overridden config - adding "{jVufS}" suffix to name
2022/02/09 08:37:56 DEBUG : pacer: low level retry 1/10 (error RequestError: send request failed
caused by: Head "http://minio.kubeflow.svc.cluster.local:9000/mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model": EOF)
2022/02/09 08:37:56 DEBUG : pacer: Rate limited, increasing sleep to 10ms
...
2022/02/09 08:37:58 DEBUG : pacer: low level retry 10/10 (error RequestError: send request failed
caused by: Head "http://minio.kubeflow.svc.cluster.local:9000/mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model": EOF)
2022/02/09 08:37:58 DEBUG : fs cache: renaming cache item "s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model" to be canonical "s3{jVufS}:mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model"
2022/02/09 08:37:58 DEBUG : Creating backend with remote "/mnt/models"
2022/02/09 08:38:00 DEBUG : pacer: Reducing sleep to 1.5s
2022/02/09 08:38:00 DEBUG : Local file system at /mnt/models: Waiting for checks to finish
2022/02/09 08:38:00 DEBUG : Local file system at /mnt/models: Waiting for transfers to finish
2022/02/09 08:38:02 DEBUG : pacer: Reducing sleep to 1.125s
...
2022/02/09 08:38:06 DEBUG : pacer: Reducing sleep to 355.957031ms
2022/02/09 08:38:06 DEBUG : MLmodel: MD5 = a9bc2a512a382c222925645b10032c1c OK
2022/02/09 08:38:06 INFO  : MLmodel: Copied (new)
2022/02/09 08:38:07 DEBUG : pacer: Reducing sleep to 266.967773ms
2022/02/09 08:38:07 DEBUG : conda.yaml: MD5 = 8ff59fc0b665266cef1c86a60d92a006 OK
2022/02/09 08:38:07 INFO  : conda.yaml: Copied (new)
2022/02/09 08:38:07 DEBUG : pacer: Reducing sleep to 200.225829ms
2022/02/09 08:38:07 DEBUG : model.pkl: MD5 = 4423cef46a1eeb20736d88c980c11f3d OK
2022/02/09 08:38:07 INFO  : model.pkl: Copied (new)
2022/02/09 08:38:07 DEBUG : pacer: Reducing sleep to 150.169371ms
2022/02/09 08:38:07 DEBUG : requirements.txt: MD5 = 725d23405b4e11d989db76029151a90a OK
2022/02/09 08:38:07 INFO  : requirements.txt: Copied (new)
2022/02/09 08:38:07 INFO  : 
Transferred:        1.231k / 1.231 kBytes, 100%, 175 Bytes/s, ETA 0s
Transferred:            4 / 4, 100%
Elapsed time:        11.8s

2022/02/09 08:38:07 DEBUG : 5 go routines active

Working secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: bpk-seldon-init-container-secret
type: Opaque
stringData:
  RCLONE_CONFIG_S3_TYPE: s3
  RCLONE_CONFIG_S3_PROVIDER: minio
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: <key>
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: <secret>
  RCLONE_CONFIG_S3_ENDPOINT: http://minio.kubeflow.svc.cluster.local:9000
  RCLONE_CONFIG_S3_ENV_AUTH: "false"
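
For context, rclone builds remote definitions from environment variables of the form RCLONE_CONFIG_<REMOTE>_<OPTION>, so the variables above define a remote named "s3". Without the RCLONE_CONFIG_ prefix, rclone sees no such remote, which is exactly the "didn't find section in config file" failure above. A minimal sketch of the equivalent standalone config, with <key>/<secret> as placeholders:

# Sketch: the rclone.conf equivalent of the RCLONE_CONFIG_S3_* variables above.
cat > /tmp/rclone.conf <<'EOF'
[s3]
type = s3
provider = minio
access_key_id = <key>
secret_access_key = <secret>
endpoint = http://minio.kubeflow.svc.cluster.local:9000
env_auth = false
EOF
# Listing the bucket should now work (it fails with "didn't find section in
# config file" when the [s3] section is missing):
rclone --config /tmp/rclone.conf lsd s3:mlflow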

Deployment example.yaml:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        # We are setting a high failureThreshold as installing conda dependencies
        # can take a long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: classifier
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/0/a4958c23c9a74974bbd7c58a262a2c1e/artifacts/model
      envSecretRefName: bpk-seldon-init-container-secret
#      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

For more info about RCLONE auth configuration:

Barteus commented 2 years ago

The workaround is to manually apply a secret with the proper values to the namespace where inference will run.
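
One way to do that (a sketch, assuming jq is available and the user namespace is "admin"):

# Sketch: copy the mlflow-created secret into the namespace where the
# SeldonDeployment runs, stripping server-managed metadata first.
kubectl -n kubeflow get secret seldon-init-container-secret -o json \
  | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.selfLink, .metadata.annotations) | .metadata.namespace = "admin"' \
  | kubectl -n admin apply -f -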

Barteus commented 2 years ago

To replicate the issue:

  1. Run the notebook https://github.com/Barteus/kubeflow-examples/blob/main/seldon-mlflow-minio/mlflow-demo.ipynb and take the Minio path from the last cell's output.
  2. Apply the SeldonDeployment from this issue (but change the modelUri). This should fail when downloading the model from minio.

To fix it, apply the working secret above and your deployment should work.
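
If it still fails, the init container logs show what rclone actually did. A debugging sketch (the pod name is a placeholder; Seldon typically names the init container <container>-model-initializer, so classifier-model-initializer here, but verify with describe):

# Sketch: find the pod stuck in Init:CrashLoopBackOff and read its init logs.
kubectl -n admin get pods
kubectl -n admin describe pod <pod-name>   # lists init containers and events
kubectl -n admin logs <pod-name> -c classifier-model-initializer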

ca-scribner commented 2 years ago

@Barteus looking at this now. I see the issue raised but am not sure what your desired behaviour is from a fix. If someone has juju deploy'd mlflow and seldon, would you want:

  1. that they not need to specify an envSecretRefName at all and it "just works"
  2. that there be a secret like seldon-init-container-secret in each person's namespace that they can point to with envSecretRefName (and that that secret includes the RCLONE_ prefixes in the variable names)
  3. something else?

My guess is that you'd prefer (1) over (2), but (2) would still be good enough?

In the example you ran (sorry, end of day so I'll actually run it myself tomorrow), are you creating SeldonDeployments in kubeflow or in a user's namespace? My guess is in kubeflow, but I don't know if that's the appropriate test case. I'm not 100% sure from the code whether the secret is namespaced to the seldon controller or to the SeldonDeployment, but my guess would be the latter. Have you tested it that way? My expectation is that we need to put a secret with creds in every user's namespace, not just one in kubeflow.

Thoughts?

jardon commented 2 years ago

I did try to replicate this today but was unsuccessful. Having a Juju bundle would help reduce the time needed to replicate and confirm, so that more time can be spent actually fixing the issue.

Here are the steps that I took:

  1. Installed microk8s v1.21
  2. Deployed kubeflow-lite via juju
  3. Added all of the mlflow bits and related them appropriately
  4. Imported and executed the jupyter notebook (also had to update the minio config)
  5. Grabbed the model_uri and added it to the example.yaml
  6. Applied the example.yaml

ca-scribner commented 2 years ago

Summary of my understanding before I make any fixes

Current state

Any SeldonDeployment that uses a model stored in s3 must be provided with s3 credentials (endpoint, access/secret key, etc, as shown above) via a Secret, referenced in spec.predictors[].graph.envSecretRefName. This secret is formatted (as shown in the comments above) with the RCLONE_ prefix on everything (eg: RCLONE_CONFIG_S3_ACCESS_KEY_ID, etc) and must be in the same namespace as the SeldonDeployment.

At time of writing, our mlflow charm creates a secret seldon-init-container-secret with the s3 credentials but without the RCLONE_ prefix, and puts it in the same namespace mlflow is deployed to (likely the kubeflow model/namespace). Because it is in mlflow's namespace, it is not accessible to SeldonDeployments from users (secrets are not accessible across namespace boundaries). Even if it were in the correct namespace (or if a user copied the contents to their own namespace), the environment variables would still be missing the RCLONE_ prefix.
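
The namespace boundary is easy to confirm (a sketch; "admin" is an assumed user namespace):

# Sketch: the secret exists in mlflow's namespace but not in the user's.
kubectl -n kubeflow get secret seldon-init-container-secret   # found
kubectl -n admin get secret seldon-init-container-secret      # Error: not found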

Desired states

Some different desired states (they overlap a little and are easier to write as separate items, though ideally they would all be combined):

  1. Users of our MLOps offering should be able to easily commit a model to s3 (eg: via mlflow) and then deploy it via a SeldonDeployment without needing to know the details of their s3 config, how to format the secrets passed to seldon, etc. Ideally this works out of the box without fussing.
  2. Users should be able to commit/deploy from their own s3 storage (or, more likely, their own bucket within a shared s3 storage) using personal credentials instead of globally shared ones.

Some potentially useful things

Next steps

Near term

Achieving desired state (1) is tricky because we lack a way of putting secrets into all user namespaces. As an interim step, it is proposed that we update the seldon-init-container-secret generated by the mlflow charm to have the RCLONE_ prefixes and provide users with instructions for copying that secret. That at least makes things fairly easy for users.

Medium term

If we get a way to publish secrets to all namespaces (KF-220, upstream, or otherwise), we should:

Between these two things, users would then be able to deploy SeldonDeployments in their own namespaces with little effort.

Long term

Thought is needed to determine how we could implement something like the medium term solution but where every user's secret is populated with their own credentials. This might be similar to discussions about enabling full artifact isolation within kfp (eg: each user's kfp writes to a bucket of their own rather than to a global bucket). The challenges are similar between them.

ca-scribner commented 2 years ago

The Near Term solution is up for review in canonical/mlflow-operator#27, and a gist of how I tested it (following @Barteus's notebook pretty closely) is here. Note that when I tried to use that gist today, there was a package conflict between mlflow and whatever else is needed in the deployment. To get it to work I had to manually edit the requirements.txt and conda.yaml files in the minio store to include itsdangerous==2.0.1.

Update: I think the itsdangerous bug is the one described in canonical/seldon-core-operator#21; it is fixed upstream, but we need to update our seldon charm to use the newest images.

ca-scribner commented 2 years ago

Re the itsdangerous bug, you can use a more recent classifier image that has the fix. For example, you can do:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: seldonio/mlflowserver:1.14.0-dev  # <--whatever version you want.  1.14.0-dev worked for me
...

This overrides the default mlflowserver version.
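
To confirm the override took effect, something like the following should print the running image (a sketch; the seldon-deployment-id label and the namespace are assumptions):

# Sketch: print the image of the classifier container in the running pod.
kubectl -n admin get pods -l seldon-deployment-id=mlflow \
  -o jsonpath='{.items[*].spec.containers[?(@.name=="classifier")].image}'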

i-chvets commented 1 year ago

Jira

i-chvets commented 1 year ago

Deployed Kubeflow 1.6 and MLFlow. Retrieved the MLFlow Seldon secret:

$ microk8s.kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials -o=yaml
apiVersion: v1
data:
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
  RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
  RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
  RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
  RCLONE_CONFIG_S3_TYPE: czM=
kind: Secret
metadata:
  annotations:
    controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
    model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
  creationTimestamp: "2022-11-28T16:50:41Z"
  labels:
    app.kubernetes.io/managed-by: juju
    app.kubernetes.io/name: mlflow-server
  name: mlflow-server-seldon-init-container-s3-credentials
  namespace: kubeflow
  resourceVersion: "31415"
  selfLink: /api/v1/namespaces/kubeflow/secrets/mlflow-server-seldon-init-container-s3-credentials
  uid: 61784723-8d79-41e6-8ab0-e98cca836f16
type: Opaque
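
Decoding the endpoint shows the value that will matter later (a sketch):

# Sketch: decode the endpoint stored in the secret.
kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials \
  -o jsonpath='{.data.RCLONE_CONFIG_S3_ENDPOINT}' | base64 -d
# -> http://minio:9000 (no namespace, so it only resolves inside "kubeflow")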

Created a Seldon secret based on the above and added it to the user's namespace:

$ cat mlflow-server-seldon-init-container-s3-credentials.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-server-seldon-init-container-s3-credentials
  namespace: admin
type: Opaque
data:
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
  RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
  RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
  RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
  RCLONE_CONFIG_S3_TYPE: czM=
$ microk8s.kubectl -n admin apply -f mlflow-server-seldon-init-container-s3-credentials.yaml

Create/deploy a SeldonDeployment with modelUri pointing to a valid model in MLFlow.

$ cat sample-seldon-deployment.yaml 
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        # We are setting a high failureThreshold as installing conda dependencies
        # can take a long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: classifier
          image: seldonio/mlflowserver:1.14.1
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model
      envSecretRefName: mlflow-server-seldon-init-container-s3-credentials
      name: classifier
    name: default
    replicas: 1
$ microk8s.kubectl -n admin apply -f sample-seldon-deployment.yaml

At this point the solution fails. The following errors are observed in the classifier's init container:

2022/11/28 19:24:59 DEBUG : Setting endpoint="http://minio:9000" for "s3" from environment variable RCLONE_CONFIG_S3_ENDPOINT
2022/11/28 19:25:47 DEBUG : fs cache: renaming cache item "s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model" to be canonical "s3{EPcHk}:mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model"
2022/11/28 19:25:47 DEBUG : Creating backend with remote "/mnt/models"
2022/11/28 19:26:36 ERROR : S3 bucket mlflow path 0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model: error reading source root directory: RequestError: send request failed
caused by: Get "http://minio:9000/mlflow?delimiter=%2F&max-keys=1000&prefix=0%2Fada1cac674354dcd91eea9456d0d11b5%2Fartifacts%2Fmodels%2Fdata%2Fmodel%2F": 
dial tcp: lookup minio on 10.152.183.10:53: no such host
2022/11/28 19:26:36 DEBUG : Local file system at /mnt/models: Waiting for checks to finish
2022/11/28 19:26:36 ERROR : Attempt 1/3 failed with 1 errors and: RequestError: send request failed
caused by: Get "http://minio:9000/mlflow?delimiter=%2F&max-keys=1000&prefix=0%2Fada1cac674354dcd91eea9456d0d11b5%2Fartifacts%2Fmodels%2Fdata%2Fmodel%2F": dial tcp: lookup minio on 10.152.183.10:53: no such host

The endpoint is incorrect: http://minio:9000 carries no namespace, so the DNS lookup of minio fails outside the kubeflow namespace (see the "dial tcp: lookup minio ... no such host" error above). The namespace needs to be included in the endpoint URL.
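
A sketch of producing the corrected value for the secret (the namespace-qualified name follows from the DNS error above; the fix in the next comment uses http://minio.kubeflow:9000):

# Sketch: base64-encode a namespace-qualified endpoint for RCLONE_CONFIG_S3_ENDPOINT.
echo -n 'http://minio.kubeflow:9000' | base64
# -> aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=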

i-chvets commented 1 year ago

To verify:

  1. Deploy Kubeflow 1.6 per the quick start guide.

  2. Deploy the MLFlow operator from the branch:

    juju deploy --series=kubernetes ./mlflow-server_ubuntu-20.04-amd64.charm mlflow-server --resource "oci-image=quay.io/helix-ml/mlflow:1.13.1"
    juju relate minio mlflow-server
    juju relate istio-pilot mlflow-server
    juju relate mlflow-db mlflow-server
    juju relate mlflow-server admission-webhook
  3. Verify the secret:

    $ microk8s.kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials -o=yaml
    apiVersion: v1
    data:
      RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
      RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=
      RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
      RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
      RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: MkQyUVo3WkNETkxQOVRSNTFaWVhBVlZWSTFEMEcx
      RCLONE_CONFIG_S3_TYPE: czM=
    kind: Secret
    metadata:
      annotations:
        controller.juju.is/id: f15622b9-f8be-4d11-8469-334259ab7c74
        model.juju.is/id: cb958a34-3242-441e-8b00-024b4955bbe3
      creationTimestamp: "2022-12-01T21:51:55Z"
      labels:
        app.kubernetes.io/managed-by: juju
        app.kubernetes.io/name: mlflow-server
      name: mlflow-server-seldon-init-container-s3-credentials
      namespace: kubeflow
      resourceVersion: "10938"
      selfLink: /api/v1/namespaces/kubeflow/secrets/mlflow-server-seldon-init-container-s3-credentials
      uid: 4c550348-c97c-4f44-97c7-4ef83f231d3f
    type: Opaque

    The base64 value of RCLONE_CONFIG_S3_ENDPOINT (aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=) decodes to http://minio.kubeflow:9000, i.e. the endpoint now includes the namespace.

  4. Verify functionality. Created a Seldon secret based on the above and added it to the user's namespace:

    $ cat mlflow-server-seldon-init-container-s3-credentials.yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: mlflow-server-seldon-init-container-s3-credentials
      namespace: admin
    type: Opaque
    data:
      RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
      RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
      RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
      RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
      RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
      RCLONE_CONFIG_S3_TYPE: czM=
    $ microk8s.kubectl -n admin apply -f mlflow-server-seldon-init-container-s3-credentials.yaml

    Create/deploy a SeldonDeployment with modelUri pointing to a valid model in MLFlow.

    $ cat sample-seldon-deployment.yaml
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: mlflow
    spec:
      name: wines
      predictors:
      - componentSpecs:
        - spec:
            # We are setting a high failureThreshold as installing conda dependencies
            # can take a long time and we want to avoid k8s killing the container prematurely
            containers:
            - name: classifier
              image: seldonio/mlflowserver:1.14.1
              livenessProbe:
                initialDelaySeconds: 80
                failureThreshold: 200
                periodSeconds: 5
                successThreshold: 1
                httpGet:
                  path: /health/ping
                  port: http
                  scheme: HTTP
              readinessProbe:
                initialDelaySeconds: 80
                failureThreshold: 200
                periodSeconds: 5
                successThreshold: 1
                httpGet:
                  path: /health/ping
                  port: http
                  scheme: HTTP
        graph:
          children: []
          implementation: MLFLOW_SERVER
          modelUri: s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model
          envSecretRefName: mlflow-server-seldon-init-container-s3-credentials
          name: classifier
        name: default
        replicas: 1
    $ microk8s.kubectl -n admin apply -f sample-seldon-deployment.yaml

    The init container successfully accessed the S3 bucket using the above secret:

    2022/12/02 20:44:47 DEBUG : Creating backend with remote "s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model"
    2022/12/02 20:44:47 DEBUG : Setting type="s3" for "s3" from environment variable RCLONE_CONFIG_S3_TYPE
    2022/12/02 20:44:47 DEBUG : Setting provider="minio" for "s3" from environment variable RCLONE_CONFIG_S3_PROVIDER
    2022/12/02 20:44:47 DEBUG : Setting env_auth="false" for "s3" from environment variable RCLONE_CONFIG_S3_ENV_AUTH
    2022/12/02 20:44:47 DEBUG : Setting access_key_id="minio" for "s3" from environment variable RCLONE_CONFIG_S3_ACCESS_KEY_ID
    2022/12/02 20:44:47 DEBUG : Setting secret_access_key="X6LGENRTXYS0C3SE3NBUIUQFYYDMCH" for "s3" from environment variable RCLONE_CONFIG_S3_SECRET_ACCESS_KEY
    . . .
    2022/12/02 20:44:47 DEBUG : Creating backend with remote "/mnt/models"
    2022/12/02 20:44:47 DEBUG : Local file system at /mnt/models: Waiting for checks to finish
    2022/12/02 20:44:47 DEBUG : Local file system at /mnt/models: Waiting for transfers to finish
    2022/12/02 20:44:47 INFO  : 
    Transferred:              0 B / 0 B, -, 0 B/s, ETA -
    Elapsed time:         0.0s

    There was no model at that path to transfer, which is why zero bytes were transferred; however, access to S3 was successful.

i-chvets commented 1 year ago

Fix is merged: https://github.com/canonical/mlflow-operator/pull/58

i-chvets commented 1 year ago

This issue can be closed.