The workaround is to manually apply a secret with the proper values to the namespace the inference will be done in.
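For illustration, a minimal sketch of that workaround (the secret name matches the one the charm creates, as shown later in this thread; `jq` is assumed to be available): copy the secret from `kubeflow` into the inference namespace, stripping server-managed metadata so `kubectl apply` accepts it.

```shell
# Sketch: copy the charm-generated secret into the user namespace ("admin" here),
# dropping fields that are managed by the API server.
microk8s.kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials -o json \
  | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.selfLink, .metadata.namespace, .metadata.annotations)' \
  | microk8s.kubectl -n admin apply -f -
```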
To replicate the issue:
To fix it, apply the working secret above and your deployment should work.
@Barteus looking at this now. I see the issue raised but am not sure what your desired behaviour is from a fix. If someone has `juju deploy`'d mlflow and seldon, would you want:

- them to not need `envSecretRefName` at all and it "just works"
- a `seldon-init-container-secret` that exists in each person's namespace that they can point to with `envSecretRefName` (and that that secret include the RCLONE stuff in the variable names)

In the example that you ran (sorry, end of day so I'll actually run it myself tomorrow), are you creating `SeldonDeployment`s in `kubeflow` or in a user's namespace? My guess is in `kubeflow`, but I don't know if that's the appropriate test case. Not 100% sure from the code whether the secret is namespaced to the seldon controller or the `SeldonDeployment`, but my guess would be the latter. Have you tested it that way? My expectation is that we need to put a secret with creds in every user's namespace, not just one in `kubeflow`.
Thoughts?
I did try to replicate this today but was unsuccessful. Having a Juju bundle would help in general with reducing the time needed to replicate and confirm, so that more time can be spent actually fixing the issue.
Here are the steps that I took:
`example.yaml`
Summary of my understanding before I make any fixes:
Any `SeldonDeployment` that uses a model stored in s3 must be provided with s3 credentials (endpoint, access/secret key, etc, as shown above) via a `Secret` (pointed to in `SeldonDeployment.specs.predictors.componentSpecs.graph.envSecretRefName=secretName`). This secret is formatted (shown in above comments) with the `RCLONE` prefix on everything (eg: `RCLONE_CONFIG_S3_ACCESS_KEY_ID`, etc) and must be in the same namespace as the `SeldonDeployment`.
At the time of writing, our `mlflow` charm creates a secret `seldon-init-container-secret` with the s3 credentials but without the `RCLONE_` prefix, and puts it in the same namespace that `mlflow` is deployed to (likely the `kubeflow` model/namespace). Because it is in mlflow's namespace, it is not accessible to `SeldonDeployment`s from users (`Secret`s are not accessible across namespace boundaries). Even if it were in the correct namespace (or if a user made a copy of the contents in their own namespace), the environment variables would still be missing the `RCLONE_` prefix.
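For instance, a secret in the expected format could be created directly with `kubectl` (a sketch only; the values mirror the MinIO settings shown later in this thread, and `<user-namespace>`/`<secret-key>` are placeholders), letting `kubectl` handle the base64 encoding:

```shell
# Sketch: create an rclone-formatted secret that a SeldonDeployment can
# reference via envSecretRefName. <user-namespace> and <secret-key> are placeholders.
microk8s.kubectl -n <user-namespace> create secret generic seldon-init-container-secret \
  --from-literal=RCLONE_CONFIG_S3_TYPE=s3 \
  --from-literal=RCLONE_CONFIG_S3_PROVIDER=minio \
  --from-literal=RCLONE_CONFIG_S3_ENV_AUTH=false \
  --from-literal=RCLONE_CONFIG_S3_ACCESS_KEY_ID=minio \
  --from-literal=RCLONE_CONFIG_S3_SECRET_ACCESS_KEY=<secret-key> \
  --from-literal=RCLONE_CONFIG_S3_ENDPOINT=http://minio.kubeflow:9000
```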
Some different desired states (they overlap a little and are easier to write as separate things that would ideally all be combined):

1. Users can deploy a `SeldonDeployment` from s3 without needing to know details about their s3 config, how to format the secrets passed to seldon, etc. The ideal would be that it works out of the box without fussing.
2. There is existing discussion about getting `PodDefault`s into every user's namespace (KF-220). That could be extended to apply to `Secret`s as well. There is a similar discussion upstream in Kubeflow about adding a `ProfileResourceTemplate` CRD to the Profile Controller that would allow someone to define generic resources that should be deployed to all profile namespaces. That would be perfect for this, but nobody is actively developing it.
3. Seldon defaults to `envSecretRefName=seldon-init-container-secret`, meaning that any `SeldonDeployment` init-container will look for a secret called `seldon-init-container-secret` in the current namespace unless the user says otherwise (see the sketch below). That makes it easy for us to make things "just work", provided that we can populate the desired secret in every user's namespace.

Achieving desired state (1) is tricky because we lack a way of putting secrets into all user namespaces. As an interim step, it is proposed that we update the `seldon-init-container-secret` generated by the mlflow charm to have the `RCLONE_` prefixes and provide users with instructions for copying that secret. That at least makes things fairly easy for users.
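To illustrate desired state (3), a sketch of a `SeldonDeployment` that omits `envSecretRefName` entirely, relying on the default lookup of `seldon-init-container-secret` in its own namespace (assuming that secret already exists there; the `modelUri` is the one used elsewhere in this thread):

```yaml
# Sketch only: no envSecretRefName is set, so the init container falls back to
# the default secret name "seldon-init-container-secret" in this namespace.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  predictors:
  - name: default
    graph:
      name: classifier
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model
      children: []
```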
If we get a way to publish secrets to all namespaces (KF-220, upstream, or otherwise), we should use it to publish this secret, so that when `SeldonDeployment`s deploy models from s3, it is done independent of mlflow. Between these two things, users would then be able to deploy `SeldonDeployment`s in their own namespace with little effort.
Thought is needed to determine how we could implement something like the medium-term solution but where every user's `Secret` is populated with their own credentials. This might be similar to discussions about enabling full artifact isolation within kfp (eg: each user's kfp writes to a bucket of their own rather than to a global bucket). The challenges are similar between them.
The Near Term solution is up for review in canonical/mlflow-operator#27, and a gist of how I tested with it (following @Barteus's notebook pretty closely) is here. Note that when I tried to use that gist today, there was a package conflict between mlflow and whatever else is needed in the deployment. To get it to work I had to manually edit the `requirements.txt` and `conda.yaml` files in the minio store to include `itsdangerous==2.0.1`.
Update: I think the `itsdangerous` bug is the one described in canonical/seldon-core-operator#21, and is fixed upstream, but we need to update our seldon charm to use the newest images.
Re the `itsdangerous` bug, you can use a more recent image for your classifier with this patched. For example, you can do:
```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: seldonio/mlflowserver:1.14.0-dev  # <-- whatever version you want. 1.14.0-dev worked for me
  ...
```
This overrides the default mlflowserver version.
Deployed Kubeflow 1.6 and MLFlow. Retrieved the MLFlow Seldon secret:
```
$ microk8s.kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials -o=yaml
apiVersion: v1
data:
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
  RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
  RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
  RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
  RCLONE_CONFIG_S3_TYPE: czM=
kind: Secret
metadata:
  annotations:
    controller.juju.is/id: f957d721-f53d-41b6-8ef1-662083ae049e
    model.juju.is/id: d0348bd4-17ab-4e48-84da-b30afeafdfc5
  creationTimestamp: "2022-11-28T16:50:41Z"
  labels:
    app.kubernetes.io/managed-by: juju
    app.kubernetes.io/name: mlflow-server
  name: mlflow-server-seldon-init-container-s3-credentials
  namespace: kubeflow
  resourceVersion: "31415"
  selfLink: /api/v1/namespaces/kubeflow/secrets/mlflow-server-seldon-init-container-s3-credentials
  uid: 61784723-8d79-41e6-8ab0-e98cca836f16
type: Opaque
```
Created a Seldon secret based on the above and added it to the user's namespace:
```
$ cat mlflow-server-seldon-init-container-s3-credentials.yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-server-seldon-init-container-s3-credentials
  namespace: admin
type: Opaque
data:
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
  RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
  RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
  RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
  RCLONE_CONFIG_S3_TYPE: czM=

$ microk8s.kubectl -n admin apply -f mlflow-server-seldon-init-container-s3-credentials.yaml
```
Created/deployed a Seldon deployment with `modelUri` pointing to a valid model in MLFlow.
```
$ cat sample-seldon-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        # We are setting a high failureThreshold as installing conda dependencies
        # can take a long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: classifier
          image: seldonio/mlflowserver:1.14.1
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model
      envSecretRefName: mlflow-server-seldon-init-container-s3-credentials
      name: classifier
    name: default
    replicas: 1

$ microk8s.kubectl -n admin apply -f sample-seldon-deployment.yaml
```
At this point the solution fails. The following errors are observed in the classifier initialisation container:
```
2022/11/28 19:24:59 DEBUG : Setting endpoint="http://minio:9000" for "s3" from environment variable RCLONE_CONFIG_S3_ENDPOINT
2022/11/28 19:25:47 DEBUG : fs cache: renaming cache item "s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model" to be canonical "s3{EPcHk}:mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model"
2022/11/28 19:25:47 DEBUG : Creating backend with remote "/mnt/models"
2022/11/28 19:26:36 ERROR : S3 bucket mlflow path 0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model: error reading source root directory: RequestError: send request failed
caused by: Get "http://minio:9000/mlflow?delimiter=%2F&max-keys=1000&prefix=0%2Fada1cac674354dcd91eea9456d0d11b5%2Fartifacts%2Fmodels%2Fdata%2Fmodel%2F":
dial tcp: lookup minio on 10.152.183.10:53: no such host
2022/11/28 19:26:36 DEBUG : Local file system at /mnt/models: Waiting for checks to finish
2022/11/28 19:26:36 ERROR : Attempt 1/3 failed with 1 errors and: RequestError: send request failed
caused by: Get "http://minio:9000/mlflow?delimiter=%2F&max-keys=1000&prefix=0%2Fada1cac674354dcd91eea9456d0d11b5%2Fartifacts%2Fmodels%2Fdata%2Fmodel%2F": dial tcp: lookup minio on 10.152.183.10:53: no such host
```
The endpoint is incorrectly encoded: the namespace needs to be included in the encoded URL (i.e. `http://minio.kubeflow:9000` rather than `http://minio:9000`), otherwise the DNS lookup of `minio` fails from outside the `kubeflow` namespace.
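A quick way to see the difference (the encoded values below match the secrets shown in this thread):

```shell
# Decode the endpoint currently in the secret: no namespace suffix, so the DNS
# lookup of "minio" fails from any namespace other than kubeflow.
echo "aHR0cDovL21pbmlvOjkwMDA=" | base64 -d       # -> http://minio:9000
# Encode the corrected, namespaced endpoint for the secret's data field.
echo -n "http://minio.kubeflow:9000" | base64     # -> aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=
```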
To verify:
Deploy Kubeflow 1.6 per the Quick start guide.
Deploy the MLFlow operator from the branch:
```shell
juju deploy --series=kubernetes ./mlflow-server_ubuntu-20.04-amd64.charm mlflow-server --resource "oci-image=quay.io/helix-ml/mlflow:1.13.1"
juju relate minio mlflow-server
juju relate istio-pilot mlflow-server
juju relate mlflow-db mlflow-server
juju relate mlflow-server admission-webhook
```
Verify the secret:
```
$ microk8s.kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials -o=yaml
apiVersion: v1
data:
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
  RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=
  RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
  RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: MkQyUVo3WkNETkxQOVRSNTFaWVhBVlZWSTFEMEcx
  RCLONE_CONFIG_S3_TYPE: czM=
kind: Secret
metadata:
  annotations:
    controller.juju.is/id: f15622b9-f8be-4d11-8469-334259ab7c74
    model.juju.is/id: cb958a34-3242-441e-8b00-024b4955bbe3
  creationTimestamp: "2022-12-01T21:51:55Z"
  labels:
    app.kubernetes.io/managed-by: juju
    app.kubernetes.io/name: mlflow-server
  name: mlflow-server-seldon-init-container-s3-credentials
  namespace: kubeflow
  resourceVersion: "10938"
  selfLink: /api/v1/namespaces/kubeflow/secrets/mlflow-server-seldon-init-container-s3-credentials
  uid: 4c550348-c97c-4f44-97c7-4ef83f231d3f
type: Opaque
```
The decoded base64 `RCLONE_CONFIG_S3_ENDPOINT` value `aHR0cDovL21pbmlvLmt1YmVmbG93OjkwMDA=` is `http://minio.kubeflow:9000`.
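One way to check this directly (a sketch; `base64 -d` assumes GNU coreutils):

```shell
# Pull just the endpoint field out of the secret and decode it.
microk8s.kubectl -n kubeflow get secret mlflow-server-seldon-init-container-s3-credentials \
  -o jsonpath='{.data.RCLONE_CONFIG_S3_ENDPOINT}' | base64 -d
# -> http://minio.kubeflow:9000
```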
Verify functionality. Created a Seldon secret based on the above and added it to the user's namespace:
```
$ cat mlflow-server-seldon-init-container-s3-credentials.yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-server-seldon-init-container-s3-credentials
  namespace: admin
type: Opaque
data:
  RCLONE_CONFIG_S3_ACCESS_KEY_ID: bWluaW8=
  RCLONE_CONFIG_S3_ENDPOINT: aHR0cDovL21pbmlvOjkwMDA=
  RCLONE_CONFIG_S3_ENV_AUTH: ZmFsc2U=
  RCLONE_CONFIG_S3_PROVIDER: bWluaW8=
  RCLONE_CONFIG_S3_SECRET_ACCESS_KEY: N0VSRlpaNTdKNzlYRUhQN0M2S0xJM1laN0s1VzVZ
  RCLONE_CONFIG_S3_TYPE: czM=

$ microk8s.kubectl -n admin apply -f mlflow-server-seldon-init-container-s3-credentials.yaml
```
Created/deployed a Seldon deployment with `modelUri` pointing to a valid model in MLFlow.
```
$ cat sample-seldon-deployment.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - componentSpecs:
    - spec:
        # We are setting a high failureThreshold as installing conda dependencies
        # can take a long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: classifier
          image: seldonio/mlflowserver:1.14.1
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model
      envSecretRefName: mlflow-server-seldon-init-container-s3-credentials
      name: classifier
    name: default
    replicas: 1

$ microk8s.kubectl -n admin apply -f sample-seldon-deployment.yaml
```
The initialisation container successfully accessed the S3 bucket using the above secret:
```
2022/12/02 20:44:47 DEBUG : Creating backend with remote "s3://mlflow/0/ada1cac674354dcd91eea9456d0d11b5/artifacts/models/data/model"
2022/12/02 20:44:47 DEBUG : Setting type="s3" for "s3" from environment variable RCLONE_CONFIG_S3_TYPE
2022/12/02 20:44:47 DEBUG : Setting provider="minio" for "s3" from environment variable RCLONE_CONFIG_S3_PROVIDER
2022/12/02 20:44:47 DEBUG : Setting env_auth="false" for "s3" from environment variable RCLONE_CONFIG_S3_ENV_AUTH
2022/12/02 20:44:47 DEBUG : Setting access_key_id="minio" for "s3" from environment variable RCLONE_CONFIG_S3_ACCESS_KEY_ID
2022/12/02 20:44:47 DEBUG : Setting secret_access_key="X6LGENRTXYS0C3SE3NBUIUQFYYDMCH" for "s3" from environment variable RCLONE_CONFIG_S3_SECRET_ACCESS_KEY
. . .
2022/12/02 20:44:47 DEBUG : Creating backend with remote "/mnt/models"
2022/12/02 20:44:47 DEBUG : Local file system at /mnt/models: Waiting for checks to finish
2022/12/02 20:44:47 DEBUG : Local file system at /mnt/models: Waiting for transfers to finish
2022/12/02 20:44:47 INFO :
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Elapsed time: 0.0s
```
There was no model to transfer, which is why zero bytes were transferred; however, access to S3 is successful.
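As an optional follow-up (not part of the original verification), a hedged sketch of a prediction smoke test; the ingress host/port depend on the local Istio setup, and the eleven input values are placeholders for the wine-quality feature vector:

```shell
# Hypothetical smoke test against the deployed model. The URL follows Seldon's
# REST convention /seldon/<namespace>/<deployment>/api/v1.0/predictions;
# <ingress-host> is whatever host:port the Istio ingress gateway exposes.
curl -s -X POST "http://<ingress-host>/seldon/admin/mlflow/api/v1.0/predictions" \
  -H "Content-Type: application/json" \
  -d '{"data": {"ndarray": [[7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]]}}'
```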
Fix is merged: https://github.com/canonical/mlflow-operator/pull/58
This issue can be closed.
When `seldon-init-container-secret` is used to deploy the model from minio, the credentials passed to rclone in the init container are wrong, and the Pod goes into the `Init:CrashLoopBackOff` status.
Log from init container:
Expected log from init container (irrelevant sections removed):
Working secret.yaml:
Deployment example.yaml:
For more info about RCLONE auth configuration:
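For reference, a sketch of rclone's environment-variable convention (the same mechanism visible in the init-container logs above): `RCLONE_CONFIG_<REMOTE>_<OPTION>` defines a remote without any `rclone.conf`. The values mirror the MinIO settings used in this thread; the secret key is a placeholder.

```shell
# Sketch: these variables define a remote named "s3" purely via the environment.
export RCLONE_CONFIG_S3_TYPE=s3
export RCLONE_CONFIG_S3_PROVIDER=minio
export RCLONE_CONFIG_S3_ENV_AUTH=false
export RCLONE_CONFIG_S3_ACCESS_KEY_ID=minio
export RCLONE_CONFIG_S3_SECRET_ACCESS_KEY=<secret-key>
export RCLONE_CONFIG_S3_ENDPOINT=http://minio.kubeflow:9000
rclone lsd s3:   # should list the buckets (e.g. "mlflow") if auth is correct
```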