Closed: mattmbk closed this issue 4 years ago.
Interesting: the image we're using right now is gcr.io/kfserving/model-initializer, which is built as part of KFServing. So that's running https://github.com/kubeflow/kfserving/blob/f98acc5a90e530f6599753250572c7f6b1f385e2/python/storage-initializer/scripts/initializer-entrypoint, which calls https://github.com/kubeflow/kfserving/blob/bba20a6e54f92deb01c76c0c2cc3554a2ccc43f1/python/kfserving/kfserving/storage.py. Odd that the prefix specified in there is file://, while the example has it as pvc://. Maybe it's getting translated somewhere along the line. Or maybe it isn't (in which case that could be the issue).
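For what it's worth, if such a translation happens it would look something like the following. This is a purely hypothetical sketch; none of these names come from the actual Seldon/KFServing code:

```python
# Hypothetical sketch of a pvc:// -> file:// URI translation. The function
# and constant names here are illustrative, not the real KFServing/Seldon code.
PVC_PREFIX = "pvc://"
PVC_MOUNT_PATH = "/mnt/pvc"


def translate_pvc_uri(uri: str) -> str:
    """Rewrite pvc://<claim-name>/<path> to a file:// URI under the
    in-container mount point, assuming the PVC is mounted at PVC_MOUNT_PATH."""
    if not uri.startswith(PVC_PREFIX):
        return uri  # pass through non-PVC URIs unchanged
    # Drop the claim name: the claim itself is resolved by the volume mount,
    # so only the path inside the volume matters from here on.
    _, _, path = uri[len(PVC_PREFIX):].partition("/")
    return "file://" + PVC_MOUNT_PATH + "/" + path


print(translate_pvc_uri("pvc://seldon-mnist-tf-9jkg7-modelpvc/export"))
# -> file:///mnt/pvc/export
```

If the controller does something along these lines, the storage code only ever sees a file:// URI, which would explain the prefix in storage.py.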
I've done some further investigation and it seems that the issue is in the usage of the latest tag for gcr.io/kfserving/model-initializer
Inside the container using the latest tag the contents of /kfserving/kfserving/storage.py's _download_local method shows:
```python
@staticmethod
def _download_local(uri):
    local_path = uri.replace(_LOCAL_PREFIX, "", 1)
    if not os.path.exists(local_path):
        raise Exception("Local path %s does not exist." % (uri))
    return local_path
```
This is essentially a no-op: it validates that the path exists but never copies anything to the destination.
However, looking at the version of that file in master it seems to copy the contents of the local path (as expected).
I verified that using the v0.1.2 tag has the proper kfserving version and the initialize script actually copies the directory contents.
Interesting, it seems the image to use for the initializer is now configurable in KFServing. We need to pin each Seldon release to a specific version and make it configurable in Seldon too.
@mattmbk I'm working on making this configurable. Will let you know when ready.
@mattmbk In the latest snapshot the image defaults to 0.2.1 and can be configured via a configmap. Am now trying to test a pvc scenario.
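For anyone following along, configuring the initializer image via a ConfigMap would look something like the sketch below. The ConfigMap name, namespace, and key here are guesses for illustration; check the release notes of your Seldon version for the exact names:

```yaml
# Hypothetical sketch; the ConfigMap name/namespace/key may differ per release.
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-config
  namespace: seldon-system
data:
  storageInitializer: |-
    {
      "image": "gcr.io/kfserving/storage-initializer:0.2.1"
    }
```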
@mattmbk did you mean v0.1.2 and not 0.2.1? I notice the image has changed name from model-initializer in v0.1.2 to storage-initializer in 0.2.1 (the v tag prefix seems to have been dropped too).
I was able to get it working by creating a PVC called seldon-mnist-tf-9jkg7-modelpvc and then submitting this TFJob to create a TensorFlow model:
```json
{
  "apiVersion": "kubeflow.org/v1",
  "kind": "TFJob",
  "metadata": {
    "name": "mnist-train-test1",
    "namespace": "kubeflow"
  },
  "spec": {
    "tfReplicaSpecs": {
      "Worker": {
        "replicas": 1,
        "template": {
          "spec": {
            "containers": [
              {
                "image": "seldonio/deepmnistclassifier_trainer:0.3",
                "name": "tensorflow",
                "volumeMounts": [
                  {
                    "mountPath": "/data",
                    "name": "persistent-storage"
                  }
                ]
              }
            ],
            "restartPolicy": "OnFailure",
            "volumes": [
              {
                "name": "persistent-storage",
                "persistentVolumeClaim": {
                  "claimName": "seldon-mnist-tf-9jkg7-modelpvc"
                }
              }
            ]
          }
        },
        "tfReplicaType": "MASTER"
      }
    }
  }
}
```
Then I deployed it with this SeldonDeployment:
```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: tfserving
  namespace: kubeflow
spec:
  name: mnist
  predictors:
  - graph:
      children: []
      implementation: TENSORFLOW_SERVER
      modelUri: "pvc://seldon-mnist-tf-9jkg7-modelpvc/"
      name: mnist-model
      parameters:
      - name: signature_name
        type: STRING
        value: predict_images
      - name: model_name
        type: STRING
        value: mnist-model
    name: default
    replicas: 1
```
I then did a `kubectl exec -it -n kubeflow <podname> -c tfserving bash`, ran `ls -ltr /mnt/models`, and verified it listed the same files as `kubectl logs -n kubeflow <pod_name> tfserving-model-initializer`. This worked for me on both 0.2.1 and the older v0.1.2.
@ryandawsonuk Yes, apologies, I mean v0.1.2 as you said.
@ryandawsonuk Is the snapshot chart containing these changes published on https://storage.googleapis.com/seldon-charts ?
I fetched 0.5.1-SNAPSHOT and didn't see these changes in there yet. Thanks!
Oh, the image has been published but the chart might not have been. I installed from the source version of the chart in the GitHub repo.
OK thanks, I was able to update.
@ryandawsonuk were you able to query the model server?
I'm still seeing errors in the tfserving container in my pod -- it looks like the storage-initializer is symlinking the pvc contents, but the pvc isn't mounted in the tfserving container?
Am trying to check this but having some unrelated problems. Should be able to dig into it properly soon. Any further info you can offer on this is much appreciated.
Yeah, when I make the request I see:

```
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:267] No versions of servable mnist-model found under base path /mnt/models
```

and after doing a kubectl exec I see the files there as symlinks, but the /mnt/pvc path that they reference is not present.
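That symlink behaviour explains the empty-looking directory: a symlink whose target path does not exist inside the container resolves to nothing, even though the link itself is there. A quick illustration (the paths here are stand-ins for the ones in the Pod):

```python
import os
import tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "mnt-pvc", "model.pb")  # stands in for /mnt/pvc/...
link = os.path.join(d, "model.pb")               # stands in for /mnt/models/...

os.makedirs(os.path.dirname(target))
open(target, "w").close()
os.symlink(target, link)
assert os.path.exists(link)  # target present: the link resolves

# Simulate /mnt/pvc not being mounted in the serving container.
os.remove(target)
print(os.path.lexists(link), os.path.exists(link))
# The link still exists (lexists) but dangles (exists is False), so tools
# like TF Serving that follow symlinks see nothing under the base path.
```

So if the initializer symlinks instead of copying, the serving container must also mount the PVC at the same path for the links to resolve.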
I am seeing the same problem in the KFServing project that we based this on so I've raised the issue there.
With Seldon it is possible to define the full Pod specification, so the issue could be worked around that way. But we do want to get this working for the simpler TENSORFLOW_SERVER spec.
Inspecting the yaml of the Pod, the initContainer has:

```yaml
volumeMounts:
- mountPath: /mnt/pvc
  name: kfserving-pvc-source
  readOnly: true
- mountPath: /mnt/models
  name: tfserving-provision-location
```
Whereas the tfserving container has:

```yaml
volumeMounts:
- mountPath: /mnt/models
  name: tfserving-provision-location
  readOnly: true
```
This is presumably why it can't see the /mnt/pvc location. That mountPath is present in KFServing so I'll correct this in Seldon. Not entirely sure if it will resolve the issue (because I see the same error in KFServing) but it will take us a step forwards.
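Concretely, the correction would give the tfserving container the same PVC mount as the initContainer. A sketch, reusing the volume names from the Pod yaml:

```yaml
# Sketch of the corrected tfserving container mounts, reusing the volume
# names seen in the Pod yaml; exact output depends on the Seldon version.
volumeMounts:
- mountPath: /mnt/pvc
  name: kfserving-pvc-source
  readOnly: true
- mountPath: /mnt/models
  name: tfserving-provision-location
  readOnly: true
```

With /mnt/pvc mounted in both containers, any symlinks the initializer creates under /mnt/models would resolve inside the serving container.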
The volumeMount paths should now be correct, but I still get the error, as I do with KFServing. However, somebody there says they got the KFServing example working. That would likely work for a SeldonDeployment too, but unfortunately I can't currently replicate what they did.
Great, it seems like this is working now. Thanks for the support!
That’s good to hear, glad it’s working for you.
Hopefully we’ll be able to get an example into the docs for this.
I created a PVC named seldon-mnist-tfjob-m68v5-modelpvc. I then followed the KFServing example and ran a TFJob to build a model. It's a bit long:
```yaml
apiVersion: v1
data:
  batchSize: "100"
  exportDir: /mnt/export
  learningRate: "0.02"
  modelDir: /mnt
  name: tfjob-021
  pvcMountPath: /mnt
  pvcName: fengpvc
  trainSteps: "200"
kind: ConfigMap
metadata:
  name: mnist-map-training-4t25c985bg
  namespace: kubeflow
---
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-021
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.02"
            image: ryandawsonuk/kftfmodel:0.0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: seldon-mnist-tfjob-m68v5-modelpvc
    Ps:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.02"
            image: ryandawsonuk/kftfmodel:0.0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: seldon-mnist-tfjob-m68v5-modelpvc
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            - --tf-model-dir=$(modelDir)
            - --tf-export-dir=$(exportDir)
            - --tf-train-steps=$(trainSteps)
            - --tf-batch-size=$(batchSize)
            - --tf-learning-rate=$(learningRate)
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            - name: trainSteps
              value: "200"
            - name: batchSize
              value: "100"
            - name: learningRate
              value: "0.02"
            image: ryandawsonuk/kftfmodel:0.0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: seldon-mnist-tfjob-m68v5-modelpvc
```
I then served the resulting model using:
```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: tfserving
  namespace: kubeflow
spec:
  name: mnist
  predictors:
  - graph:
      children: []
      implementation: TENSORFLOW_SERVER
      modelUri: "pvc://seldon-mnist-tfjob-m68v5-modelpvc/export"
      name: mnist-model
      parameters:
      - name: signature_name
        type: STRING
        value: predict_images
      - name: model_name
        type: STRING
        value: mnist-model
    name: default
    replicas: 1
```
I think it might be important that it uses a path (/export).
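For context on why the path matters: TensorFlow Serving's "No versions of servable" error means the model base path lacks numeric version subdirectories (e.g. /mnt/models/1/saved_model.pb), which is the layout the export step produces under /export. A small sketch of the check it effectively performs (an illustration, not TF Serving's actual code):

```python
import os


def has_servable_versions(base_path):
    """Mimic TensorFlow Serving's expectation: the model base path must
    contain at least one numeric version subdirectory (e.g. 0001/)."""
    if not os.path.isdir(base_path):
        return False
    return any(
        name.isdigit() and os.path.isdir(os.path.join(base_path, name))
        for name in os.listdir(base_path)
    )
```

So pointing modelUri at the directory that contains the version folders, rather than at a version folder itself, is what makes TF Serving find the model.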
On k8s using seldon-core-operator:0.5.0 installed via helm.
I have a saved TensorFlow model on a PVC which I'd like to serve. Example yaml is here (truncated):

```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: tfserving
spec:
  name: mymodel
  predictors:
```
From the logs, it seems that the initcontainer is copying the model from the PVC to the tfserving-provision-location volume:

```
INFO:root:Initializing, args: src_uri [/mnt/pvc/mymodel/serving_model] dest_path [/mnt/models]
INFO:root:Copying contents of /mnt/pvc/mymodel/serving_model to local
```

But if I spawn a shell into the tfserving container, it is not there:

```
/# ls -la /mnt/models
total 8
drwxrwxrwx 2 root root 4096 Nov 14 16:00 .
drwxr-xr-x 1 root root 4096 Nov 14 16:00 ..
```
I have verified that the PVC is not empty (contains a directory with a model ID which in turn contains a saved_model.pb).
It seems that the init container is not copying the model directory/files at all?
I tried recreating the initcontainer and running the /model-initializer/scripts/initializer-entrypoint script, and it does not seem to copy my model.