SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

Unable to mount model from PVC into tf serving prepackaged model server #1106

Closed mattmbk closed 4 years ago

mattmbk commented 4 years ago

On k8s using seldon-core-operator:0.5.0 installed via helm.

I have a saved TensorFlow model on a PVC which I'd like to serve. Example yaml is here:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: tfserving
spec:
  name: mymodel
  predictors:

From the logs, it seems that the init container is copying the model from the PVC to the tfserving-provision-location volume:

INFO:root:Initializing, args: src_uri [/mnt/pvc/mymodel/serving_model] dest_path[ [/mnt/models]
INFO:root:Copying contents of /mnt/pvc/mymodel/serving_model to local

But if I spawn a shell in the tfserving container, the model is not there:

/# ls -la /mnt/models
total 8
drwxrwxrwx 2 root root 4096 Nov 14 16:00 .
drwxr-xr-x 1 root root 4096 Nov 14 16:00 ..

I have verified that the PVC is not empty (contains a directory with a model ID which in turn contains a saved_model.pb).

It seems that the init container is not copying the model directory/files at all?

I tried recreating the initcontainer and running the "/model-initializer/scripts/initializer-entrypoint" script and it does not seem to copy my model.

ryandawsonuk commented 4 years ago

Interesting. The image we're using right now is gcr.io/kfserving/model-initializer, which is built as part of kfserving. So that's running https://github.com/kubeflow/kfserving/blob/f98acc5a90e530f6599753250572c7f6b1f385e2/python/storage-initializer/scripts/initializer-entrypoint which calls https://github.com/kubeflow/kfserving/blob/bba20a6e54f92deb01c76c0c2cc3554a2ccc43f1/python/kfserving/kfserving/storage.py. It's odd that the prefix specified in there is file://, whereas the example has it as pvc://. Maybe it's getting translated somewhere along the line. Or maybe it isn't, in which case that could be the issue.

mattmbk commented 4 years ago

I've done some further investigation and it seems that the issue is the use of the latest tag for gcr.io/kfserving/model-initializer.

Inside the container using the latest tag, the _download_local method in /kfserving/kfserving/storage.py shows:

@staticmethod
def _download_local(uri):
    local_path = uri.replace(_LOCAL_PREFIX, "", 1)
    if not os.path.exists(local_path):
        raise Exception("Local path %s does not exist." % (uri))
    return local_path

This seems to be essentially a no-op.

However, looking at the version of that file in master it seems to copy the contents of the local path (as expected).

I verified that using the v0.1.2 tag has the proper kfserving version and the initialize script actually copies the directory contents.
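To make the difference between the two behaviours concrete, here is a minimal sketch. It is not the kfserving source: `resolve_only` mirrors the method quoted above, while `resolve_and_copy`, `demo`, and the temporary directory layout are hypothetical names introduced purely for illustration, and `_LOCAL_PREFIX` is assumed to be `file://` as mentioned earlier in the thread.

```python
import os
import shutil
import tempfile

_LOCAL_PREFIX = "file://"  # assumed value, per the earlier comment in this thread


def resolve_only(uri):
    """Mirror of the quoted method: validate the path and return it unchanged."""
    local_path = uri.replace(_LOCAL_PREFIX, "", 1)
    if not os.path.exists(local_path):
        raise Exception("Local path %s does not exist." % uri)
    return local_path


def resolve_and_copy(uri, out_dir):
    """Hypothetical variant that also copies the directory contents to out_dir."""
    local_path = resolve_only(uri)
    os.makedirs(out_dir, exist_ok=True)
    for entry in os.listdir(local_path):
        src = os.path.join(local_path, entry)
        dst = os.path.join(out_dir, entry)
        if os.path.isdir(src):
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)
    return out_dir


def demo():
    """Build a fake model directory and compare what each variant leaves behind."""
    root = tempfile.mkdtemp()
    model_src = os.path.join(root, "pvc", "mymodel")
    os.makedirs(os.path.join(model_src, "1"))
    with open(os.path.join(model_src, "1", "saved_model.pb"), "w") as handle:
        handle.write("placeholder")

    dest_a = os.path.join(root, "models_a")
    dest_b = os.path.join(root, "models_b")
    os.makedirs(dest_a)

    resolve_only(_LOCAL_PREFIX + model_src)  # nothing is written to dest_a
    resolve_and_copy(_LOCAL_PREFIX + model_src, dest_b)
    return os.listdir(dest_a), sorted(os.listdir(dest_b))
```

With the first variant the destination directory stays empty, which matches the empty /mnt/models listing reported at the top of the issue.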

ryandawsonuk commented 4 years ago

Interesting, it seems the image to use for the initializer is now configurable in KFServing. We need to pin each Seldon release to a specific version and make it configurable in Seldon too.

ryandawsonuk commented 4 years ago

@mattmbk I'm working on making this configurable. Will let you know when ready.

ryandawsonuk commented 4 years ago

@mattmbk In the latest snapshot the image defaults to 0.2.1 and can be configured via a configmap. Am now trying to test a pvc scenario.

ryandawsonuk commented 4 years ago

@mattmbk did you mean v0.1.2 and not 0.2.1? I notice the image has changed name from model-initializer in v0.1.2 to storage-initializer in 0.2.1 (the v tag prefix seems to have been dropped too)

ryandawsonuk commented 4 years ago

I was able to get it working by creating a PVC called seldon-mnist-tf-9jkg7-modelpvc and then submitting this TFJob to create a TensorFlow model:

{
  "apiVersion": "kubeflow.org/v1",
  "kind": "TFJob",
  "metadata": {
    "name": "mnist-train-test1",
    "namespace": "kubeflow"
  },
  "spec": {
    "tfReplicaSpecs": {
      "Worker": {
        "replicas": 1,
        "template": {
          "spec": {
            "containers": [
              {
                "image": "seldonio/deepmnistclassifier_trainer:0.3",
                "name": "tensorflow",
                "volumeMounts": [
                  {
                    "mountPath": "/data",
                    "name": "persistent-storage"
                  }
                ]
              }
            ],
            "restartPolicy": "OnFailure",
            "volumes": [
              {
                "name": "persistent-storage",
                "persistentVolumeClaim": {
                  "claimName": "seldon-mnist-tf-9jkg7-modelpvc"
                }
              }
            ]
          }
        },
        "tfReplicaType": "MASTER"
      }
    }
  }
}

Then I ran this in a SeldonDeployment with:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: tfserving
  namespace: kubeflow
spec:
  name: mnist
  predictors:
    - graph:
        children: []
        implementation: TENSORFLOW_SERVER
        modelUri: "pvc://seldon-mnist-tf-9jkg7-modelpvc/"
        name: mnist-model
        parameters:
          - name: signature_name
            type: STRING
            value: predict_images
          - name: model_name
            type: STRING
            value: mnist-model
      name: default
      replicas: 1

I then did a kubectl exec -it -n kubeflow <podname> -c tfserving bash and ran ls -ltr /mnt/models and verified it was the same files listed in kubectl logs -n kubeflow <pod_name> tfserving-model-initializer. This worked for me on both 0.2.1 and the older v0.1.2
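As a rough illustration of how a pvc:// modelUri appears to relate to the source path seen in the initializer logs, the sketch below splits the URI into a claim name and a path under the mount point. The function name `split_pvc_uri` is hypothetical, and the mapping onto /mnt/pvc is an assumption inferred from the logs earlier in this thread rather than taken from the Seldon or KFServing source.

```python
from urllib.parse import urlparse

PVC_MOUNT = "/mnt/pvc"  # assumed mount point, as seen in the initializer logs


def split_pvc_uri(model_uri):
    """Return (claim_name, source_path) for a pvc://<claim>/<path> URI."""
    parsed = urlparse(model_uri)
    if parsed.scheme != "pvc" or not parsed.netloc:
        raise ValueError("not a pvc:// URI: %r" % model_uri)
    sub_path = parsed.path.strip("/")
    # An empty sub-path means the root of the claim is used as the model source.
    source = PVC_MOUNT if not sub_path else PVC_MOUNT + "/" + sub_path
    return parsed.netloc, source
```

For the deployment above, the claim name would be seldon-mnist-tf-9jkg7-modelpvc and the source would be the root of the mounted claim.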

mattmbk commented 4 years ago

@ryandawsonuk Yes, apologies, I mean v0.1.2 as you said.

mattmbk commented 4 years ago

@ryandawsonuk Is the snapshot chart containing these changes published on https://storage.googleapis.com/seldon-charts ?

I fetched 0.5.1-SNAPSHOT and didn't see these changes in there yet. Thanks!

ryandawsonuk commented 4 years ago

Oh, the image has been published but the chart might not have been. I installed from the source version of the chart in the GitHub repo.

mattmbk commented 4 years ago

OK thanks, I was able to update.

@ryandawsonuk were you able to query the model server?

I'm still seeing errors in the tfserving container in my pod -- it looks like the storage-initializer is symlinking the pvc contents, but the pvc isn't mounted in the tfserving container?

ryandawsonuk commented 4 years ago

Am trying to check this but having some unrelated problems. Should be able to dig into it properly soon. Any further info you can offer on this is much appreciated.

ryandawsonuk commented 4 years ago

Yeah, when I make the request I see tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:267] No versions of servable mnist-model found under base path /mnt/models and, after doing a kubectl exec, I see the files there as symlinks.

But the /mnt/pvc path that they reference is not present.

I am seeing the same problem in the KFServing project that we based this on so I've raised the issue there.
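The symptom described above (symlinks that exist but point at a path the container cannot see) can be reproduced locally with a small sketch. The helper names `link_contents`, `dangling_links`, and `demo` are hypothetical, and the temporary directories merely stand in for /mnt/pvc and /mnt/models; removing the source directory is used here as a stand-in for the PVC not being mounted.

```python
import os
import shutil
import tempfile


def link_contents(src_dir, dest_dir):
    """Symlink every entry of src_dir into dest_dir; no data is copied."""
    os.makedirs(dest_dir, exist_ok=True)
    for entry in os.listdir(src_dir):
        os.symlink(os.path.join(src_dir, entry), os.path.join(dest_dir, entry))


def dangling_links(directory):
    """Return names in directory that are symlinks whose targets are missing."""
    result = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # os.path.exists follows symlinks, so it is False for a broken link.
        if os.path.islink(path) and not os.path.exists(path):
            result.append(name)
    return sorted(result)


def demo():
    root = tempfile.mkdtemp()
    pvc = os.path.join(root, "pvc")        # stands in for /mnt/pvc
    models = os.path.join(root, "models")  # stands in for /mnt/models
    os.makedirs(os.path.join(pvc, "1"))

    link_contents(pvc, models)
    before = dangling_links(models)

    shutil.rmtree(pvc)  # simulate a container where the PVC is not mounted
    after = dangling_links(models)
    return before, after
```

The links are fine while the source is present and become dangling once it disappears, so a process reading the destination directory would find entries it cannot open.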

With Seldon it is possible to define the full Pod specification, so the issue could be worked around that way. But we do want to get this working for the simpler TENSORFLOW_SERVER spec.

ryandawsonuk commented 4 years ago

Inspecting the yaml of the Pod, the initContainer has:

    volumeMounts:
    - mountPath: /mnt/pvc
      name: kfserving-pvc-source
      readOnly: true
    - mountPath: /mnt/models
      name: tfserving-provision-location

Whereas the tfserving container has:

    volumeMounts:
    - mountPath: /mnt/models
      name: tfserving-provision-location
      readOnly: true

This is presumably why it can't see the /mnt/pvc location. That mountPath is present in KFServing so I'll correct this in Seldon. Not entirely sure if it will resolve the issue (because I see the same error in KFServing) but it will take us a step forwards.
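The comparison of the two volumeMounts lists can be expressed as a small check. This is only a sketch over plain dictionaries copied from the Pod yaml above; `missing_mount_paths` is a hypothetical helper, not part of Seldon or of any Kubernetes client library.

```python
def missing_mount_paths(init_mounts, main_mounts):
    """Return mount paths the init container has but the main container lacks."""
    init_paths = {mount["mountPath"] for mount in init_mounts}
    main_paths = {mount["mountPath"] for mount in main_mounts}
    return sorted(init_paths - main_paths)


# volumeMounts transcribed from the Pod yaml quoted in this comment
init_container_mounts = [
    {"mountPath": "/mnt/pvc", "name": "kfserving-pvc-source", "readOnly": True},
    {"mountPath": "/mnt/models", "name": "tfserving-provision-location"},
]
tfserving_mounts = [
    {"mountPath": "/mnt/models", "name": "tfserving-provision-location", "readOnly": True},
]

missing = missing_mount_paths(init_container_mounts, tfserving_mounts)
```

Here `missing` contains only /mnt/pvc, the path that the symlinks in /mnt/models point into.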

ryandawsonuk commented 4 years ago

The volumeMount paths should now be correct, but I still get the error, as I do with KFServing. Somebody there says they got the KFServing example working, though. That would be likely to work for a SeldonDeployment too, but unfortunately I can't currently replicate what they did.

mattmbk commented 4 years ago

Great, it seems like this is working now. Thanks for the support!

ryandawsonuk commented 4 years ago

That’s good to hear, glad it’s working for you.

Hopefully we’ll be able to get an example into the docs for this.

ryandawsonuk commented 4 years ago

I created a pvc named seldon-mnist-tfjob-m68v5-modelpvc. I then followed the KFServing example and ran a TFJob to build a model. It's a bit long:

apiVersion: v1
data:
  batchSize: "100"
  exportDir: /mnt/export
  learningRate: "0.02"
  modelDir: /mnt
  name: tfjob-021
  pvcMountPath: /mnt
  pvcName: fengpvc
  trainSteps: "200"
kind: ConfigMap
metadata:
  name: mnist-map-training-4t25c985bg
  namespace: kubeflow
---
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-021
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - /usr/bin/python
                - /opt/model.py
                - --tf-model-dir=$(modelDir)
                - --tf-export-dir=$(exportDir)
                - --tf-train-steps=$(trainSteps)
                - --tf-batch-size=$(batchSize)
                - --tf-learning-rate=$(learningRate)
              env:
                - name: modelDir
                  value: /mnt
                - name: exportDir
                  value: /mnt/export
                - name: trainSteps
                  value: "200"
                - name: batchSize
                  value: "100"
                - name: learningRate
                  value: "0.02"
              image: ryandawsonuk/kftfmodel:0.0.1
              name: tensorflow
              volumeMounts:
                - mountPath: /mnt
                  name: local-storage
              workingDir: /opt
          restartPolicy: OnFailure
          volumes:
            - name: local-storage
              persistentVolumeClaim:
                claimName: seldon-mnist-tfjob-m68v5-modelpvc
    Ps:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - /usr/bin/python
                - /opt/model.py
                - --tf-model-dir=$(modelDir)
                - --tf-export-dir=$(exportDir)
                - --tf-train-steps=$(trainSteps)
                - --tf-batch-size=$(batchSize)
                - --tf-learning-rate=$(learningRate)
              env:
                - name: modelDir
                  value: /mnt
                - name: exportDir
                  value: /mnt/export
                - name: trainSteps
                  value: "200"
                - name: batchSize
                  value: "100"
                - name: learningRate
                  value: "0.02"
              image: ryandawsonuk/kftfmodel:0.0.1
              name: tensorflow
              volumeMounts:
                - mountPath: /mnt
                  name: local-storage
              workingDir: /opt
          restartPolicy: OnFailure
          volumes:
            - name: local-storage
              persistentVolumeClaim:
                claimName: seldon-mnist-tfjob-m68v5-modelpvc
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - command:
                - /usr/bin/python
                - /opt/model.py
                - --tf-model-dir=$(modelDir)
                - --tf-export-dir=$(exportDir)
                - --tf-train-steps=$(trainSteps)
                - --tf-batch-size=$(batchSize)
                - --tf-learning-rate=$(learningRate)
              env:
                - name: modelDir
                  value: /mnt
                - name: exportDir
                  value: /mnt/export
                - name: trainSteps
                  value: "200"
                - name: batchSize
                  value: "100"
                - name: learningRate
                  value: "0.02"
              image: ryandawsonuk/kftfmodel:0.0.1
              name: tensorflow
              volumeMounts:
                - mountPath: /mnt
                  name: local-storage
              workingDir: /opt
          restartPolicy: OnFailure
          volumes:
            - name: local-storage
              persistentVolumeClaim:
                claimName: seldon-mnist-tfjob-m68v5-modelpvc

I then ran this using:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: tfserving
  namespace: kubeflow
spec:
  name: mnist
  predictors:
    - graph:
        children: []
        implementation: TENSORFLOW_SERVER
        modelUri: "pvc://seldon-mnist-tfjob-m68v5-modelpvc/export"
        name: mnist-model
        parameters:
          - name: signature_name
            type: STRING
            value: predict_images
          - name: model_name
            type: STRING
            value: mnist-model
      name: default
      replicas: 1

I think it might be important that it uses a path (/export).
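One possible explanation for why the sub-path matters, offered as an assumption rather than something confirmed in this thread: TensorFlow Serving looks for numerically named version directories directly under its base path, and reports "No versions of servable ... found" when there are none. If the training job writes its versions under export/ (not verified here), pointing at the PVC root would leave no version directory at the top level. The sketch below uses hypothetical helper names and a temporary directory to show the check.

```python
import os
import tempfile


def servable_versions(base_path):
    """Return the numeric version directories found directly under base_path."""
    versions = []
    for name in os.listdir(base_path):
        # os.path.isdir follows symlinks, so linked version directories count too.
        if name.isdigit() and os.path.isdir(os.path.join(base_path, name)):
            versions.append(int(name))
    return sorted(versions)


def demo():
    """Compare the PVC root with the export sub-path for an assumed layout."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "export", "1"))  # assumed layout: export/<version>
    at_root = servable_versions(root)
    at_export = servable_versions(os.path.join(root, "export"))
    return at_root, at_export
```

Under that assumed layout, no versions are visible from the root while version 1 is visible from the export sub-path, which would be consistent with the modelUri ending in /export.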