canonical / seldonio-rocks

ROCKs for Seldon Core
Apache License 2.0

tensorflow-serving ROCK integration tests fail #83

Closed orfeas-k closed 9 months ago

orfeas-k commented 9 months ago

As we can see in the ROCKs integration PR, the tests that use this server fail.

Debugging

What I've observed so far

docker run

Both the upstream image and the ROCK behave (approximately) the same under docker run: each fails with the expected Could not find base path /models/model error, since no model directory is mounted.

╰─$ docker run tensorflow/serving:2.1.0                         
2024-01-15 16:03:43.701551: I tensorflow_serving/model_servers/server.cc:86] Building single TensorFlow model file config:  model_name: model model_base_path: /models/model
2024-01-15 16:03:43.701845: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2024-01-15 16:03:43.701855: I tensorflow_serving/model_servers/server_core.cc:573]  (Re-)adding model: model
2024-01-15 16:03:43.701992: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:362] FileSystemStoragePathSource encountered a filesystem access error: Could not find base path /models/model for servable model

╰─$ docker run charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5
2024-01-15T16:04:15.170Z [pebble] Started daemon.
2024-01-15T16:04:15.177Z [pebble] POST /v1/services 6.265239ms 202
2024-01-15T16:04:15.177Z [pebble] Started default services with change 1.
2024-01-15T16:04:15.180Z [pebble] Service "tensorflow-serving" starting: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"'
2024-01-15T16:04:15.254Z [tensorflow-serving] 2024-01-15 16:04:15.254664: I external/org_tensorflow/tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-15T16:04:15.293Z [tensorflow-serving] 2024-01-15 16:04:15.293029: I tensorflow_serving/model_servers/server.cc:74] Building single TensorFlow model file config:  model_name: model model_base_path: /models/model
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294340: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294352: I tensorflow_serving/model_servers/server_core.cc:594]  (Re-)adding model: model
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294922: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:353] FileSystemStoragePathSource encountered a filesystem access error: Could not find base path /models/model for servable model with error NOT_FOUND: /models/model not found

However, when the charm uses the ROCK for the server and we apply the tf-serving or hpt CRs, those SeldonDeployments do not behave as expected. As a result, their tests time out since they cannot extract a prediction.

╰─$ kl hpt-default-0-classifier-6b9fc6cfbf-bzfqq -c classifier
error: unknown flag `port'

This --port argument is passed by the ROCK itself, which is also how it is done upstream.

╰─$ kl hpt-default-0-classifier-6b9fc6cfbf-bzfqq --all-containers                                             1 ↵
2024/01/15 16:21:40 NOTICE: Config file "/.rclone.conf" not found - using defaults
2024/01/15 16:21:41 INFO  : 00000123/saved_model.pb: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/variables/variables.data-00000-of-00001: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/variables/variables.index: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/assets/foo.txt: Copied (new)
2024/01/15 16:21:42 INFO  : 
Transferred:       12.058 KiB / 12.058 KiB, 100%, 0 B/s, ETA -
Transferred:            4 / 4, 100%
Elapsed time:         1.6s

error: unknown flag `port'
{"level":"info","ts":1705335761.3749456,"logger":"entrypoint","msg":"Full health checks ","value":false}
{"level":"info","ts":1705335761.3751297,"logger":"entrypoint.maxprocs","msg":"maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined"}
{"level":"info","ts":1705335761.3751352,"logger":"entrypoint","msg":"Hostname unset will use localhost"}
{"level":"info","ts":1705335761.3769522,"logger":"entrypoint","msg":"Starting","worker":1}
{"level":"info","ts":1705335761.3769732,"logger":"entrypoint","msg":"Starting","worker":2}
{"level":"info","ts":1705335761.376975,"logger":"entrypoint","msg":"Starting","worker":3}
{"level":"info","ts":1705335761.3769767,"logger":"entrypoint","msg":"Starting","worker":4}
{"level":"info","ts":1705335761.3769782,"logger":"entrypoint","msg":"Starting","worker":5}
{"level":"info","ts":1705335761.37698,"logger":"entrypoint","msg":"Starting","worker":6}
{"level":"info","ts":1705335761.3769813,"logger":"entrypoint","msg":"Starting","worker":7}
{"level":"info","ts":1705335761.376983,"logger":"entrypoint","msg":"Starting","worker":8}
{"level":"info","ts":1705335761.3769846,"logger":"entrypoint","msg":"Starting","worker":9}
{"level":"info","ts":1705335761.376987,"logger":"entrypoint","msg":"Starting","worker":10}
{"level":"info","ts":1705335761.3774252,"logger":"entrypoint","msg":"Running http server ","port":8000}
{"level":"info","ts":1705335761.3774323,"logger":"entrypoint","msg":"Creating non-TLS listener","port":8000}
{"level":"info","ts":1705335761.3775222,"logger":"entrypoint","msg":"Running grpc server ","port":5001}
{"level":"info","ts":1705335761.377525,"logger":"entrypoint","msg":"Creating non-TLS listener","port":5001}
{"level":"info","ts":1705335761.377585,"logger":"entrypoint","msg":"Setting max message size ","size":2147483647}
{"level":"info","ts":1705335761.3777068,"logger":"entrypoint","msg":"gRPC server started"}
{"level":"info","ts":1705335761.3780322,"logger":"SeldonRestApi","msg":"Listening","Address":"0.0.0.0:8000"}
{"level":"info","ts":1705335761.3780477,"logger":"entrypoint","msg":"http server started"}
{"level":"error","ts":1705335781.3396814,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}
{"level":"error","ts":1705335782.2400084,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}

Looking at the seldon-core logs, I don't see anything unexpected, but I'm attaching them here for reference: seldon-core container logs.txt. It logs the following reconciler error, but after that it seems to reconcile without errors.

Failed to update InferenceService status","SeldonDeployment":"default/hpt"

Reproduce

  1. Replace the image in images-list (configmap__predictor__tensorflow__tensorflow), or directly in the charm's configmap (the TENSORFLOW_SERVER.protocols.tensorflow fields), with the published image charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5 (a sketch of the edited entry follows these steps).
  2. Run either tox -e seldon-servers-integration -- --model testing -k tensorflow or tox -e seldon-servers-integration -- --model testing -k tf-serving
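
For reference, a minimal sketch of what the edited predictor-server entry could look like. The field layout is an assumption inferred from the TENSORFLOW_SERVER.protocols.tensorflow path mentioned above, not a verbatim copy of the charm's configmap:

TENSORFLOW_SERVER:
  protocols:
    tensorflow:
      # hypothetical layout; verify against the actual charm configmap
      image: charmedkubeflow/tensorflow-serving   # ROCK instead of tensorflow/serving
      defaultImageVersion: 2.13.0-b99a1d5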

Environment

Juju 3.1, MicroK8s 1.26

Note

It looks like the tests had passed in the ROCKs repo and the ROCK was published because we hadn't configured the tests properly.

orfeas-k commented 9 months ago

Turns out this was the same error described in https://github.com/canonical/kserve-rocks/issues/11#issuecomment-1845145149 and https://github.com/canonical/rockcraft/issues/382#issue-1952074241. The error: unknown flag `port' message was coming from the container's pebble service, since it received arguments it didn't recognise. That happened because the arguments were passed in by the created SeldonDeployment pod (see part of its yaml below):

...  
containers:
  - args:
    - --port=9500
    - --rest_api_port=9000
    - --model_name=classifier
    - --model_base_path=/mnt/models
    env:
...

This was fixed by using the entrypoint-service field alongside [ args ] in the command field (see the change here). This way, arguments are passed to tensorflow-serving instead of to pebble.
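
For illustration, a minimal sketch of what the fixed rockcraft.yaml snippet could look like; the service name and command below are assumptions reconstructed from the Pebble logs above, so see the linked change for the actual diff. With entrypoint-service set, the arguments inside [ ] act as overridable defaults, so the args injected by the SeldonDeployment pod replace them and reach tensorflow_model_server directly:

# Hypothetical rockcraft.yaml excerpt; names inferred from the logs above
entrypoint-service: tensorflow-serving
services:
  tensorflow-serving:
    override: replace
    startup: enabled
    # The bracketed args are defaults; OCI container args (e.g. the
    # --port/--rest_api_port flags from the SeldonDeployment pod) replace them.
    command: tensorflow_model_server [ --port=8500 --rest_api_port=8501 --model_name=model --model_base_path=/models/model ]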