canonical / seldonio-rocks

ROCKs for Seldon Core
Apache License 2.0

tensorflow-serving ROCK integration tests fail #83

Closed orfeas-k closed 9 months ago

orfeas-k commented 9 months ago

As we can see in the ROCKs integration PR, the tests that use this server fail.

Debugging

What I've observed so far

docker run

Both the upstream image and the ROCK behave (approximately) the same under docker run: each fails with the expected Could not find base path /models/model error, since no model directory is mounted.

╰─$ docker run tensorflow/serving:2.1.0                         
2024-01-15 16:03:43.701551: I tensorflow_serving/model_servers/server.cc:86] Building single TensorFlow model file config:  model_name: model model_base_path: /models/model
2024-01-15 16:03:43.701845: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2024-01-15 16:03:43.701855: I tensorflow_serving/model_servers/server_core.cc:573]  (Re-)adding model: model
2024-01-15 16:03:43.701992: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:362] FileSystemStoragePathSource encountered a filesystem access error: Could not find base path /models/model for servable model

╰─$ docker run charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5
2024-01-15T16:04:15.170Z [pebble] Started daemon.
2024-01-15T16:04:15.177Z [pebble] POST /v1/services 6.265239ms 202
2024-01-15T16:04:15.177Z [pebble] Started default services with change 1.
2024-01-15T16:04:15.180Z [pebble] Service "tensorflow-serving" starting: bash -c 'tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"'
2024-01-15T16:04:15.254Z [tensorflow-serving] 2024-01-15 16:04:15.254664: I external/org_tensorflow/tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-15T16:04:15.293Z [tensorflow-serving] 2024-01-15 16:04:15.293029: I tensorflow_serving/model_servers/server.cc:74] Building single TensorFlow model file config:  model_name: model model_base_path: /models/model
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294340: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294352: I tensorflow_serving/model_servers/server_core.cc:594]  (Re-)adding model: model
2024-01-15T16:04:15.294Z [tensorflow-serving] 2024-01-15 16:04:15.294922: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:353] FileSystemStoragePathSource encountered a filesystem access error: Could not find base path /models/model for servable model with error NOT_FOUND: /models/model not found

However, when the charm uses the ROCK for the server and we apply the tf-serving or hpt CRs, those SeldonDeployments do not behave as expected. As a result, their tests time out since they cannot extract a prediction.

╰─$ kl hpt-default-0-classifier-6b9fc6cfbf-bzfqq -c classifier
error: unknown flag `port'

This --port argument is passed by the ROCK itself, which is also how it is done upstream.

╰─$ kl hpt-default-0-classifier-6b9fc6cfbf-bzfqq --all-containers                                             1 ↵
2024/01/15 16:21:40 NOTICE: Config file "/.rclone.conf" not found - using defaults
2024/01/15 16:21:41 INFO  : 00000123/saved_model.pb: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/variables/variables.data-00000-of-00001: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/variables/variables.index: Copied (new)
2024/01/15 16:21:42 INFO  : 00000123/assets/foo.txt: Copied (new)
2024/01/15 16:21:42 INFO  : 
Transferred:       12.058 KiB / 12.058 KiB, 100%, 0 B/s, ETA -
Transferred:            4 / 4, 100%
Elapsed time:         1.6s

error: unknown flag `port'
{"level":"info","ts":1705335761.3749456,"logger":"entrypoint","msg":"Full health checks ","value":false}
{"level":"info","ts":1705335761.3751297,"logger":"entrypoint.maxprocs","msg":"maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined"}
{"level":"info","ts":1705335761.3751352,"logger":"entrypoint","msg":"Hostname unset will use localhost"}
{"level":"info","ts":1705335761.3769522,"logger":"entrypoint","msg":"Starting","worker":1}
{"level":"info","ts":1705335761.3769732,"logger":"entrypoint","msg":"Starting","worker":2}
{"level":"info","ts":1705335761.376975,"logger":"entrypoint","msg":"Starting","worker":3}
{"level":"info","ts":1705335761.3769767,"logger":"entrypoint","msg":"Starting","worker":4}
{"level":"info","ts":1705335761.3769782,"logger":"entrypoint","msg":"Starting","worker":5}
{"level":"info","ts":1705335761.37698,"logger":"entrypoint","msg":"Starting","worker":6}
{"level":"info","ts":1705335761.3769813,"logger":"entrypoint","msg":"Starting","worker":7}
{"level":"info","ts":1705335761.376983,"logger":"entrypoint","msg":"Starting","worker":8}
{"level":"info","ts":1705335761.3769846,"logger":"entrypoint","msg":"Starting","worker":9}
{"level":"info","ts":1705335761.376987,"logger":"entrypoint","msg":"Starting","worker":10}
{"level":"info","ts":1705335761.3774252,"logger":"entrypoint","msg":"Running http server ","port":8000}
{"level":"info","ts":1705335761.3774323,"logger":"entrypoint","msg":"Creating non-TLS listener","port":8000}
{"level":"info","ts":1705335761.3775222,"logger":"entrypoint","msg":"Running grpc server ","port":5001}
{"level":"info","ts":1705335761.377525,"logger":"entrypoint","msg":"Creating non-TLS listener","port":5001}
{"level":"info","ts":1705335761.377585,"logger":"entrypoint","msg":"Setting max message size ","size":2147483647}
{"level":"info","ts":1705335761.3777068,"logger":"entrypoint","msg":"gRPC server started"}
{"level":"info","ts":1705335761.3780322,"logger":"SeldonRestApi","msg":"Listening","Address":"0.0.0.0:8000"}
{"level":"info","ts":1705335761.3780477,"logger":"entrypoint","msg":"http server started"}
{"level":"error","ts":1705335781.3396814,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}
{"level":"error","ts":1705335782.2400084,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}

Looking at the seldon-core logs, I don't see anything unexpected, but I'm attaching them here for reference: seldon-core container logs.txt. It logs the following reconciler error, but after that it seems to reconcile without errors.

Failed to update InferenceService status","SeldonDeployment":"default/hpt"

Reproduce

  1. Replace the image in images-list (configmap__predictor__tensorflow__tensorflow), or directly in the charm's configmap (the TENSORFLOW_SERVER.protocols.tensorflow fields), with the published image charmedkubeflow/tensorflow-serving:2.13.0-b99a1d5 (a sketch of the edited entry follows these steps).
  2. Run either tox -e seldon-servers-integration -- --model testing -k tensorflow or tox -e seldon-servers-integration -- --model testing -k tf-serving
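
For reference, a minimal sketch of what the edited predictor-server entry could look like. The field layout is an assumption inferred from the TENSORFLOW_SERVER.protocols.tensorflow path mentioned above, not a verbatim copy of the charm's configmap:

TENSORFLOW_SERVER:
  protocols:
    tensorflow:
      # hypothetical layout; verify against the actual charm configmap
      image: charmedkubeflow/tensorflow-serving   # ROCK instead of tensorflow/serving
      defaultImageVersion: 2.13.0-b99a1d5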

Environment

Juju 3.1, MicroK8s 1.26

Note

It looks like the tests had passed in the ROCKs repo and the ROCK was published because we hadn't configured the tests properly.

orfeas-k commented 9 months ago

Turns out this was the same error described in https://github.com/canonical/kserve-rocks/issues/11#issuecomment-1845145149 and https://github.com/canonical/rockcraft/issues/382#issue-1952074241. The error: unknown flag `port' message was coming from the container's pebble service, since it received arguments it didn't recognise. That happened because the arguments were passed in by the created SeldonDeployment pod (see part of its yaml below):

...  
containers:
  - args:
    - --port=9500
    - --rest_api_port=9000
    - --model_name=classifier
    - --model_base_path=/mnt/models
    env:
...

This was fixed by using the entrypoint-service field alongside [ args ] in the command field (see the change here). This way, arguments are passed to tensorflow-serving instead of to pebble.
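
For illustration, a minimal sketch of what the fixed rockcraft.yaml snippet could look like; the service name and command below are assumptions reconstructed from the Pebble logs above, so see the linked change for the actual diff. With entrypoint-service set, the arguments inside [ ] act as overridable defaults, so the args injected by the SeldonDeployment pod replace them and reach tensorflow_model_server directly:

# Hypothetical rockcraft.yaml excerpt; names inferred from the logs above
entrypoint-service: tensorflow-serving
services:
  tensorflow-serving:
    override: replace
    startup: enabled
    # The bracketed args are defaults; OCI container args (e.g. the
    # --port/--rest_api_port flags from the SeldonDeployment pod) replace them.
    command: tensorflow_model_server [ --port=8500 --rest_api_port=8501 --model_name=model --model_base_path=/models/model ]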