Closed: kimwnasptd closed this 11 months ago
Tried to tackle this in https://github.com/canonical/seldon-core-operator/pull/226. The tests ran on self-hosted runners, but the SeldonDeployment mlflow-v1 didn't succeed and went to Failed.

I implemented more logging in order to get the failed pod's logs and the failed SeldonDeployment YAML, output with:
```yaml
- name: Get seldondeployments
  run: kubectl get seldondeployments -A -o yaml
  if: failure()
- name: Get logs from pods in `default` namespace
  run: kubectl get pods -n default | tail -n +2 | awk '{print $1}' | xargs -n1 kubectl -n default logs --all-containers
  if: failure()
```
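To illustrate what the `tail`/`awk` chain in the second step does, here is a minimal sketch that runs the same pipeline on an inlined, hypothetical `kubectl get pods` output (the pod names are illustrative, not from the actual run):

```shell
# Hypothetical sample of `kubectl get pods -n default` output, inlined so the
# pipeline can be demonstrated without a live cluster.
sample_output='NAME                          READY   STATUS    RESTARTS   AGE
mlflow-default-0-classifier   1/2     Running   3          10m
seldon-webhook-abc123         1/1     Running   0          12m'

# Same pipeline as in the workflow step: drop the header row (tail -n +2),
# then keep only the first column (the pod name) for xargs to consume.
printf '%s\n' "$sample_output" | tail -n +2 | awk '{print $1}'
```

Each printed name is then fed by `xargs -n1` into a separate `kubectl logs --all-containers` call.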
This is the run with logs for future reference, and these are the logs.zip.

From the SeldonDeployment's YAML, we see that the pod doesn't have minimum availability:
```yaml
status:
  address:
    url: http://mlflow-default.default.svc.cluster.local:8000/api/v1.0/predictions
  conditions:
  - lastTransitionTime: "2023-11-27T08:08:37Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: DeploymentsReady
  [..]
  - lastTransitionTime: "2023-11-27T08:18:39Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-11-27T08:08:37Z"
    reason: Not all services created
    status: "False"
    type: ServicesReady
  deploymentStatus:
    mlflow-default-0-classifier:
      replicas: 1
  description: Deployment is no longer progressing and not available.
  replicas
```
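When triaging dumps like this, the failing conditions can be filtered out mechanically. A minimal sketch, using an inlined and trimmed copy of the conditions above instead of a live cluster (the real fields also carry `lastTransitionTime`, `message`, and `reason`):

```shell
# Trimmed copy of the conditions above: within each entry, `status` appears
# before `type`, as in the real dump.
conditions='- status: "False"
  type: DeploymentsReady
- status: "False"
  type: Ready
- status: "False"
  type: ServicesReady'

# Remember the last seen status; when a type line follows, report it if the
# remembered status was "False" (quotes included, as in the YAML).
printf '%s\n' "$conditions" | awk '
  /status:/ { s = $NF }
  /type:/   { if (s == "\"False\"") print $NF " is not ready" }'
```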
Taking a look at its logs, they are as expected, and at the bottom we see a bunch of:
```
2023-11-27T08:23:39.2630534Z {"level":"error","ts":1701072568.0599566,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}
```
This message is expected while the deployment is not yet ready.

From the above, we can't conclude the root cause with certainty, but since there are no other specific logs, this could be due to the deployment not having enough resources.

Given all the above, we will reject this task and instead split the test into multiple ones. This way, each server will run in its own environment, eliminating the possibility of servers failing due to resource pressure. See https://github.com/canonical/seldon-core-operator/issues/229
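A hypothetical sketch of what such a split could look like as a GitHub Actions matrix job. The job name, server list, and `tox` invocation are all assumptions for illustration, not the actual workflow:

```yaml
# Illustrative only: one job instance per Seldon server, so each server
# gets a fresh environment and its own resource budget.
test-servers:
  runs-on: [self-hosted, linux]
  strategy:
    fail-fast: false  # a failure in one server's tests doesn't cancel the rest
    matrix:
      server: [mlflow-v1, sklearn, xgboost]  # assumed server list
  steps:
    - uses: actions/checkout@v4
    - name: Run integration tests for a single server
      run: tox -e integration -- -k ${{ matrix.server }}  # assumed test entrypoint
```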
What needs to get done
Update the GH action to run the tests via the self-hosted runners. This could potentially require:
- configuring lightkube to not trust env vars

We should also keep in mind random issues like https://github.com/canonical/charmed-kubeflow-uats/issues/50
Why it needs to get done
We've seen the Seldon tests fail many times, for reasons that we believe are resource-related: https://github.com/canonical/seldon-core-operator/issues/203#issuecomment-1701479507

Hopefully, by using self-hosted runners, the tests will be less flaky and will complete without issues.