
Use self-hosted runners to run integration tests #225

Closed: kimwnasptd closed this issue 11 months ago

kimwnasptd commented 11 months ago

What needs to get done

Update the GH action to run the tests on the self-hosted runners (a sketch of the workflow change is shown below). This could potentially require:

We should also keep in mind random issues like https://github.com/canonical/charmed-kubeflow-uats/issues/50.
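
As a rough sketch, the main change would be pointing the integration job's runs-on at the self-hosted runner labels instead of a GitHub-hosted image. The job name, runner labels and tox environment below are assumptions for illustration, not the contents of the actual workflow:

# Hypothetical excerpt of the integration workflow; job name, runner labels
# and test command are assumptions, not the real workflow contents.
integration:
  name: Integration tests
  # previously something like: runs-on: ubuntu-22.04
  runs-on: [self-hosted, linux, X64]
  steps:
    - uses: actions/checkout@v4
    - name: Run integration tests
      run: tox -e integration   # assumed tox environment name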

Why it needs to get done

We've seen the Seldon tests fail many times, for reasons we believe are related to resource constraints: https://github.com/canonical/seldon-core-operator/issues/203#issuecomment-1701479507

Hopefully, by using self-hosted runners, the tests will be less flaky and will complete without issues.

orfeas-k commented 11 months ago

Tried to tackle this in https://github.com/canonical/seldon-core-operator/pull/226. The tests ran on self-hosted runners, but the SeldonDeployment mlflow-v1 didn't succeed and went to Failed.

From its YAML, we see that the deployment does not have minimum availability.

status:
  address:
    url: http://mlflow-default.default.svc.cluster.local:8000/api/v1.0/predictions
  conditions:
  - lastTransitionTime: "2023-11-27T08:08:37Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: DeploymentsReady
[..]
  - lastTransitionTime: "2023-11-27T08:18:39Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-11-27T08:08:37Z"
    reason: Not all services created
    status: "False"
    type: ServicesReady
  deploymentStatus:
    mlflow-default-0-classifier:
      replicas: 1
  description: Deployment is no longer progressing and not available.
  replicas

Taking a look at its logs, they are as expected, and at the bottom we see a bunch of messages like:

2023-11-27T08:23:39.2630534Z {"level":"error","ts":1701072568.0599566,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}

This message is expected while the deployment is not ready yet.

From the above, we cannot conclude the root cause of the issue with certainty, but since there are no other specific logs, it could be that the deployment does not have enough resources.
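
If resource starvation is indeed the cause, one way to confirm it would be to give the model container explicit requests and limits through the SeldonDeployment componentSpecs. The sketch below uses the standard Seldon Core v1 fields; the model URI and the request/limit sizes are placeholders for illustration, not values from the failing test:

# Hypothetical SeldonDeployment with explicit resources on the model container.
# modelUri and the request/limit sizes are placeholders, not real test values.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mlflow-v1
spec:
  predictors:
    - name: default
      graph:
        name: classifier
        implementation: MLFLOW_SERVER
        modelUri: gs://example-bucket/example-model   # placeholder
      componentSpecs:
        - spec:
            containers:
              - name: classifier          # must match the graph node name
                resources:
                  requests:
                    cpu: "1"
                    memory: 1Gi
                  limits:
                    memory: 2Gi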

Rejecting

Given all the above, we will reject this task and instead split the test into multiple ones. This way we will run each server in its own environment, which should rule out servers failing due to insufficient resources; a rough sketch of that split is shown below. See https://github.com/canonical/seldon-core-operator/issues/229
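
For reference, one way to do the split is a GitHub Actions matrix, so that each server's test runs in its own job and therefore its own environment. The server list, runner image and test-selection flag below are assumptions for illustration; the actual split is tracked in the issue above:

# Hypothetical matrix split; server list, runner image and test selection
# flag are assumptions, not the final workflow contents.
integration:
  strategy:
    fail-fast: false                       # let the other servers keep running
    matrix:
      server: [mlflow, sklearn, xgboost]   # assumed server list
  runs-on: ubuntu-22.04
  steps:
    - uses: actions/checkout@v4
    - name: Run integration tests for one server
      run: tox -e integration -- -k ${{ matrix.server }}   # assumed pytest selection via tox posargs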