Open orfeas-k opened 9 months ago
Tried running the tests on EKS too. The situation was a bit (but not much) different here.
The deployment goes to Ready 1/1 with the following conditions:
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2023-10-04T11:12:04Z"
    lastUpdateTime: "2023-10-04T11:12:04Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2023-10-04T11:10:24Z"
    lastUpdateTime: "2023-10-04T11:12:04Z"
    message: ReplicaSet "seldon-model-1-example-0-classifier-5b844bbc69" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 2249
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
However, observedGeneration keeps increasing forever. This results (I guess) in the SeldonDeployment (the custom resource that created the aforementioned deployment) getting stuck at state: Creating.
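To confirm that observedGeneration really keeps climbing (rather than having jumped once), one could sample it a few times. The sketch below is only an illustration: the function names, the sample count and the interval are made up here, and it assumes kubectl is on the PATH and pointed at the affected cluster.

```python
import subprocess
import time
from typing import List, Sequence


def read_observed_generation(name: str, namespace: str) -> int:
    """Read .status.observedGeneration of a Deployment through kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "deployment", name, "-n", namespace,
         "-o", "jsonpath={.status.observedGeneration}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out)


def sample_generations(name: str, namespace: str,
                       count: int = 5, interval_s: float = 10.0) -> List[int]:
    """Collect `count` readings, sleeping `interval_s` seconds between them."""
    samples = []
    for i in range(count):
        samples.append(read_observed_generation(name, namespace))
        if i < count - 1:
            time.sleep(interval_s)
    return samples


def is_churning(samples: Sequence[int], min_increases: int = 3) -> bool:
    """True if the value rose between consecutive samples at least `min_increases` times."""
    increases = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    return increases >= min_increases
```

For example, is_churning(sample_generations("seldon-model-1-example-0-classifier", "kubeflow")) would return True if the Deployment spec is being rewritten over and over. Since metadata.generation only changes when the spec changes, a steadily rising value suggests that something (possibly the Seldon controller, which is a guess) keeps updating the Deployment.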
I can see a bunch of these errors in the controller logs here too:
2023-10-04T12:48:53.557474818Z stdout F 2023-10-04T12:48:53.557Z [seldon-core] {"level":"error","ts":1696423733.5572436,"logger":"controller.seldon-controller-manager","msg":"Reconciler error","reconciler group":"machinelearning.seldon.io","reconciler kind":"SeldonDeployment","name":"seldon-model-1","namespace":"kubeflow","error":"Operation cannot be fulfilled on deployments.apps \"seldon-model-1-example-0-classifier\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.2/pkg/internal/controller/controller.go:227"}
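The "object has been modified" message is the Kubernetes API server rejecting an update that was made against a stale version of the object. To count how often the controller hits it, the JSON payload can be pulled out of each log line. This is a rough sketch: the helper names are invented for illustration, and SAMPLE is an abridged copy of the line above (the stacktrace and reconciler fields are dropped).

```python
import json

# Abridged copy of the controller log line quoted above (some fields omitted).
SAMPLE = (
    '2023-10-04T12:48:53.557474818Z stdout F 2023-10-04T12:48:53.557Z [seldon-core] '
    '{"level":"error","ts":1696423733.5572436,'
    '"logger":"controller.seldon-controller-manager","msg":"Reconciler error",'
    '"name":"seldon-model-1","namespace":"kubeflow",'
    '"error":"Operation cannot be fulfilled on deployments.apps '
    '\\"seldon-model-1-example-0-classifier\\": the object has been modified; '
    'please apply your changes to the latest version and try again"}'
)


def parse_log_line(line):
    """Return the JSON payload of a log line as a dict, or None if it has none."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def is_update_conflict(entry):
    """True if the parsed entry is a reconciler error caused by an update conflict."""
    return (
        entry is not None
        and entry.get("msg") == "Reconciler error"
        and "the object has been modified" in entry.get("error", "")
    )


if __name__ == "__main__":
    entry = parse_log_line(SAMPLE)
    print(entry["name"], entry["namespace"], is_update_conflict(entry))
```

Running the same two helpers over a full log dump would show whether the conflicts come at a steady rate, which would fit with the ever-increasing observedGeneration.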
Tried applying the same seldondeployment in a newly created namespace and the deployment goes to Ready there too. Thus, we conclude that the deployment will work in any namespace that is not the model namespace.
For now, we've implemented a workaround in our tests in #220 where we apply seldondeployments to the default namespace.
To summarize the above: if seldondeployments applied to the namespace created by the juju model fail to reach state: ready, create and apply them in a different namespace than the one created by juju (kubeflow for CKF deployments).
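As a rough sketch of that workaround, the kubectl calls could be wrapped like this. The manifest file name and the namespace in the usage note are placeholders rather than anything taken from the charm's tests, and the helper names are invented for illustration.

```python
import subprocess
from typing import List


def workaround_commands(manifest_path: str, namespace: str,
                        create_namespace: bool = True) -> List[List[str]]:
    """Build the kubectl calls that apply a manifest outside the juju model namespace."""
    commands = []
    if create_namespace:
        # Skip this step for namespaces that already exist (e.g. default),
        # because `kubectl create namespace` fails on an existing one.
        commands.append(["kubectl", "create", "namespace", namespace])
    commands.append(["kubectl", "apply", "-f", manifest_path, "-n", namespace])
    return commands


def apply_in_namespace(manifest_path: str, namespace: str,
                       create_namespace: bool = True) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in workaround_commands(manifest_path, namespace, create_namespace):
        subprocess.run(cmd, check=True)
```

For instance, apply_in_namespace("seldondeployment.yaml", "default", create_namespace=False) mirrors what the tests do, while apply_in_namespace("seldondeployment.yaml", "seldon-test") would use a fresh namespace.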
We ran into this today with seldon-core 1.16.0.
The Deployment is also stuck in the Progressing state and keeps the old ReplicaSet with a pod running.
conditions:
  - type: Available
    status: 'True'
    lastUpdateTime: '2023-10-17T14:12:11Z'
    lastTransitionTime: '2023-10-17T14:12:11Z'
    reason: MinimumReplicasAvailable
    message: Deployment has minimum availability.
  - type: Progressing
    status: 'True'
    lastUpdateTime: '2023-10-25T13:42:27Z'
    lastTransitionTime: '2023-10-17T12:52:30Z'
    reason: NewReplicaSetCreated
    message: >-
      Created new replica set
      "model-name-predictor-0-model-5dbd4c788d"
For us the reason is a bit different: NewReplicaSetCreated.
Deleting the old ReplicaSet does not get it unstuck.
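The two reports differ in the reason of the Progressing condition, so a small check can tell them apart. This is a minimal sketch that assumes the Deployment status has already been loaded into a dict (for example from kubectl get deployment -o json); the function names are invented, and the two dicts only carry the fields quoted in this thread.

```python
def progressing_condition(status):
    """Return the Progressing condition from a Deployment status dict, or None."""
    for cond in status.get("conditions", []):
        if cond.get("type") == "Progressing":
            return cond
    return None


def rollout_finished(status):
    """True if the Progressing condition reports a completed rollout."""
    cond = progressing_condition(status)
    return (
        cond is not None
        and cond.get("status") == "True"
        and cond.get("reason") == "NewReplicaSetAvailable"
    )


# Fields copied from the EKS report earlier in this thread.
EKS_STATUS = {
    "observedGeneration": 2249,
    "conditions": [
        {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
        {"type": "Progressing", "status": "True", "reason": "NewReplicaSetAvailable"},
    ],
}

# Fields copied from the seldon-core 1.16.0 report.
SELDON_1_16_STATUS = {
    "conditions": [
        {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
        {"type": "Progressing", "status": "True", "reason": "NewReplicaSetCreated"},
    ],
}

if __name__ == "__main__":
    print(rollout_finished(EKS_STATUS), rollout_finished(SELDON_1_16_STATUS))
```

Note that the EKS Deployment passes this check even though its observedGeneration keeps climbing, so the condition on its own is probably not enough to say the SeldonDeployment is healthy.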
Bug Description
During the effort of updating the seldon charm to 1.17.1 at https://github.com/canonical/seldon-core-operator/pull/216, we bumped into the following bug when running the integration tests. The deployment created by the applied seldondeployment gets stuck with a condition ReplicaSet "X" is progressing (as a result, it's never ready). The issue, though, is that the underlying ReplicaSet creates a pod successfully and its status shows readyReplicas: 1. At the same time, the deployment has an observedGeneration of 7 (or something like that) while the ReplicaSet has its observedGeneration set to 1.
This looks like the same issue we've hit when trying to update the Seldon ROCKs, described in this issue's comments: https://github.com/canonical/seldonio-rocks/issues/37#issuecomment-1716758029.
Debugging
During debugging we tried to manually apply the aforementioned seldondeployment with kubectl apply -f yaml and noticed that in the default namespace, the deployment was progressing successfully.
Here are the namespaces' yaml outputs
Tests run in the namespace created by a test juju model. In that namespace, they try to apply a custom resource, which in turn creates a deployment. The issue is that, while the ReplicaSet creates a pod (successfully) and its status shows readyReplicas: 1, the deployment gets stuck with a condition ReplicaSet "X" is progressing (as a result, it's never ready). The same thing happens if I manually apply the custom resource in the testing namespace. However, if I apply it to the default namespace, the Deployment goes to Ready as expected.
To Reproduce
2.9.45 and Microk8s 1.24 (tried also with 1.26 but the issue persists)
tox -e charm-integration or tox -e seldon-servers-integration
Environment
Relevant log output
Additional context
No response