Closed - shawkins closed this 2 years ago
@MikeEdgar @fvaleri @rareddy I was checking staging and production to confirm the current DisruptionsAllowed, and it appears that the drain cleaner webhook is not seen as installed - that is, the Kafka resources do not have an explicit PDB with maxUnavailable set to 0. However, I do see the drain cleaner webhook configuration, so I'm not sure what is going on. Can someone confirm we do indeed have an issue?
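For reference, this is roughly what such a PDB looks like when built with the fabric8 policy/v1 model - a hedged illustration with placeholder names and labels, not code taken from this repo. With maxUnavailable at 0 the eviction API rejects voluntary disruptions, which is what lets the drain cleaner drive the restart instead:

```java
import io.fabric8.kubernetes.api.model.IntOrString;
import io.fabric8.kubernetes.api.model.policy.v1.PodDisruptionBudget;
import io.fabric8.kubernetes.api.model.policy.v1.PodDisruptionBudgetBuilder;

class KafkaPdbSketch {
    // Builds a PDB with maxUnavailable: 0 so that normal evictions are blocked
    // and the drain cleaner handles the rolling restart instead.
    static PodDisruptionBudget kafkaPdb(String clusterName, String namespace) {
        return new PodDisruptionBudgetBuilder()
                .withNewMetadata()
                    .withName(clusterName + "-kafka")
                    .withNamespace(namespace)
                .endMetadata()
                .withNewSpec()
                    .withMaxUnavailable(new IntOrString(0))
                    .withNewSelector()
                        // placeholder label; the real selector targets the Kafka pods
                        .addToMatchLabels("strimzi.io/cluster", clusterName)
                    .endSelector()
                .endSpec()
                .build();
    }
}
```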
My theory (untested) is that the order of add/delete events for the webhook during a bundle upgrade is such that the value of `drainCleanerWebhookFound` ends up as `false`.
Perhaps when there is a change event we should list them and set the boolean if the result is non-empty.
> Perhaps when there is a change event we should list them and set the boolean if the result is non-empty.
If it's possible that there can be more than 1 of these at a time, then yes we could have a timing issue here - you'd have something like the event sequence add 1, add 2, delete 1, which would leave us in the false state.
> Perhaps when there is a change event we should list them and set the boolean if the result is non-empty.
Yes, the informer cache can be consulted on delete.
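To make that concrete, here is a rough sketch of the suggested approach - re-list from the informer cache on every event rather than flipping the flag per add/delete. Class and method names are invented for illustration; this is not the actual fleetshard code:

```java
import io.fabric8.kubernetes.api.model.admissionregistration.v1.ValidatingWebhookConfiguration;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

import java.util.concurrent.atomic.AtomicBoolean;

class DrainCleanerWebhookWatcher implements ResourceEventHandler<ValidatingWebhookConfiguration> {

    private final AtomicBoolean drainCleanerWebhookFound = new AtomicBoolean();
    private final SharedIndexInformer<ValidatingWebhookConfiguration> informer;

    // The informer is assumed to be filtered (by name or label) to just the
    // drain cleaner's ValidatingWebhookConfiguration.
    DrainCleanerWebhookWatcher(SharedIndexInformer<ValidatingWebhookConfiguration> informer) {
        this.informer = informer;
        informer.addEventHandler(this);
    }

    @Override
    public void onAdd(ValidatingWebhookConfiguration obj) {
        refresh();
    }

    @Override
    public void onUpdate(ValidatingWebhookConfiguration oldObj, ValidatingWebhookConfiguration newObj) {
        refresh();
    }

    @Override
    public void onDelete(ValidatingWebhookConfiguration obj, boolean deletedFinalStateUnknown) {
        refresh();
    }

    private void refresh() {
        // Consult the informer cache instead of trusting event ordering, so an
        // "add 1, add 2, delete 1" sequence during a bundle upgrade still
        // leaves the flag set to true.
        drainCleanerWebhookFound.set(!informer.getIndexer().list().isEmpty());
    }

    boolean isWebhookFound() {
        return drainCleanerWebhookFound.get();
    }
}
```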
I'll log a JIRA - MGDSTRM-9388
@shawkins @MikeEdgar should we devise an alert for when the webhook is missing? We expect the drain cleaner to always be available, correct?
> We expect the drain cleaner to always be available, correct?
At this point, yes. I'm not quite grasping the case where drain cleaner would not be present for developer instances (need to reread the thread above).
> Should we devise an alert for when the webhook is missing? We expect the drain cleaner to always be available, correct?
Are there similar alerts for the other bundle components - operators, sync?
> At this point, yes. I'm not quite grasping the case where drain cleaner would not be present for developer instances (need to reread the thread above).
I'm gleaning that @fvaleri's preferred longer-term solution in a production developer cluster is to not have the drain cleaner installed at all, or at least the webhook uninstalled - as there's no value to having it running in that environment.
> Are there similar alerts for the other bundle components - operators, sync?
No, but with Sync and FSO we will see other stuff start failing; they are not enabled by an external trigger the way the drain cleaner is, IMO.
> as there's no value to having it running in that environment.
I am not sure about the amount of resource savings, but I would favor a consistent deployment model over small savings.
> as there's no value to having it running in that environment.
>
> I am not sure about the amount of resource savings, but I would favor a consistent deployment model over small savings.
I agree, I would first want to look at proposing some selection criteria in drain cleaner to skip particular pods if necessary.
> No, but with Sync and FSO we will see other stuff start failing; they are not enabled by an external trigger the way the drain cleaner is, IMO.
The validating webhook configuration is part of the bundle, just like the operator deployments. So I'm just trying to clarify: are you looking for some kind of alert that nothing from the bundle has been inadvertently deleted?
Or are you saying that you want a metric to be exposed from the FSO about whether it thinks the drain cleaner validating webhook configuration is installed, so that you can have an alert based upon that?
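If the metric route were taken, a minimal Micrometer sketch might look like the following (the metric name and wiring are assumptions, not something that exists in the repo today); an alert rule could then fire whenever the gauge reports 0:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

import java.util.function.BooleanSupplier;

class DrainCleanerWebhookMetric {
    // webhookFound would be backed by whatever tracks drainCleanerWebhookFound.
    static void register(MeterRegistry registry, BooleanSupplier webhookFound) {
        Gauge.builder("drain_cleaner_webhook_present",
                        () -> webhookFound.getAsBoolean() ? 1 : 0)
                .description("1 when the drain cleaner ValidatingWebhookConfiguration is detected, 0 otherwise")
                .register(registry);
    }
}
```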
> I agree, I would first want to look at proposing some selection criteria in drain cleaner to skip particular pods if necessary.
We should separate out the longer-term issue. I think we can move forward with this for now.
> We should separate out the longer-term issue. I think we can move forward with this for now.
+1 - is this for 0.27.0?
> +1 - is this for 0.27.0?
I'd vote yes.
I thought the webhook configuration is not part of the bundle, but based on some environment-specific config in OSD.
> FSO about whether it thinks the drain cleaner validating webhook configuration is installed
I was thinking about this scenario; given the above, I am not sure it makes sense anymore. Can you give the link to the bundle config for my understanding? Thanks.
> I was thinking about this scenario; given the above, I am not sure it makes sense anymore. Can you give the link to the bundle config for my understanding? Thanks.
+0
Just for the record: my preference is not for saving some resources, but to avoid the race between K8s draining and the drain cleaner, which may cause 2 rolling restarts (not a big issue for single-broker instances). If there is another way to avoid this without splitting the bundle (maybe a skip flag based on an env var), that would be best.
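A tiny sketch of that env-var idea, purely hypothetical (the variable name is invented): the operator would consult the flag before adding the maxUnavailable: 0 PDB, so developer clusters could opt out without splitting the bundle:

```java
class DrainCleanerToggle {
    // Hypothetical env var; defaults to enabled so current behavior is unchanged.
    static boolean drainCleanerEnabled() {
        return Boolean.parseBoolean(
                System.getenv().getOrDefault("DRAIN_CLEANER_WEBHOOK_ENABLED", "true"));
    }
}
```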
@fvaleri with @shawkins' modification this won't be enabled in a single-broker environment, avoiding the issue you mention. For standard sizes there is no change in current behavior. I am failing to see the concern, can you please explain?
@rareddy this change avoids the creation of the PDB required to block K8s draining when using the drain cleaner. Without the PDB, the pod eviction notifications will still reach the drain cleaner, which will annotate involved Kafka and ZK pods for restart.
https://issues.redhat.com/browse/MGDSTRM-9409 was created to address this more completely.
Kudos, SonarCloud Quality Gate passed!
0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells
100.0% Coverage
0.0% Duplication