bf2fc6cc711aee1a0c2a / kas-fleetshard

The kas-fleetshard-operator is responsible for provisioning and managing Kafka instances on a cluster. The kas-fleetshard-synchronizer synchronizes the state of a fleet shard with the kas-fleet-manager.
Apache License 2.0

MGDSTRM-9230 removing the drain cleaner pdb for a single replica #789

Closed shawkins closed 2 years ago

sonarcloud[bot] commented 2 years ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: 100.0%
Duplication: 0.0%

shawkins commented 2 years ago

@MikeEdgar @fvaleri @rareddy I was checking staging and production to confirm the current DisruptionsAllowed, and it appears that the drain cleaner webhook is not seen as installed - that is, the Kafka resources do not have an explicit PDB with maxUnavailable set to 0. However, I do see the drain cleaner webhook configuration, so I'm not sure what is going on. Can someone confirm we do indeed have an issue?
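For reference, a minimal fabric8 sketch of the kind of check described here; the namespace and PDB name are placeholders, and the expectation (assuming the webhook is detected) is maxUnavailable: 0, which surfaces as disruptionsAllowed: 0 in the status:

```java
// Minimal sketch - namespace/name are placeholders for a managed Kafka instance.
import io.fabric8.kubernetes.api.model.policy.v1.PodDisruptionBudget;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class PdbCheck {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            PodDisruptionBudget pdb = client.policy().v1().podDisruptionBudget()
                    .inNamespace("kafka-namespace")
                    .withName("my-kafka-kafka")
                    .get();
            if (pdb == null) {
                // no explicit PDB => the operator did not see the drain cleaner webhook
                System.out.println("No explicit PDB found");
            } else {
                // with the webhook detected we expect maxUnavailable: 0 / disruptionsAllowed: 0
                System.out.println("maxUnavailable: " + pdb.getSpec().getMaxUnavailable());
                System.out.println("disruptionsAllowed: " + pdb.getStatus().getDisruptionsAllowed());
            }
        }
    }
}
```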

MikeEdgar commented 2 years ago

My theory (untested) is that the order of add/delete events for the webhook during a bundle upgrade is such that the value of drainCleanerWebhookFound ends up as false.

Perhaps when there is a change event we should list them and set the boolean if the result is non-empty.
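A minimal sketch of that idea, assuming a fabric8 client; the class name and the label selector used to identify the drain cleaner's webhook configuration are hypothetical, not the actual FSO code:

```java
// Illustrative sketch only - not the actual FSO implementation.
import io.fabric8.kubernetes.api.model.admissionregistration.v1.ValidatingWebhookConfiguration;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;

import java.util.concurrent.atomic.AtomicBoolean;

class DrainCleanerDetector implements ResourceEventHandler<ValidatingWebhookConfiguration> {

    private final KubernetesClient client;
    final AtomicBoolean drainCleanerWebhookFound = new AtomicBoolean();

    DrainCleanerDetector(KubernetesClient client) {
        this.client = client;
        // watch the drain cleaner's ValidatingWebhookConfiguration(s)
        client.admissionRegistration().v1().validatingWebhookConfigurations()
              .withLabel("app", "strimzi-drain-cleaner") // hypothetical label selector
              .inform(this);
    }

    // On every change event, re-list instead of toggling the flag per add/delete,
    // so overlapping configurations during a bundle upgrade cannot leave it false.
    private void refresh() {
        boolean found = !client.admissionRegistration().v1().validatingWebhookConfigurations()
                .withLabel("app", "strimzi-drain-cleaner")
                .list().getItems().isEmpty();
        drainCleanerWebhookFound.set(found);
    }

    @Override
    public void onAdd(ValidatingWebhookConfiguration obj) { refresh(); }

    @Override
    public void onUpdate(ValidatingWebhookConfiguration oldObj, ValidatingWebhookConfiguration newObj) { refresh(); }

    @Override
    public void onDelete(ValidatingWebhookConfiguration obj, boolean deletedFinalStateUnknown) { refresh(); }
}
```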

shawkins commented 2 years ago

Perhaps when there is a change event we should list them and set the boolean if the result is non-empty.

If it's possible that there can be more than 1 of these at a time, then yes we could have a timing issue here - you'd have something like the event sequence add 1, add 2, delete 1, which would leave us in the false state.

Perhaps when there is a change event we should list them and set the boolean if the result is non-empty.

Yes the informer cache can be consulted on delete.
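Building on the sketch above (same assumed client, flag, and hypothetical label), the informer-cache variant would only clear the flag on delete when the local store is actually empty:

```java
// Variation on the earlier sketch: consult the informer's local cache on delete
// instead of re-listing from the API server.
SharedIndexInformer<ValidatingWebhookConfiguration> informer =
        client.admissionRegistration().v1().validatingWebhookConfigurations()
              .withLabel("app", "strimzi-drain-cleaner") // hypothetical label selector
              .runnableInformer(0);

informer.addEventHandler(new ResourceEventHandler<ValidatingWebhookConfiguration>() {
    @Override
    public void onAdd(ValidatingWebhookConfiguration obj) {
        drainCleanerWebhookFound.set(true);
    }

    @Override
    public void onUpdate(ValidatingWebhookConfiguration oldObj, ValidatingWebhookConfiguration newObj) {
        drainCleanerWebhookFound.set(true);
    }

    @Override
    public void onDelete(ValidatingWebhookConfiguration obj, boolean deletedFinalStateUnknown) {
        // only clear the flag when nothing is left in the informer's store,
        // so the "add 1, add 2, delete 1" sequence no longer ends in the false state
        drainCleanerWebhookFound.set(!informer.getStore().list().isEmpty());
    }
});

informer.run();
```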

I'll log a JIRA - MGDSTRM-9388

rareddy commented 2 years ago

@shawkins, @MikeEdgar should we devise an alert for when the webhook is missing? We expect the drain cleaner to always be available, correct?

MikeEdgar commented 2 years ago

We expect the drain cleaner to always be available, correct?

At this point, yes. I'm not quite grasping the case where drain cleaner would not be present for developer instances (need to reread the thread above).

shawkins commented 2 years ago

should we devise an alert for when the webhook is missing? We expect the drain cleaner to always be available, correct?

Are there similar alerts for the other bundle components - operators, sync?

At this point, yes. I'm not quite grasping the case where drain cleaner would not be present for developer instances (need to reread the thread above).

I'm gleaning that @fvaleri's preferred longer-term solution in a production developer cluster is to not have the drain cleaner installed at all, or at least to have the webhook uninstalled - as there's no value to having it running in that environment.

rareddy commented 2 years ago

Are there similar alerts for the other bundle components - operators, sync?

No, but with Sync and FSO we will see other stuff start failing; they are not driven by an external trigger the way the drain cleaner is, IMO.

as there's no value to having it running in that environment.

I am not sure about the amount of resource savings, but I would favor a consistent deployment model over small savings.

MikeEdgar commented 2 years ago

as there's no value to having it running in that environment.

I am not sure about the amount of resource savings, but I would favor a consistent deployment model over small savings.

I agree, I would first want to look at proposing some selection criteria in drain cleaner to skip particular pods if necessary.

shawkins commented 2 years ago

No, but with Sync and FSO we will see other stuff start failing; they are not driven by an external trigger the way the drain cleaner is, IMO.

The validating webhook configuration is part of the bundle, just like the operator deployments. So I'm just trying to clarify: are you looking for some kind of alert that nothing from the bundle has been inadvertently deleted?

Or are you saying that you want a metric to be exposed from the FSO about whether it thinks the drain cleaner validating webhook configuration is installed, so that you can have an alert based upon that?
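If the FSO did expose such a metric, a Micrometer gauge along these lines could back the alert; the metric name and wiring are purely hypothetical, not an existing FSO metric:

```java
// Hypothetical metric - not an existing FSO metric name.
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

import java.util.concurrent.atomic.AtomicBoolean;

class DrainCleanerMetrics {

    DrainCleanerMetrics(MeterRegistry registry, AtomicBoolean drainCleanerWebhookFound) {
        // 1 when the drain cleaner validating webhook configuration is seen as installed,
        // 0 otherwise; an alert could fire when this stays at 0 where it is expected.
        Gauge.builder("kas_fleetshard_drain_cleaner_webhook_present",
                      drainCleanerWebhookFound, found -> found.get() ? 1.0 : 0.0)
             .register(registry);
    }
}
```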

I agree, I would first want to look at proposing some selection criteria in drain cleaner to skip particular pods if necessary.

We should separate out the longer-term issue. I think we can move forward with this for now.

MikeEdgar commented 2 years ago

We should separate out the longer-term issue. I think we can move forward with this for now.

+1 - is this for 0.27.0?

shawkins commented 2 years ago

+1 - is this for 0.27.0?

I'd vote yes.

rareddy commented 2 years ago

I thought the webhook configuration was not part of the bundle, but was based on some environment-specific config in OSD.

FSO about whether it thinks the drain cleaner validating webhook configuration is installed

I was thinking about this scenario; given the above, I am not sure it makes sense anymore. Can you give the link to the bundle config for my understanding? Thanks.

shawkins commented 2 years ago

I was thinking about this scenario; given the above, I am not sure it makes sense anymore. Can you give the link to the bundle config for my understanding? Thanks.

https://gitlab.cee.redhat.com/mk-ci-cd/rhosak-pipeline-configs/-/blob/rhosak-0.1-rhel-8/distgit/containers/rhosak-kas-strimzi-operator-bundle/src/templates/kas-strimzi-bundle.clusterserviceversion.yaml#L57

fvaleri commented 2 years ago

+0

Just for the record: my preference is not for saving some resources, but to avoid the race between K8s draining and the drain cleaner, which may cause 2 rolling restarts (not a big issue for single-broker instances). If there is another way to avoid this without splitting the bundle (maybe a skip flag based on an env var), that would be best.
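A rough sketch of what such an env-var-based skip flag could look like on the operator side; the variable name and helper are purely hypothetical:

```java
// Purely hypothetical env var and helper, just to illustrate the skip-flag idea.
final class DrainCleanerConfig {

    private DrainCleanerConfig() {
    }

    // e.g. set DRAIN_CLEANER_DISABLED=true on developer/single-broker data planes
    static boolean drainCleanerDisabled() {
        return Boolean.parseBoolean(
                System.getenv().getOrDefault("DRAIN_CLEANER_DISABLED", "false"));
    }
}

// usage sketch: only set up the blocking PDB (maxUnavailable: 0) when the
// drain cleaner is actually meant to handle evictions
// if (!DrainCleanerConfig.drainCleanerDisabled() && drainCleanerWebhookFound.get()) { ... }
```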

rareddy commented 2 years ago

@fvaleri with @shawkins's modification this won't be enabled in a single-broker environment, avoiding the issue you mention. For standard sizes there is no change to current behavior. I am failing to see the concern, can you please explain?

fvaleri commented 2 years ago

@rareddy this change avoids the creation of the PDB required to block K8s draining when using the drain cleaner. Without the PDB, the pod eviction notifications will still reach the drain cleaner, which will annotate involved Kafka and ZK pods for restart.
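For context, a rough fabric8 sketch of that effect: the drain cleaner marks the evicted pod for a roll by the cluster operator via the strimzi.io/manual-rolling-update annotation. The namespace and pod name are placeholders, and this is only an illustration of the behavior, not the drain cleaner's actual code:

```java
// Illustration of the drain cleaner's effect on an eviction notification:
// annotate the affected Kafka/ZooKeeper pod so the cluster operator rolls it.
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class AnnotateForRestart {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.pods().inNamespace("kafka-namespace").withName("my-kafka-kafka-0")
                  .edit(pod -> new PodBuilder(pod)
                          .editMetadata()
                              .addToAnnotations("strimzi.io/manual-rolling-update", "true")
                          .endMetadata()
                          .build());
        }
    }
}
```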

shawkins commented 2 years ago

https://issues.redhat.com/browse/MGDSTRM-9409 was created to address this more completely.