Closed · shawkins closed this 2 years ago
Checked locally again, this looks good. For a developer instance you see the creation of these deployments:
- name-admin-server
- name-canary
- name-cruisecontrol
- name-exporter
- name-kafka
- name-zookeeper

Each creates just a single replica of a pod with resources matching the real thing.
I will take a look tomorrow.
I checked with my "kas-installer" based install, modifying the sync to mimic the control plane adding the label, and I verified that all the needed deployments are created in reserved mode and that they relinquish the space when a real instance needs the resources.
The only code change request I have: instead of "pausepod" or "pausedeployment", can we generally use "reservedpod" or "reserveddeployment" and leave the technical name "pause pod" just to the image name?
Done - now actually done. My amended commit didn't quite make it yesterday. Now it has, but something odd happened in the history - which won't matter once this is squashed. There was also a copy paste error with the cruise control deployment that has been corrected.
@machi1990 you can use kas-installer; in its operators directory there is a script to generate the bundle. You also need to override the env property KAS_FLEETSHARD_OLM_INDEX_IMAGE.
Perfect, thanks. I didn't know we had the ability to generate the bundle in kas-installer.
What are the timelines for this one being merged and released?
@machi1990 Mike has worked the bundling issue for the priority class, so I think we'll merge this soon. Do you want to make a further determination on the timeout issue you mentioned in chat first? Assuming that we're good this should go into .27, but I don't know a date for that.
Thanks.
@machi1990 Do you want to make a further determination on the timeout issue you mentioned in chat first?
Thanks. I've replied via chat; I also don't think the issue is because of reserved capacity, but more generally it could be because of autoscaling (a node took longer than 420000ms to come up for ZK to be provisioned). So the issue doesn't have to hold up this PR. I'll try to reproduce the scenario and share the exact message with you.
If there are no objections, I'll add another commit here to cover the blast radius expectation.
I don't object, but isn't this change pretty well contained already? I.e. deploying the operator with this wouldn't cause rolling pods, etc.
Sorry, I put this comment on the wrong PR. It should be on the other one that is enforcing the max.
Updating the PR to be independent of capacity reporting. If the operator sees a deployment marked as reserved, it won't actually create the expected resources. Instead, deployments with pause pods at a lower priority will be created, with resources and replicas that match the expected instances. The parent ManagedKafka will be marked as ready once all of these deployments are ready.
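For illustration only, one of these reserved deployments would look roughly like the sketch below; the name, pause image, and resource values here are placeholders I made up, since the operator derives replicas and resources from the expected instance:

```yaml
# Rough sketch, not the operator's actual output.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: name-zookeeper            # mirrors the deployment the real instance would create
spec:
  replicas: 3                     # matches the expected instance's replica count
  selector:
    matchLabels:
      app: name-zookeeper
  template:
    metadata:
      labels:
        app: name-zookeeper
    spec:
      priorityClassName: kas-fleetshard-reservation   # lower priority, so real pods can displace these
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # assumed pause image; it only needs to hold the requested resources
        resources:
          requests:
            cpu: "1"
            memory: 4Gi
          limits:
            cpu: "1"
            memory: 4Gi
```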
The priority class kas-fleetshard-reservation (which I'm open to changing) was added to the kubernetes.yml in the project, as we need it to be part of the workload; this diverges from the kas-fleetshard-high class used in the bundle for the operator components.
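As a rough sketch (not the actual kubernetes.yml contents), the reservation class could be defined along these lines; the value and preemption policy shown are assumptions, the only real requirement being that it sits below the class used for real instance workloads:

```yaml
# Sketch only; values are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: kas-fleetshard-reservation
value: 0                          # must be lower than the priority of real instance pods
globalDefault: false
preemptionPolicy: Never           # assumed: reserved pods should not preempt anything themselves
description: "Low-priority class for reserved-capacity pause pods"
```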
There is no interaction with metrics, as those are derived from Kafka resources.
This change has no expectation that profile-specific machine pools will be in use.
I'll run this on my local cluster again to make sure I didn't miss anything pulling this over to main.