Mirantis / hmc

Apache License 2.0
23 stars 18 forks source link

Adding/removing providers in management object results in deletion of managed clusters. #703

Open a13x5 opened 1 day ago

a13x5 commented 1 day ago

Issue

When changing providers (adding or removing) the helm releases connected to managed clusters are being deleted by management controller and then immediately recreated by managedcluster controller.

Logs:

2024-12-02T20:05:55Z    INFO    Reconciling Management  {"controller": "management", "controllerGroup": "hmc.mirantis.com", "controllerKind": "Management", "Management": {"name":"hmc"}, "namespace": "", "name": "hmc", "reconcileID": "5d0f6d25-dc34-4144-80c5-2d6798e928e9"}
2024-12-02T20:05:55Z    INFO    Found component to remove   {"controller": "management", "controllerGroup": "hmc.mirantis.com", "controllerKind": "Management", "Management": {"name":"hmc"}, "namespace": "", "name": "hmc", "reconcileID": "5d0f6d25-dc34-4144-80c5-2d6798e928e9", "component_name": "vsphere-dev"}
2024-12-02T20:05:55Z    INFO    Removed HelmRelease {"controller": "management", "controllerGroup": "hmc.mirantis.com", "controllerKind": "Management", "Management": {"name":"hmc"}, "namespace": "", "name": "hmc", "reconcileID": "5d0f6d25-dc34-4144-80c5-2d6798e928e9", "reference": "hmc-system/vsphere-dev"}
2024-12-02T20:05:55Z    INFO    Reconciling ManagedCluster  {"controller": "managedcluster", "controllerGroup": "hmc.mirantis.com", "controllerKind": "ManagedCluster", "ManagedCluster": {"name":"vsphere-dev","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "vsphere-dev", "reconcileID": "1046080e-c2a0-437d-ad5a-554072de12c1"}
2024-12-02T20:05:55Z    INFO    Downloading Helm chart  {"controller": "managedcluster", "controllerGroup": "hmc.mirantis.com", "controllerKind": "ManagedCluster", "ManagedCluster": {"name":"vsphere-dev","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "vsphere-dev", "reconcileID": "1046080e-c2a0-437d-ad5a-554072de12c1"}
2024/12/02 20:05:55 [DEBUG] GET http://source-controller.hmc-system.svc.cluster.local./helmchart/hmc-system/vsphere-standalone-cp-0-0-3/vsphere-standalone-cp-0.0.3.tgz
2024-12-02T20:05:55Z    INFO    Initializing Helm client    {"controller": "managedcluster", "controllerGroup": "hmc.mirantis.com", "controllerKind": "ManagedCluster", "ManagedCluster": {"name":"vsphere-dev","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "vsphere-dev", "reconcileID": "1046080e-c2a0-437d-ad5a-554072de12c1"}
2024-12-02T20:05:55Z    INFO    Validating Helm chart with provided values  {"controller": "managedcluster", "controllerGroup": "hmc.mirantis.com", "controllerKind": "ManagedCluster", "ManagedCluster": {"name":"vsphere-dev","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "vsphere-dev", "reconcileID": "1046080e-c2a0-437d-ad5a-554072de12c1"}

Repro steps

  1. Create managed cluster (I created vsphere cluster), ensure that helm release is installed properly and Cluster object is created properly
  2. Edit management cluster and add or remove unused provider (I removed and added cluster-api-provider-aws)
  3. Check helmrelease related to managed cluster. Note that creation time was changed and release was recreated.

Additional note: When installed in airgap environment a number of public cloud controllers were not initialized properly (on purpose) by no passing airgap flag. This caused deletion and recreation happening every 10 seconds for all ManagedClusters. Unfortunately I couldn't reproduce it properly.

Conclusions

Most probably caused by #584 . It looks like selector is too broad and we should avoid deleting helm releases related to managed cluster. Also most probably this will not affect managed clusters created in namespaces other than hmc-system (not tested).

zerospiel commented 22 hours ago

The repro steps suggest that an unused provider has been deleted. According to the mentioned #584 and unmentioned #574 PR where the validation occurs, it is the expected behavior. If the removed provider does not have any clustertemplates or has some that are utilized by none of managedclusters objects, then it is allowed to remove the provider and drop helm releases whose names are not equal to the left components (providers) names (excluding the core's releases). There are integration tests for this particular feature of the management-ctrl.

UPD. The assumption regarding the hmc-system namespace is correct: only the releases in the system namespace are being selected during the removal check, no need to test it.

In particular case: I suppose the vsphere-dev has not been included in the list of the management.providers list, hence it is being removed. There are no labels available on the helm releases installed by the template-ctrl, that could narrow the selector.

At the time of implementation, there was no airgap feature, JIC.

As a suggestion: add the new component to the list of the providers (I believe it is why the section exists in the first place), or the template-ctrl could add extra specific labels during create/update of helmreleases objects, that could be utilized during that removal check in the mgmt-ctrl.

a13x5 commented 18 hours ago

I will disagree. I'm removing AWS provider and my cluster, created with vSphere provider gets deleted.And then recreated. How this could be expected? The removal of provider shouldn't result removal of all clusters in hmc-system namespace (for all providers). This behavior is confusing (to say the least). And adding managedclusters as a providers in the Management spec is even more confusing. If we think that this UX is ok, we should explicitly document that behavior, since it's not standard and it's not something that could be expected.

zerospiel commented 18 hours ago

I do not quite understand what you disagree with. I've checked by myself, and there is the only suggestion left from the already mentioned: add extra labels

a13x5 commented 11 hours ago

@zerospiel Sorry, probably I misread the original message - had an impression that you're telling that this is normal expected behavior.