clusternet / clusternet

[CNCF Sandbox Project] Managing your Kubernetes clusters (including public, private, edge, etc.) as easily as visiting the Internet
https://clusternet.io
Apache License 2.0
1.36k stars 197 forks source link

handle helmrelease may uninstall chart that is pending-install or pending-upgrade #592

Open wl-chen opened 1 year ago

wl-chen commented 1 year ago

What happened:

Charts may be uninstalled probabilistically when installing or upgrading.

What you expected to happen:

Charts can not be uninstalled probabilistically when installing or upgrading.

How to reproduce it (as minimally and precisely as possible):

This event is sporadic, but when viewing the source code, the guess may be related to this part of the code https://github.com/clusternet/clusternet/blob/main/pkg/utils/deployer.go#L193. When one of helmreleases that rendered by description install or upgrade failed, both description-controller and helmrelrease-controller will re-execute the handle helmrelease logic. The code just mentioned will mistakenly delete the chart being executed.

When installing helmrelease by clusternet-hub, the above errors will cause helmrelease to generate the following events: an error occurred while uninstalling the release. original install error: services "service-xxx" not found: uninstall: Release not loaded: chart-xxx: release: not found

Anything else we need to know?:

The above is my analysis, I hope you can see if it is correct.

Environment:

dixudx commented 1 year ago

@wl-chen Would you share some detailed information on the issue? Thanks.

wl-chen commented 1 year ago

@wl-chen Would you share some detailed information on the issue? Thanks.

Sorry, I slipped and didn't finish writing and sent it.

dixudx commented 1 year ago

This event is sporadic, but when viewing the source code, the guess may be related to this part of the code https://github.com/clusternet/clusternet/blob/main/pkg/utils/deployer.go#L193. When one of helmreleases that rendered by description install or upgrade failed, both description-controller and helmrelrease-controller will re-execute the handle helmrelease logic. The code just mentioned will mistakenly delete the chart being executed.

Yeah, the helm releases will be re-enqueued and handled by handleHelmRelease in the next round.

https://github.com/clusternet/clusternet/blob/main/pkg/utils/deployer.go#L193 will only uninstall failed releases when atomic is explicitly set to true. I don't know whether setting atomic to false (which is false by default) will help mitigate this problem.

When installing helmrelease by clusternet-hub, the above errors will cause helmrelease to generate the following events: an error occurred while uninstalling the release. original install error: services "service-xxx" not found: uninstall: Release not loaded: chart-xxx: release: not found

I wonder whether ignoring such not found helm release in function UninstallRelease will help solve the problem.

@wl-chen Please feel free to submit a PR if you've got a solution. Thanks.

cc @DanielXLee for this part of uninstalling logics.

wl-chen commented 1 year ago

This event is sporadic, but when viewing the source code, the guess may be related to this part of the code https://github.com/clusternet/clusternet/blob/main/pkg/utils/deployer.go#L193. When one of helmreleases that rendered by description install or upgrade failed, both description-controller and helmrelrease-controller will re-execute the handle helmrelease logic. The code just mentioned will mistakenly delete the chart being executed.

Yeah, the helm releases will be re-enqueued and handled by handleHelmRelease in the next round.

https://github.com/clusternet/clusternet/blob/main/pkg/utils/deployer.go#L193 will only uninstall failed releases when atomic is explicitly set to true. I don't know whether setting atomic to false (which is false by default) will help mitigate this problem.

When installing helmrelease by clusternet-hub, the above errors will cause helmrelease to generate the following events: an error occurred while uninstalling the release. original install error: services "service-xxx" not found: uninstall: Release not loaded: chart-xxx: release: not found

I wonder whether ignoring such not found helm release in function UninstallRelease will help solve the problem.

@wl-chen Please feel free to submit a PR if you've got a solution. Thanks.

cc @DanielXLee for this part of uninstalling logics.

You mean handleHelmRelease will only be executed by helmrelease-controller in the next round ? Will the code https://github.com/clusternet/clusternet/blob/main/pkg/hub/deployer/helm/helm.go#L355 be optimized ?

dixudx commented 1 year ago

You mean handleHelmRelease will only be executed by helmrelease-controller in the next round ?

handleHelmRelease will be executed by deployer in clusternet-agent / clusternet-hub. This depends on the cluster sync mode.

Will the code https://github.com/clusternet/clusternet/blob/main/pkg/hub/deployer/helm/helm.go#L355 be optimized ?

@wl-chen Feel free to make your changes if necessary. You'd better not introduce breaking changes.

DanielXLee commented 1 year ago

@wl-chen you can try to fix it and then verify in our test env. If works well, push your fix to the community.

dixudx commented 1 year ago

@wl-chen Would you please share some updates on this issue? Thanks.