antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

Vanilla k8s operator support and better lifecycle management #5962

Open ColonelBundy opened 7 months ago

ColonelBundy commented 7 months ago

Describe the problem/challenge you have

Currently the operator at https://github.com/vmware/antrea-operator-for-kubernetes only supports OpenShift, and even there the benefits are subpar. Lifecycle management with the operator is error-prone: because of the way the Antrea config <-> AntreaInstall mapping is set up, it is easy to brick your installation with a mistyped or removed field, which causes the Antrea agent to crash and connectivity to be lost. This is not unique to the operator, since it is just a ConfigMap mapping, and it could probably be handled better in the config handling overall. There is also no mechanism to upgrade the controller first and the agent second, which the docs highlight as a potential issue, even though I have not run into it myself yet.

Describe the solution you'd like

Dedicated fields in the AntreaInstall manifest for each option in the agent and controller config, for better validation. Full lifecycle management, e.g. the controller gets updated first and the agent second, plus automatic removal of invalid fields, or fallback to their default values, on upgrade.
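To make the request concrete, here is a minimal sketch of dedicated, typed fields with validation and fallback to defaults. The field names, allowed values, and defaults are illustrative assumptions loosely modelled on agent options; they are not the actual AntreaInstall schema, and `load_agent_config` is a hypothetical helper.

```python
from dataclasses import dataclass

# Hypothetical allowed values per field (an assumption, not the real CRD schema).
ALLOWED = {
    "traffic_encap_mode": {"encap", "noEncap", "hybrid", "networkPolicyOnly"},
    "tunnel_type": {"geneve", "vxlan", "gre", "stt"},
}


@dataclass
class AgentConfig:
    # Defaults are illustrative; every option gets its own typed field.
    traffic_encap_mode: str = "encap"
    tunnel_type: str = "geneve"
    enable_prometheus_metrics: bool = True


def load_agent_config(raw: dict) -> tuple[AgentConfig, list[str]]:
    """Build an AgentConfig from user input without ever crashing.

    Unknown keys are dropped and invalid values fall back to the field's
    default; each such decision is reported as a warning.
    """
    defaults = AgentConfig()
    values, warnings = {}, []
    for key, value in raw.items():
        if not hasattr(defaults, key):
            warnings.append(f"unknown field {key!r} ignored")
            continue
        expected = type(getattr(defaults, key))
        if not isinstance(value, expected):
            warnings.append(f"{key!r}: expected {expected.__name__}, using default")
            continue
        if key in ALLOWED and value not in ALLOWED[key]:
            warnings.append(f"{key!r}: invalid value {value!r}, using default")
            continue
        values[key] = value
    return AgentConfig(**values), warnings
```

With this shape, a typo such as `tunel_type` or a value such as `encapp` would produce a warning and a working default rather than a crashing agent.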

Anything else you would like to add?

My preference would be that, instead of the operator being a wrapper around Antrea in a separate project, it becomes a first-class citizen of this project, with the operator controller from https://github.com/vmware/antrea-operator-for-kubernetes incorporated into this repository's controller. If making this the default behavior would not be desirable for most users, a feature switch to deploy the AntreaInstall CRDs would also be sufficient.

roopeshsn commented 7 months ago

So you need two issues to be addressed:

  • Config validation and
  • Lifecycle management (controller, followed by agent updates)

Though the config validation should be done in the operator repo, right? @ColonelBundy

ColonelBundy commented 7 months ago

So you need two issues to be addressed.

  • Config validation and
  • Lifecycle management (controller, followed by agent updates)

Though the config validation should be done in the operator repo, right? @ColonelBundy

Ideally it should be done in both. But rather than making it more complicated, I would rather see the operator merged into this repo and integrated more tightly with Antrea, and ConfigMaps removed as the config mechanism since they are so error-prone. Not to mention there is no staggered restart of the agents when you update anything in the ConfigMap.
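As a rough illustration of what a staggered agent restart could mean, the sketch below restarts agents in small batches and stops as soon as a batch fails to come back. The `restart` and `is_ready` callbacks and the `max_unavailable` parameter are hypothetical stand-ins; a real operator would call the Kubernetes API and poll readiness with a timeout.

```python
from typing import Callable, Iterator


def batches(nodes: list[str], max_unavailable: int) -> Iterator[list[str]]:
    """Split nodes into consecutive groups of at most max_unavailable."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    for start in range(0, len(nodes), max_unavailable):
        yield nodes[start:start + max_unavailable]


def staggered_restart(
    nodes: list[str],
    max_unavailable: int,
    restart: Callable[[str], None],    # hypothetical: restart the agent on a node
    is_ready: Callable[[str], bool],   # hypothetical: report agent readiness
) -> list[str]:
    """Restart agents batch by batch and return the nodes that were restarted.

    Raises RuntimeError at the first batch that does not become ready, so a
    bad config change only affects one batch instead of the whole cluster.
    """
    done: list[str] = []
    for batch in batches(nodes, max_unavailable):
        for node in batch:
            restart(node)
        failed = [node for node in batch if not is_ready(node)]
        if failed:
            raise RuntimeError(f"agents not ready on {failed}; restarted so far: {done}")
        done.extend(batch)
    return done
```

The point of the design is the early stop: with a batch size of one, a config that crashes the agent takes down a single node's connectivity rather than every node at once.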

Regarding lifecycle management, yes, following the best practices in the docs should be a priority as a start. There are, however, more issues, which I've highlighted here: https://github.com/vmware/antrea-operator-for-kubernetes/issues/86
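The controller-first, agent-second ordering could be sketched roughly as follows. This is only an illustration of the sequencing: `apply` and `wait_ready` are hypothetical callbacks, and nothing here reflects how the existing operator is implemented.

```python
from typing import Callable

# Order discussed in this issue: controller first, agent second.
UPGRADE_ORDER = ["controller", "agent"]


def upgrade(
    target_version: str,
    apply: Callable[[str, str], None],       # hypothetical: roll out component at version
    wait_ready: Callable[[str, str], bool],  # hypothetical: True once the rollout is healthy
) -> list[str]:
    """Upgrade components strictly in order and return those that completed.

    The agent is never touched unless the controller rollout succeeded.
    """
    completed: list[str] = []
    for component in UPGRADE_ORDER:
        apply(component, target_version)
        if not wait_ready(component, target_version):
            raise RuntimeError(
                f"{component} not ready at {target_version}; completed: {completed}"
            )
        completed.append(component)
    return completed
```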

I think there are a lot more than two issues that need to be addressed for this to work without a hitch, but it's a start.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days
