cybozu-go / coil

CNI plugin for Kubernetes designed for scalability and extensibility
Apache License 2.0
165 stars 20 forks source link

Install CNI Race condition when using multiple cni plugins. #193

Closed Cellebyte closed 2 years ago

Cellebyte commented 2 years ago

Describe the bug When I install an additional cni e.g. linkderd-cni. To accomplish more complex Network functionality.

I get race conditions between the 2 cnis. Either both work or the other cni is broken.

Environments

To Reproduce Steps to reproduce the behavior:

  1. Install coil cni with an default Addresspool
  2. Install linkerd-cni with linkerd proxies.
  3. Restart coild pods in kube-system namespace
  4. See linkerd-cni not working anymore.
  5. Restarting them fixes the problem, again.

Expected behavior Both CNI plugins should coexist. Coil should check if the cni config got updated and only change the parts which are referenced in the initial configmap. It should also subscribe to file changes and do the check accordingly. It should also imply to be the first plugin in the list and only update the parts from the initial configmap.

Additional context For this to be possible there need to be a fundamental change in how the cni configuration gets applied on initialization. So that coil is not bricking other cnis.

We currently need it to accomplish cross region encrypted connectivity between clusters and don't want to use the sidecar approach for it so we looked into a cni approach for a servicemesh and we discovered this race condition bug.

yamatcha commented 2 years ago

@Cellebyte

We try to reproduce the issue following procedure with the environment of coil e2e test. We referenced these documents. https://linkerd.io/2.11/features/cni/#using-the-cli. https://linkerd.io/2.11/getting-started/

$ make start
$ make install-coil
$ kubectl  apply -f manifests/default_pool.yaml
$ linkerd install-cni | kubectl apply -f -
$ linkerd install --linkerd-cni-enabled | kubectl apply -f -
$ curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/emojivoto.yml   | kubectl apply -f -
$ kubectl get -n emojivoto deploy -o yaml   | linkerd inject -   | kubectl apply -f -
$ kubectl rollout restart daemonset -n kube-system coild
$ linkerd check
$ linkerd -n emojivoto check --proxy

How can we confirm “linkerd-cni not working anymore” status?

Coil should check if the cni config got updated and only change the parts which are referenced in the initial configmap. It should also subscribe to file changes and do the check accordingly. It should also imply to be the first plugin in the list and only update the parts from the initial configmap.

We don’t intend to add any more complex function about cni config. The coil-installer can generate from the edited cni file. So you can add the setttings for linkerd to cni file, and let coil-installer generate it. ref: https://github.com/cybozu-go/coil/blob/14c0b64b7ae7f713e62634ca35db156a9666f3ee/docs/setup.md#edit-netconfjson

Also add an option not to use coil-controller if required. And you can generate cni file with linkerd.

Cellebyte commented 2 years ago

@yamatcha I think both ways make sense either adding a disable flag for coil or a disable flag for linkerd 👍 .

Cellebyte commented 2 years ago

@yamatcha I think we need to start also testing again on our side to provide more feedback to you. The general issue was that coil stopped working after restart of the pods when linkerd was installed and that can be fixed by letting one of the two cnis applying the cni config as you already suggested.

yamatcha commented 2 years ago

@Cellebyte The option of not using a coil controller has not yet been implemented. If you need it, let us know and we will try to implement it.