foundation-model-stack / multi-nic-cni

https://foundation-model-stack.github.io/multi-nic-cni/
Apache License 2.0

[v1.2.0] mellanox host-device failed to assign multiple times #152

Closed. sunya-ch closed this issue 7 months ago.

sunya-ch commented 10 months ago

Describe the bug

For a host-dedicated device (host-device plugin), creating a pod right after deleting the previous pod sometimes succeeds only after multiple attempts. The failed attempts report a zero config error:

Warning FailedCreatePodSandBox 1s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_testpod1_default_xxx): error adding pod default_testpod1 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [default/testpod1/xxx:multi-nic-network]: error adding container to network "multi-nic-network": zero config &{{0.3.0 multi-nic-network multi-nic map[] {host-device-ipam} {[] [] []} map[] } map[cniVersion:0.3.1 type:host-device] 172.30.0.0/16 [] [ ] [ens5 ens4] false 11000 }

To Reproduce

Steps to reproduce the behavior:

  1. Deploy two MultiNicNetwork resources at the same time: one for RoCE/GDR and one for TCP.
  2. Deploy a pod that uses the RoCE/GDR network, delete it, and redeploy it immediately after the deletion.
