foundation-model-stack / multi-nic-cni

https://foundation-model-stack.github.io/multi-nic-cni/
Apache License 2.0

add deviceMapCache and update interfaces if device not found #153

Closed sunya-ch closed 10 months ago

sunya-ch commented 10 months ago

This PR fixes the issue reported in https://github.com/foundation-model-stack/multi-nic-cni/issues/152. The bug is limited to the host-dedicated CNI, where the network device is moved back and forth between the host and pod namespaces.

The problem can be traced in the daemon log.

There are two situations that can cause the behavior mentioned in the issue.

  1. The master device cannot be resolved from its PCI address by reading the files under /sys/bus/pci/devices.
2023/09/25 00:14:01 resource map: xxx <--- see target mapping pod to pci address
2023/09/25 00:14:01 nameNetMap: xxx <-- see target mapping from device name to network address
2023/09/25 00:14:01 deviceMap: map[] <-- cannot see target mapping from network address to device ID
  2. The dedicated interfaces have not yet been returned to the hostinterface info, although some shared interfaces exist.
2023/09/25 00:14:01 resource map: xxx <--- see target mapping pod to pci address
2023/09/25 00:14:01 nameNetMap: map of some interfaces <-- cannot see mapping of target devices

This PR includes

Log with change:

2023/09/25 05:59:41 GetDeviceMap of xxx
2023/09/25 05:59:41 resource map: xxx
2023/09/25 05:59:41 nameNetMap map: map[ens4:xxx  ens5:xxx]
2023/09/25 05:59:41 set deviceMapCache xxx=yyy
2023/09/25 05:59:41 cannot list address on ens3: <nil>
2023/09/25 05:59:41 updated nameNetMap map: map[<target device>:xxx ens4:xxx ens5:xxx] <--- see update log here
2023/09/25 05:59:41 set deviceMapCache xxx=target device
2023/09/25 05:59:41 deviceMap: map[<target device> net:xxx ]
2023/09/25 05:59:41 GetMultiNicNetwork elapsed: 2804 us
2023/09/25 05:59:41 select by net <target device>
2023/09/25 05:59:41 select by net xxx (ens4)
2023/09/25 05:59:41 select by net xxx (ens5)
2023/09/25 05:59:41 xxx SelectNic elapsed: 98420 us
2023/09/25 05:59:41 return: {[  xxx] [ens4 ens5 <target device>]} <--- see target device added here 
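
The refresh step visible in the log above ("updated nameNetMap map") can be sketched roughly as follows. This is only an illustration under assumptions: `resolveNets` and the `refresh` callback are hypothetical names, and the map shapes (device name to network address, network address to device name) are inferred from the log annotations rather than taken from the PR's code.

```go
package main

import "fmt"

// resolveNets returns a map from network address to device name for the
// target devices. If any target is missing from nameNetMap, refresh is
// called once to re-read the host interfaces before the map is built.
func resolveNets(targets []string, nameNetMap map[string]string, refresh func() map[string]string) map[string]string {
	// Work on a copy so the caller's map is not modified.
	merged := make(map[string]string, len(nameNetMap))
	for name, net := range nameNetMap {
		merged[name] = net
	}

	// Refresh only when at least one target device is not found.
	for _, name := range targets {
		if _, ok := merged[name]; !ok {
			for n, net := range refresh() {
				merged[n] = net
			}
			break
		}
	}

	deviceMap := make(map[string]string)
	for _, name := range targets {
		if net, ok := merged[name]; ok {
			deviceMap[net] = name
		}
	}
	return deviceMap
}

func main() {
	// Addresses and device names are made-up sample values.
	stale := map[string]string{"ens4": "10.0.2.0/24", "ens5": "10.0.3.0/24"}
	refresh := func() map[string]string {
		return map[string]string{"ens3": "10.0.1.0/24"}
	}
	fmt.Println(resolveNets([]string{"ens3"}, stale, refresh))
}
```

Refreshing only on a miss keeps the common path cheap, since the interface list is re-read only when a target device is absent from the current map.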

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>