Mellanox / k8s-rdma-sriov-dev-plugin

Kubernetes RDMA SRIOV device plugin

Everything seems ok but no vhca device in test Pod. #9

flymark2010 opened this issue 6 years ago (status: Open)

flymark2010 commented 6 years ago

Hi, I've run into a problem and have no idea how to fix it. I have several nodes deployed with the rdma sriov device plugin and the sriov cni. Everything went well: pods could communicate with each other via the vhca device, whether or not they were launched on the same node. But one day one of the nodes went bad, and new pods launched on it fail to acquire a vhca device (the pods themselves launch normally and reach the Running phase), even though everything seems ok. I've checked the logs as below:

  1. Checking the rdma/vhca resource on the node:

    # kubectl describe node 10.128.2.30  
    ...
    Capacity:
    cpu:                48
    ephemeral-storage:  52399108Ki
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             131747876Ki
    nvidia.com/gpu:     8
    pods:               110
    rdma/vhca:          8
    Allocatable:
    cpu:                48
    ephemeral-storage:  48291017853
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             131645476Ki
    nvidia.com/gpu:     8
    pods:               110
    rdma/vhca:          8
    ...
  2. When creating a new pod on the node, I get this log from the rdma sriov device plugin:

    2018/08/07 07:33:33 allocate request: &AllocateRequest{ContainerRequests:[&ContainerAllocateRequest{DevicesIDs:[16:5f:e4:4f:a7:28],}],}
    2018/08/07 07:33:33 allocate response:  {[&ContainerAllocateResponse{Envs:map[string]string{},Mounts:[],Devices:[&DeviceSpec{ContainerPath:/dev/infiniband,HostPath:/dev/infiniband,Permissions:rwm,}],Annotations:map[string]string{},}]}
  3. I use test-sriov-pod.yaml to create a test pod. The pod is launched normally and reaches the Running phase, but the network interface is not a vhca device, and no vhca devices are found with show_gids:

    
    # ethtool -i eth0
    driver: veth
    version: 1.0
    firmware-version:
    expansion-rom-version:
    bus-info:
    supports-statistics: yes
    supports-test: no
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: no

    # show_gids
    DEV  PORT  INDEX  GID  IPv4  VER  DEV
    n_gids_found=0

  4. The sriov cni configuration is as below, and it is the only cni on that node:

    {
        "name": "mynet",
        "type": "sriov",
        "if0": "ens5f0",
        "ipam": {
            "type": "host-local",
            "subnet": "10.55.206.0/24",
            "rangeStart": "10.55.206.11",
            "rangeEnd": "10.55.206.19",
            "routes": [
                { "dst": "0.0.0.0/0" }
            ],
            "gateway": "10.55.206.1"
        }
    }


Besides, I found that all the vhca interfaces are in `down` state in the output of `ip a`. I brought them up manually with `ifconfig <eth-name> up`, but nothing changed.
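For what it's worth, the `ip link` equivalent of that (a sketch; the PF name `ens5f0` is taken from the config above, and `<vf-netdev>` is a placeholder for the real VF interface name):

    # show the PF together with the state of its VFs
    ip link show ens5f0
    # bring one VF netdev up and re-check it
    ip link set dev <vf-netdev> up
    ip link show <vf-netdev>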

Thanks for your help!
paravmellanox commented 6 years ago

Hi @flymark2010, I see the output of `ethtool -i eth0`.

The driver is veth instead of mlx5.

It indicates that some other cni, rather than the sriov-cni, provided the eth device. You might want to check whether any other cni config file is taking priority due to lexical ordering, as you faced last time.
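One quick way to verify that (a sketch; /etc/cni/net.d is the default cni config directory, as used above):

    # cni config files are consumed in lexical order; the first file wins
    ls -1 /etc/cni/net.d/ | sort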

flymark2010 commented 6 years ago

I'm sure the calico cni is disabled on that node, and there's only the sriov cni config file in /etc/cni/net.d. That's why I'm confused.

paravmellanox commented 6 years ago

@flymark2010, veth is certainly not a mlx5 driver, so something went wrong there. You can try to unload the veth driver on that host using `rmmod veth` and check that this interface goes away from the Pod.
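As a shell sketch of that suggestion (assuming veth was built as a module; `rmmod` will refuse if any veth device is still in use):

    # check whether the veth module is loaded, then try to remove it
    lsmod | grep veth
    rmmod veth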