Mellanox / k8s-rdma-sriov-dev-plugin

Kubernetes Rdma SRIOV device plugin
Apache License 2.0

Unable to connect the HCAs through the link #32

Open solielpai opened 3 years ago

solielpai commented 3 years ago

I deployed the RDMA device plugin in HCA mode in a Kubernetes cluster. When I ran a connection test with "ib_write_bw", the output was as follows:

                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet

 local address:  LID 0000 xxx GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
 remote address: LID 0000 xxx GID: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00

The commands I used were simply 'ib_write_bw -d mlx5_0 [target_ip]' and 'ib_read_bw -d mlx5_0'. Could anyone please help with this issue? I appreciate your help.
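The all-zero GIDs together with "Link type : Ethernet" suggest the test never picked up a valid RoCE GID. One way to inspect what the device actually exposes from inside the pod, as a sketch that assumes the container image ships the usual rdma-core/MOFED tools (mlx5_0 is the device from the output above):

# Link layer per port: "InfiniBand" or "Ethernet" (i.e. RoCE)
ibv_devinfo -d mlx5_0 | grep -i link_layer

# GID table with RoCE version and backing netdev/IP (MOFED's show_gids helper)
show_gids mlx5_0

# Fallback without show_gids: read the raw GID table from sysfs
grep -H . /sys/class/infiniband/mlx5_0/ports/1/gids/* 2>/dev/null

For RoCE (Ethernet link type), perftest is usually run either with rdma_cm (-R) or with an explicit GID index (-x), as suggested further down in this thread.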

goversion commented 3 years ago

I'm hitting the same problem. Did you ever manage to solve it? @solielpai

huide9 commented 1 year ago

+1 I'm using ConnectX-5. All ib_* commands work fine for host-to-host communication, but fail between containers.

heshengkai commented 1 year ago

@huide9 @solielpai @goversion +1 I'm using ConnectX-5. All ib_* commands work fine for host-to-host communication but fail between containers. My k8s cluster's network plugin is Calico. The problems occur when the InfiniBand card runs in Ethernet mode; when the card runs in IB mode, everything works properly.
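For reference, the port protocol on ConnectX adapters can be queried and switched between IB and Ethernet with mlxconfig from the Mellanox Firmware Tools; a sketch (the PCI address is illustrative, and the change only takes effect after a firmware reset or reboot):

# Query the current port protocol (LINK_TYPE_P1: IB(1) or ETH(2))
mlxconfig -d 0000:3b:00.0 query | grep LINK_TYPE

# Switch port 1 to InfiniBand (use =2 for Ethernet), then reset the firmware or reboot
mlxconfig -d 0000:3b:00.0 set LINK_TYPE_P1=1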

noama-nv commented 1 year ago

Link type is Ethernet:
Server: ib_write_bw -d [RDMA_DEVICE] -F -R --report_gbits
Client: ib_write_bw -d [RDMA_DEVICE] [SERVER_IP] -F -R --report_gbits

Link type is IB:
Server: ib_write_bw -d [RDMA_DEVICE] -F --report_gbits
Client: ib_write_bw -d [RDMA_DEVICE] [SERVER_IP] -F --report_gbits
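Run from outside the pods, that translates to something like the following sketch (pod names and placeholders are illustrative; start the server side first and leave it running):

# Server pod (start first)
kubectl exec -it <server-pod> -- ib_write_bw -d <RDMA_DEVICE> -F -R --report_gbits

# Client pod, pointing at the server pod's IP (drop -R on both sides for IB link type)
kubectl exec -it <client-pod> -- ib_write_bw -d <RDMA_DEVICE> <SERVER_POD_IP> -F -R --report_gbits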

heshengkai commented 1 year ago

Hi @krembu

Server:

[root@mofed-test-cx6-pod-1 /]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1440
        inet 172.16.62.144  netmask 255.255.255.255  broadcast 0.0.0.0
        ether 36:bd:a0:74:97:5b  txqueuelen 0  (Ethernet)
        RX packets 26  bytes 2068 (2.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19  bytes 1490 (1.4 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@mofed-test-cx6-pod-1 /]# ib_write_bw -d mlx5_2 -F -R --report_gbits

Client:

[root@mofed-test-cx6-pod-2 /]# ib_write_bw -d mlx5_2 172.16.62.144 -F -R --report_gbits
 Received 10 times ADDR_ERROR
 Unable to perform rdma_client function
 Unable to init the socket connection
[root@mofed-test-cx6-pod-2 /]# ping 172.16.62.144
PING 172.16.62.144 (172.16.62.144) 56(84) bytes of data.
64 bytes from 172.16.62.144: icmp_seq=1 ttl=63 time=0.070 ms
64 bytes from 172.16.62.144: icmp_seq=2 ttl=63 time=0.047 ms
64 bytes from 172.16.62.144: icmp_seq=3 ttl=63 time=0.047 ms
^C
--- 172.16.62.144 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2048ms
rtt min/avg/max/mdev = 0.047/0.054/0.070/0.013 ms
[root@mofed-test-cx6-pod-2 /]# ib_write_bw -d mlx5_2 172.16.62.144 -F -R --report_gbits
 Received 10 times ADDR_ERROR
 Unable to perform rdma_client function
 Unable to init the socket connection
[root@mofed-test-cx6-pod-2 /]# ib_write_bw -d mlx5_2 172.16.62.144 -F -R --report_gbits
 Received 10 times ADDR_ERROR
 Unable to perform rdma_client function
 Unable to init the socket connection
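The ADDR_ERROR comes from rdma_cm address resolution: with -R, ib_write_bw uses rdma_cm, which requires the IP and interface used for the connection to be backed by an entry in the RDMA device's GID table. A Calico-managed pod interface typically is not associated with the ConnectX device, while a macvlan interface created directly on its uplink is, which would fit the Calico-vs-macvlan behaviour described below. A quick check from inside either pod, as a sketch assuming MOFED's show_gids script is available:

# One line per GID entry: device, port, index, GID, IPv4, RoCE version, backing netdev
show_gids mlx5_2

# The pod's own IP should appear in the IPv4 column for rdma_cm to work over that IP
show_gids mlx5_2 | grep -q 172.16.62.144 && echo "GID present" || echo "pod IP not in GID table"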

noama-nv commented 1 year ago

Can you share the pod spec and MacvlanNetwork?

heshengkai commented 1 year ago

When the link type is IB, both the Calico and macvlan CNIs work properly. When the link type is Ethernet, the test fails with the Calico CNI but works with the macvlan CNI.
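For comparison, the macvlan setup that works in the Ethernet case is usually a Multus NetworkAttachmentDefinition pinned to the ConnectX uplink, roughly like the sketch below (the master interface name, IPAM type, and address range are assumptions about the environment):

cat <<'EOF' | kubectl apply -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-macvlan
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "ens2f0",
    "mode": "bridge",
    "ipam": {
      "type": "host-local",
      "subnet": "192.168.100.0/24",
      "rangeStart": "192.168.100.10",
      "rangeEnd": "192.168.100.60"
    }
  }'
EOF

Pods then attach to it with the annotation k8s.v1.cni.cncf.io/networks: rdma-macvlan and run the RDMA test over the address assigned on that interface rather than over the Calico one.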

noama-nv commented 1 year ago

Sorry for the trouble getting this working. This project is deprecated; you can use https://github.com/mellanox/k8s-rdma-shared-dev-plugin or https://docs.nvidia.com/networking/display/COKAN10/Network+Operator instead.

heshengkai commented 1 year ago

https://github.com/mellanox/k8s-rdma-shared-dev-plugin is what I'm already using.

heshengkai commented 1 year ago

I'm using the image mellanox/k8s-rdma-shared-dev-plugin.
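In that case a minimal test pod against the shared plugin looks roughly like the sketch below (the resource name rdma/hca_shared_devices_a follows that project's README example and must match the plugin's ConfigMap; the image is illustrative):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: rdma-shared-test-pod
spec:
  containers:
  - name: rdma-test
    image: mellanox/rping-test        # illustrative; any image with perftest/rdma-core tools works
    command: ["sh", "-c", "sleep infinity"]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]             # RDMA apps need to pin (mlock) registered memory
    resources:
      limits:
        rdma/hca_shared_devices_a: 1
EOF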

noama-nv commented 1 year ago

This issue is filed against https://github.com/Mellanox/k8s-rdma-sriov-dev-plugin.

heshengkai commented 1 year ago

@krembu Thank you for your reply

wwj-2017-1117 commented 1 year ago

@huide9 We're hitting the same problem.