SoftRoCE / rxe-dev

Development Repository for RXE

rxe failed connectivity test #51

Open mcfatealan opened 8 years ago

mcfatealan commented 8 years ago

Hi @monis410, I'm an RDMA beginner. I've run into a problem very similar to a previous issue (https://github.com/SoftRoCE/rxe-dev/issues/49).

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_devices
libibverbs: Warning: couldn't load driver 'rxe': librxe-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
    device                 node GUID
    ------              ----------------

I worked around it by moving /usr/lib64/* to /usr/lib/. But after that I had problems with the connectivity tests.
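As an aside, moving everything out of /usr/lib64 clobbers the directory layout. A less destructive workaround might be to point the dynamic loader at /usr/lib64 instead (a sketch, assuming librxe-rdmav2.so was installed there and that libibverbs dlopens it by name through the default search path; the config file name is my own choice):

```shell
# Sketch: make the loader aware of /usr/lib64 instead of moving files.
# Assumption: librxe-rdmav2.so lives under /usr/lib64.
echo /usr/lib64 | sudo tee /etc/ld.so.conf.d/librxe.conf
sudo ldconfig        # refresh the loader cache
ibv_devices          # the "couldn't load driver 'rxe'" warning should be gone
```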

My OS is Ubuntu 16.04 LTS (4.7.0-rc3+).

Some of my test results:

mcfatealan@mcfatealan-desktop:~/librxe-dev$ sudo rxe_cfg start 
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
  Name    Link  Driver  Speed  NMTU  IPv4_addr      RDEV  RMTU          
  enp5s0  yes   r8169          1500  192.168.10.19  rxe0  1024  (3)  

mcfatealan@mcfatealan-desktop:~/librxe-dev$ lsmod | grep rxe
rdma_rxe              102400  0
ip6_udp_tunnel         16384  1 rdma_rxe
udp_tunnel             16384  1 rdma_rxe
ib_core               208896  6 rdma_cm,ib_cm,iw_cm,ib_uverbs,rdma_rxe,rdma_ucm

mcfatealan@mcfatealan-desktop:~/librxe-dev$ lsmod | grep ib_uverbs
ib_uverbs              61440  1 rdma_ucm
ib_core               208896  6 rdma_cm,ib_cm,iw_cm,ib_uverbs,rdma_rxe,rdma_ucm

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_devices
    device                 node GUID
    ------              ----------------
    rxe0                be5ff4fffe3acd36

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_devinfo -d rxe0
hca_id: rxe0
    transport:          InfiniBand (0)
    fw_ver:             0.0.0
    node_guid:          be5f:f4ff:fe3a:cd36
    sys_image_guid:         0000:0000:0000:0000
    vendor_id:          0x0000
    vendor_part_id:         0
    hw_ver:             0x0
    phys_port_cnt:          1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet

Then I tested connectivity both on a single machine (self-to-self) and between one physical machine and a virtual machine. The machines can ping each other, so basic network connectivity is fine. The test results are exactly the same in both cases.

server:
mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 0 -d rxe0 -i 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0x2b7bf6, GID fe80::be5f:f4ff:fe3a:cd36
  remote address: LID 0x0000, QPN 0x000012, PSN 0x4255a9, GID fe80::be5f:f4ff:fe3a:cd36
//hanging...

client:
mcfatealan@mcfatealan-desktop:~$ ibv_rc_pingpong -g 0 -d rxe0 -i 1 192.168.10.19
  local address:  LID 0x0000, QPN 0x000012, PSN 0x4255a9, GID fe80::be5f:f4ff:fe3a:cd36
  remote address: LID 0x0000, QPN 0x000011, PSN 0x2b7bf6, GID fe80::be5f:f4ff:fe3a:cd36
//hanging...

server:
mcfatealan@mcfatealan-desktop:~/librxe-dev$ rping -s -a 192.168.10.19 -v -C 10
//hanging...

client:
mcfatealan@mcfatealan-desktop:~/librxe-dev$ rping -c -a 192.168.10.19 -v -C 10
//hanging...

Could you help me take a look at it? Thank you so much!

BTW, could Soft RoCE work with python-rdma (https://github.com/jgunthorpe/python-rdma)? I tested that too and it failed; I'm not sure whether the two problems share the same root cause.

yonatanco commented 8 years ago

Hello. I know this issue; it's a race condition that we recently fixed. I sent a fix for 4.8-rc5. You can work with the upstream kernel instead of GitHub to stay up to date.

BTW: are you working with a Mellanox HCA, or an Ethernet NIC like Intel or Broadcom?

mcfatealan commented 8 years ago

Hi @yonatanco, thanks so much for responding! I'm trying 4.8-rc5 now; I'll send you my feedback later.

BTW, here's my hardware info:

mcfatealan@mcfatealan-desktop:~$ lspci | grep 'Ethernet'
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

mcfatealan commented 8 years ago

Oops, I still got the same problem:

mcfatealan@mcfatealan-desktop:~$ uname -a
Linux mcfatealan-desktop 4.8.0-rc5 #1 SMP Mon Sep 12 14:12:15 CST 2016 x86_64 x86_64 x86_64 GNU/Linux

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 0 -d rxe0 -i 1
  local address:  LID 0x0000, QPN 0x000011, PSN 0xeefb20, GID fe80::be5f:f4ff:fe3a:cd36
  remote address: LID 0x0000, QPN 0x000012, PSN 0x365328, GID fe80::be5f:f4ff:fe3a:cd36
//Hanging..

mcfatealan@mcfatealan-desktop:~$ ibv_rc_pingpong -g 0 -d rxe0 -i 1 192.168.10.19
  local address:  LID 0x0000, QPN 0x000012, PSN 0x365328, GID fe80::be5f:f4ff:fe3a:cd36
  remote address: LID 0x0000, QPN 0x000011, PSN 0xeefb20, GID fe80::be5f:f4ff:fe3a:cd36
//Hanging..

yonatanco commented 8 years ago

You are using GID index 0; try with GID index 1:

ibv_rc_pingpong -g 1 -d rxe0 -i 1
ibv_rc_pingpong -g 1 -d rxe0 -i 1 192.168.10.19

mcfatealan commented 8 years ago

Thanks for the reminder, @yonatanco. The result stays the same:

mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 1 -d rxe0 -i 1
  local address:  LID 0x0000, QPN 0x000012, PSN 0x5e5383, GID ::ffff:192.168.10.19
  remote address: LID 0x0000, QPN 0x000013, PSN 0x4c0dd8, GID ::ffff:192.168.10.19

mcfatealan@mcfatealan-desktop:~$ ibv_rc_pingpong -g 1 -d rxe0 -i 1 192.168.10.19 
  local address:  LID 0x0000, QPN 0x000013, PSN 0x4c0dd8, GID ::ffff:192.168.10.19
  remote address: LID 0x0000, QPN 0x000012, PSN 0x5e5383, GID ::ffff:192.168.10.19
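The two GID indices seen in these outputs have a predictable shape: index 0 is an IPv6 link-local GID derived from the NIC's MAC address via modified EUI-64, and index 1 is the IPv4-mapped form of the interface address. A small sketch reproduces both from the values in this thread (the helper names are mine, not part of any tool, and the MAC is the one implied by the GUID above):

```shell
# Derive the link-local GID (modified EUI-64) from a MAC address.
# Helper names are hypothetical, for illustration only.
mac_to_gid() {
    # Split the MAC into its six bytes.
    set -- $(echo "$1" | tr ':' ' ')
    # Flip the universal/local bit of the first byte, insert ff:fe in
    # the middle, and prefix with the link-local fe80:: prefix.
    b1=$(printf '%02x' $(( 0x$1 ^ 0x02 )))
    printf 'fe80::%s%s:%sff:fe%s:%s%s\n' "$b1" "$2" "$3" "$4" "$5" "$6"
}

# GID index 1 is simply the IPv4-mapped IPv6 form of the interface address.
ipv4_to_gid() { printf '::ffff:%s\n' "$1"; }

mac_to_gid bc:5f:f4:3a:cd:36   # → fe80::be5f:f4ff:fe3a:cd36 (index 0 above)
ipv4_to_gid 192.168.10.19      # → ::ffff:192.168.10.19      (index 1 above)
```

Which index works depends on which address family the peer resolves; with rxe the IPv4-mapped entry (index 1) is the one that matches a plain IPv4 destination.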

yonatanco commented 8 years ago

Are you trying to ping using the same host? Loopback?

mcfatealan commented 8 years ago

@yonatanco, sorry for the late reply; I didn't receive a notification on the main page. I'm wondering if there's any issue with rping-ing myself? I could do that on some of my other machines equipped with an RNIC. I'm new to RDMA, so maybe I'm being silly here...

anthonyliubin commented 7 years ago

Hello @yonatanco @mcfatealan, I am testing SoftRoCE right now and have met the same issue as you described. First I used rxe-dev-master with kernel 4.0.0, but when I ran rping I got RDMA_CM_EVENT_ADDR_ERROR. Then I switched to rxe-dev-rxe_submission_v18 with kernel 4.7.0-rc3. rping could run, but it hung: the server never received the RDMA_CM_EVENT_CONNECT_REQUEST event, so the rping server side blocked in sem_wait(). I did this on a single VM, loopback testing.

Since you mentioned the loopback issue, I also tested between two machines: one physical Linux PC and one VM (NAT connection). We ran the rping server on the PC, but when the client ran on the VM, the server crashed with no response to any action.

You said that you tried 4.8-rc5. How did you get there: using the rxe-dev branch, or just upgrading the kernel? I want to continue testing, thanks!

Best Regards Anthony

mcfatealan commented 7 years ago

Hi @anthonyliubin, I'm sorry to hear that you've had the same issue. Unluckily, I still didn't pass the test in the end. My goal was to find a temporary way to test my RDMA code before our server was fixed, but the time spent on this project exceeded what I could afford, so I had to give up. Still, I'd like to thank @yonatanco for all of his help!

About 4.8-rc5, I just upgraded my kernel.

It's kinda embarrassing that my answer might not provide any help. Anyway, that's all I know. Hope for the best!

anthonyliubin commented 7 years ago

hi, @mcfatealan Thanks for your response. I have a question: if we do not use the rxe-dev branch and just upgrade the kernel, how do we keep the rxe driver in the new kernel? My understanding was that a freshly compiled kernel would not include rxe. Do we need to port rxe ourselves? That would be a lot of work. A simple explanation of how to upgrade would help us a lot! Thanks.

Best Regards Anthony

mcfatealan commented 7 years ago

I'm not 100% sure since it's been a while, but according to @yonatanco's description, it seems that rxe is already included in 4.8.0? I suggest you give it a try :)

anthonyliubin commented 7 years ago

hi, @mcfatealan Thanks for your help. rxe is indeed already included as of 4.8-rc5. We have tested this case on 4.7 and 4.8-rc5, and both results are OK now (ibv_rc_pingpong and rping). In our testing we needed two PCs with bridged networking (no NAT, if using VMs), and we cleared the iptables rules first. ibv_rc_pingpong needs GID index 1, and loopback testing is not supported.
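For anyone following along, the working configuration described above boils down to a short sequence (a sketch only; the interface name and address are the ones from this thread, and rxe_cfg comes from the librxe userspace package used here):

```shell
# On both hosts (bridged network, no NAT between them):
sudo iptables -F              # clear filter rules that may drop the RoCE UDP traffic
sudo rxe_cfg start
sudo rxe_cfg add enp5s0       # bind an rxe device to the Ethernet interface

# Server side:
ibv_rc_pingpong -g 1 -d rxe0 -i 1

# Client side (GID index 1; loopback/self-test does not work):
ibv_rc_pingpong -g 1 -d rxe0 -i 1 192.168.10.19
```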

Best Regards Anthony

mcfatealan commented 7 years ago

@anthonyliubin Congrats, so glad to hear that :) The points you mentioned are very helpful. Maybe I will test again sometime, following your experience.

oTTer-Chief commented 7 years ago

Hi, I'm trying to get rxe running on Debian 8.7 with kernel 4.8.15 (rdma_rxe version 0.2) and face exactly the same issues. Neither rping nor ibv_rc_pingpong send any data if both ends run on the same machine.

In case of rping I get:

hutter@cbm01:~$ tail -n1 /var/log/messages
Jan 25 13:28:05 cbm01 kernel: [54644.488129] detected loopback device

Should this work loopback/on the same machine or is this unsupported?

anthonyliubin commented 7 years ago

hi, @oTTer-Chief

In my testing it did not work loopback/on the same machine. You could try it with two PCs.

Best Regards Anthony

oTTer-Chief commented 7 years ago

Hi @anthonyliubin ,

I tried testing between 2 VMs and that worked. Nevertheless, I wonder whether loopback is intended to work and there is an error in my setup, or whether loopback is explicitly unsupported. With real RDMA hardware like InfiniBand I am able to send to the same machine, so I would assume the software implementation can do this too.

Hunter21007 commented 7 years ago

Hi all,

Communication with the same machine is also required by the GlusterFS RDMA transport... (which I was not able to get working with Linux 4.9).

Peng-git-hub commented 7 years ago

You may try this. First, make sure that messages can pass through the firewall:

iptables -F
iptables -t mangle -F

Then add the IP addresses of both server and client to the "trusted" zone:

firewall-cmd --zone=trusted --add-source=1.1.1.1 --permanent
firewall-cmd --zone=trusted --add-source=1.1.1.2 --permanent

Hunter21007 commented 7 years ago

Is this necessary even if the firewall is disabled?

Peng-git-hub commented 7 years ago

The default firewall rule rejects unknown connections, so a direct test will be rejected by the remote firewall.

byronyi commented 7 years ago

Any updates? It seems the loopback interface is not functioning for RDMA CM, which is crucial for testing and local development.

monis410 commented 7 years ago

RXE project maintenance on GitHub has stopped. You should move to the upstream Linux kernel for the kernel module and to rdma-core (https://github.com/linux-rdma/rdma-core) for the userspace library to get the latest features and bug fixes. Note that some of the bugs you hit may have fixes in drivers/infiniband/core (which means they are common to all InfiniBand providers).
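A minimal sketch of that migration path, assuming a kernel built with CONFIG_RDMA_RXE and the iproute2 `rdma` tool (which replaced rxe_cfg in later stacks; the netdev name is the one from this thread):

```shell
# Build the userspace library from upstream rdma-core.
git clone https://github.com/linux-rdma/rdma-core.git
cd rdma-core
bash build.sh                 # builds providers and tools into ./build

# Load the in-tree driver and create an rxe device over the NIC.
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev enp5s0
rdma link show                # the new rxe0 link should be listed
```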

byronyi commented 7 years ago

Thanks for your comment!

githubfoam commented 6 years ago

@Hunter21007 I also tried the GlusterFS RDMA transport with kernel 4.9.0. What do you mean by "same machine"? I have 2 VMs, each with 2 NICs: one NAT and one host-only. Did you get the GlusterFS RDMA transport running with Soft-RoCE?

byronyi commented 6 years ago

@githubfoam According to my inquiry on the linux-rdma mailing list, several RXE bugs were fixed in 4.9/4.10/4.11, and it is suggested that you upgrade to 4.14/4.15 (e.g. Ubuntu 18.04 or Debian unstable). If the problem persists, let us know.

Hunter21007 commented 6 years ago

@githubfoam Host-only means the GlusterFS server and client on the same machine via 127.0.0.1. No, I was not able to make it work, and now it is out of scope anyway because GlusterFS RDMA support was dropped, so this is no longer relevant.

githubfoam commented 6 years ago

@Hunter21007 Could you provide a link that shows GlusterFS RDMA support was dropped? The docs here suggest two links, but both lead nowhere: https://docs.gluster.org/en/v3/Administrator%20Guide/RDMA%20Transport/ Normally I build two servers NAT-ed onto the same network. The GlusterFS server/client works over TCP, but the RDMA transport does not.

githubfoam commented 6 years ago

@byronyi I tried what's suggested on this page: https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home My nodes run Ubuntu 16.04.4 LTS with kernel 4.7.0-rc3+ after installing the kernel and userspace pieces, and I can't play ping-pong; the rxe testing fails. I don't see how to upgrade to the 4.14/4.15 kernels; with those steps the kernel only goes from 4.4.0-116-generic to 4.7.0-rc3+.

One of the contributors says this GitHub repo is not maintained anymore and suggests the "upstream kernel + rdma-core" method, which is this link, so I started trying it: https://community.mellanox.com/docs/DOC-2184 My nodes are now Ubuntu 16.04.4 LTS with kernel 4.17.0-rc6 after the kernel/rdma-core installations. The problem is that there are missing steps, like "sudo make install". At the bottom of that page someone else tried it and their steps are different.

lalith-b commented 3 years ago

I am able to do ping-pong with rxe, but rdma_cm fails when it comes to GlusterFS RDMA support: port 24008 is never opened because rdma_cm fails with [No Device Found].

githubfoam commented 3 years ago

@lalith-b If you read the whole thread, GlusterFS RDMA support was dropped at that time. If you have information that says otherwise, could you please share it? The point where I left off was that TCP worked but RDMA did not.