Open mcfatealan opened 8 years ago
Hello. I know this issue; it's a race condition that we recently fixed. I sent a fix for 4.8-rc5. You can work with upstream instead of GitHub to stay up to date.
BTW: are you working with a Mellanox HCA, or with an Ethernet NIC like Intel or Broadcom?
Hi @yonatanco, thanks so much for responding! I'm trying 4.8-rc5 now; I'll send you my feedback later.
BTW, here's my hardware info:
mcfatealan@mcfatealan-desktop:~$ lspci | grep 'Ethernet'
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
Oops, I still get the same problem:
mcfatealan@mcfatealan-desktop:~$ uname -a
Linux mcfatealan-desktop 4.8.0-rc5 #1 SMP Mon Sep 12 14:12:15 CST 2016 x86_64 x86_64 x86_64 GNU/Linux
mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 0 -d rxe0 -i 1
local address: LID 0x0000, QPN 0x000011, PSN 0xeefb20, GID fe80::be5f:f4ff:fe3a:cd36
remote address: LID 0x0000, QPN 0x000012, PSN 0x365328, GID fe80::be5f:f4ff:fe3a:cd36
//Hanging..
mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 0 -d rxe0 -i 1
local address: LID 0x0000, QPN 0x000011, PSN 0xeefb20, GID fe80::be5f:f4ff:fe3a:cd36
remote address: LID 0x0000, QPN 0x000012, PSN 0x365328, GID fe80::be5f:f4ff:fe3a:cd36
//Hanging..
mcfatealan@mcfatealan-desktop:~$ ibv_rc_pingpong -g 0 -d rxe0 -i 1 192.168.10.19
local address: LID 0x0000, QPN 0x000012, PSN 0x365328, GID fe80::be5f:f4ff:fe3a:cd36
remote address: LID 0x0000, QPN 0x000011, PSN 0xeefb20, GID fe80::be5f:f4ff:fe3a:cd36
//Hanging..
You are using GID 0. Try with GID 1:
ibv_rc_pingpong -g 1 -d rxe0 -i 1
ibv_rc_pingpong -g 1 -d rxe0 -i 1 192.168.10.19
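To see why GID index 1 is the right one, the device's GID table can be inspected directly; this is a sketch assuming the standard sysfs layout for RDMA devices and the rxe0 device name used in this thread:

```shell
# List the GID table for port 1 of the rxe0 device:
for g in /sys/class/infiniband/rxe0/ports/1/gids/*; do
    echo "$g: $(cat "$g")"
done
# Index 0 is typically the link-local (fe80::...) GID; index 1 is the
# IPv4-mapped (::ffff:a.b.c.d) GID derived from the interface's IP address.
```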
Thanks for the reminder, @yonatanco. The result stays the same:
mcfatealan@mcfatealan-desktop:~/librxe-dev$ ibv_rc_pingpong -g 1 -d rxe0 -i 1
local address: LID 0x0000, QPN 0x000012, PSN 0x5e5383, GID ::ffff:192.168.10.19
remote address: LID 0x0000, QPN 0x000013, PSN 0x4c0dd8, GID ::ffff:192.168.10.19
mcfatealan@mcfatealan-desktop:~$ ibv_rc_pingpong -g 1 -d rxe0 -i 1 192.168.10.19
local address: LID 0x0000, QPN 0x000013, PSN 0x4c0dd8, GID ::ffff:192.168.10.19
remote address: LID 0x0000, QPN 0x000012, PSN 0x5e5383, GID ::ffff:192.168.10.19
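The `::ffff:192.168.10.19` GID shown at index 1 above is simply the interface's IPv4 address embedded in an IPv4-mapped IPv6 address; a quick shell check (using a hard-coded copy of the GID from the output above) makes the relationship visible:

```shell
# The GID reported at index 1 by ibv_rc_pingpong above:
gid="::ffff:192.168.10.19"

# Strip everything up to the last colon to recover the embedded IPv4 address:
echo "${gid##*:}"    # prints 192.168.10.19
```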
Are you trying to ping using the same host? Loopback?
@yonatanco, sorry for the late reply; I didn't receive a notification on the main page. I'm wondering if there's any issue with rping-ing myself? I can do that on some of my other machines equipped with an RNIC. I'm new to RDMA, so maybe I'm being silly here...
Hello @yonatanco @mcfatealan, I am testing SoftRoCE right now and have met the same issue as you described. First I used rxe-dev-master on kernel 4.0.0, but when I ran rping I got RDMA_CM_EVENT_ADDR_ERROR. Then I switched to rxe-dev-rxe_submission_v18 on kernel 4.7.0-rc3. rping could run, but it hung: the server never received the RDMA_CM_EVENT_CONNECT_REQUEST event, so the rping server side blocked in sem_wait(). I did this on a single VM, as a loopback test.
Since you mentioned the loopback issue, I also tested between two PCs: one Linux host and one VM (NAT connection). We ran the rping server on the PC, but when the client ran on the VM, the server crashed and stopped responding to any action.
You said that you have tried 4.8-rc5. I want to know how you achieved that: did you use the rxe-dev branch, or just upgrade the kernel? I want to continue testing. Thanks!
Best Regards Anthony
Hi @anthonyliubin, I'm sorry to hear that you've hit the same issue. The thing is that, unluckily, I still didn't pass the test in the end. My goal was to find a temporary way to test my RDMA code before our server was fixed; the time spent on this project exceeded my limit, so I had to give up. But I'd still like to thank @yonatanco for all his help!
About 4.8-rc5, I just upgraded my kernel.
It's kind of embarrassing that my answer might not be of any help. Anyway, that's all I know. Hope for the best!
Hi @mcfatealan, thanks for your response. I have a question: if we do not use the rxe-dev branch and just upgrade the kernel, how do we keep the rxe package in the new kernel? To my mind, if we compile a new kernel it does not include rxe. Do we need to port rxe? That would be a lot of work. If you could give a simple explanation of how you upgraded, it would help us a lot! Thanks.
Best Regards Anthony
I'm not 100% sure since it's been a while, but according to the description from @yonatanco, it seems that rxe is already included in 4.8.0. I suggest you give it a try :)
Hi @mcfatealan, thanks for your help. rxe is indeed already included in 4.8-rc5. We have tested this case on 4.7 and 4.8-rc5, and both results are OK now (ibv_rc_pingpong and rping). In our testing we needed two PCs with a bridged connection (no NAT if using a VM) and the iptables rules cleared first. ibv_rc_pingpong needs GID 1, and loopback testing is not supported.
Best Regards Anthony
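The working setup above can be sketched as a shell session; the commands are the ones used earlier in this thread, and the flush rules are an assumption about how "iptables rules cleared" was done:

```shell
# On both hosts (bridged network, no NAT), flush firewall rules so the
# RoCE UDP traffic is not dropped (assumption: this is what "clear
# iptables rules at first" refers to):
iptables -F
iptables -t mangle -F

# On the server (e.g. 192.168.10.19): use GID index 1, the IPv4-mapped GID.
ibv_rc_pingpong -g 1 -d rxe0 -i 1

# On the client, a *different* machine (loopback is not supported):
ibv_rc_pingpong -g 1 -d rxe0 -i 1 192.168.10.19
```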
@anthonyliubin Congrats! So glad to hear that :) The points you mention are very helpful; maybe I will test again next time following your experience.
Hi, I'm trying to get rxe running on Debian 8.7 with kernel 4.8.15 (rdma_rxe version 0.2) and face exactly the same issues: neither rping nor ibv_rc_pingpong sends data if both ends run on the same machine.
In case of rping I get:
hutter@cbm01:~$ tail -n1 /var/log/messages
Jan 25 13:28:05 cbm01 kernel: [54644.488129] detected loopback device
Should this work loopback/on the same machine, or is that unsupported?
hi, @oTTer-Chief
In my testing it did not work loopback/on the same machine. You could try it across two PCs.
Best Regards Anthony
Hi @anthonyliubin ,
I tried testing between two VMs and that worked. Nevertheless, I wonder whether loopback is intended to work and there is an error in my setup, or whether loopback is explicitly unsupported. With real RDMA hardware like InfiniBand I am able to send to the same machine, so I would assume the software implementation can do this as well.
Hi all,
Communication with the same machine is also required by the GlusterFS RDMA transport... (which I was not able to get working with Linux 4.9).
You may try this. First, make sure messages can pass through the firewall:
iptables -F; iptables -t mangle -F
Then add the IP addresses of both server and client to the "trusted" zone:
firewall-cmd --zone=trusted --add-source=1.1.1.1 --permanent
firewall-cmd --zone=trusted --add-source=1.1.1.2 --permanent
Is this necessary even if the firewall is disabled?
The default firewall rule rejects unknown connections, so a direct test will be rejected by the remote firewall.
Any updates? It seems the loopback interface is not functioning for RDMA CM, which is crucial for testing and local development.
The RXE project is no longer maintained on GitHub. You should move to the upstream Linux kernel for the kernel module and to rdma-core (https://github.com/linux-rdma/rdma-core) for the userspace library to get the latest features and bug fixes. Note that some of the bugs you hit may have fixes in drivers/infiniband/core (which means they are common to all InfiniBand providers).
Thanks for your comment!
@Hunter21007 I also tried the GlusterFS RDMA transport with kernel 4.9.0. What do you mean by "same machine"? I have 2 VMs with 2 NICs each: 1 NAT and 1 host-only. Did you get the GlusterFS RDMA transport running with Soft-RoCE?
@githubfoam According to my inquiry on the linux-rdma mailing list, several RXE bugs were fixed in 4.9/4.10/4.11, and the suggestion is to upgrade to 4.14/4.15 (e.g. Ubuntu 18.04 or Debian unstable). If the problem persists, let us know.
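Before re-testing on an upgraded kernel, it is worth confirming that the kernel actually ships the in-tree driver. A sketch, assuming the module name rdma_rxe used since rxe was merged upstream in 4.8 (loading it requires root):

```shell
uname -r                 # kernel version: should be 4.14/4.15 or newer
modinfo rdma_rxe         # succeeds only if the kernel provides the rxe module
sudo modprobe rdma_rxe   # load the soft-RoCE driver
```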
@githubfoam "Host only" means the GlusterFS server and client on the same machine via 127.0.0.1. No, I was not able to make it work, and it is now out of scope: GlusterFS RDMA support was dropped, so this is no longer relevant.
@Hunter21007 Could you provide a link showing that glusterfs-rdma support has been dropped? The page here suggests two links, but both lead nowhere: https://docs.gluster.org/en/v3/Administrator%20Guide/RDMA%20Transport/ Normally I build two servers NAT-ed on the same network. The GlusterFS server/client works: TCP works, but the RDMA transport does not.
@byronyi I tried what's suggested on this wiki: https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home My nodes run Ubuntu 16.04.4 LTS with Linux 4.7.0-rc3+ after installing the kernel and userspace pieces, and I can't run pingpong; the rxe test fails. I don't understand how to upgrade to the 4.14/4.15 kernels: with these steps the kernel goes from 4.4.0-116-generic only to 4.7.0-rc3+.
One of the contributors says this GitHub repo is no longer maintained and suggests following the "upstream kernel + rdma-core" method, which is the link below, so I started trying that: https://community.mellanox.com/docs/DOC-2184 My nodes are Ubuntu 16.04.4 LTS with kernel Linux 4.17.0-rc6 after the kernel/rdma-core installations. The problem is that the instructions have missing steps, like "sudo make install"; at the bottom of the page someone else tried it and their steps are different.
I am able to run pingpong with rxe, but rdma_cm fails when it comes to gluster-rdma support: port 24008 is never opened because rdma_cm fails with [No Device Found].
@lalith-b If you read the whole thread, GlusterFS RDMA was dropped at that time. If you have information that says otherwise, could you please share it? The point where I left off was that TCP worked but RDMA did not.
Hi @monis410, I'm an RDMA beginner. I've hit a problem very similar to a previous issue (https://github.com/SoftRoCE/rxe-dev/issues/49).
I worked around it by moving /usr/lib64/* to /usr/lib/, but after that I have problems with the connectivity tests.
My OS is Ubuntu 16.04 LTS (4.7.0-rc3+).
Some of my test results:
Then I tested connectivity both on one machine (self-to-self) and between a physical machine and a virtual machine. The machines can ping each other, so basic connectivity is fine, yet the test result is exactly the same.
Could you help me take a look at it? Thank you so much!
BTW, could Soft-RoCE work with python-rdma (https://github.com/jgunthorpe/python-rdma)? I tested that too and failed; I'm not sure whether the two problems share the same root cause.