Closed: ffly90 closed this issue 4 years ago
Hi @firefly-serenity, I recently ran into the same problem on Fedora 32, and I believe the issue on your OS has the same cause. On my system the scenario was as follows:
Before creating the rxe device:
a. Add an IPv4 address to the network interface.
b. Add an IPv6 address to the network interface (it will probably be added automatically as a link-local address).
Then create the rxe device on top of the configured interface.
Afterward, print the GID table of your rxe device:
cat /sys/class/infiniband/*/ports/1/gids/* | head -n 4
fe80:0000:0000:0000:76e6:e2ff:fe05:0c78
0000:0000:0000:0000:0000:ffff:0101:0102
0000:0000:0000:0000:0000:0000:0000:0000
0000:0000:0000:0000:0000:0000:0000:0000
You can see that GID 0 represents the IPv6 address of the interface. For some unknown reason, on my system that IPv6 address disappeared from the interface, either after the rxe device was created or possibly before, so it was no longer configured on either the client or the server.
To resolve the issue on my system, I configured the IPv6 address again after creating the rxe device, on both the client and the server side. The address to use can be found in the GID table.
If you don't care about the IP version, you can use GID #1 and everything will work as expected.
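To make the GID table above easier to read, here is a minimal sketch that decodes each entry into the address it stands for. It assumes the entries are plain 128-bit IPv6-style strings as printed by the cat command, and that an IPv4 address shows up in IPv4-mapped form (::ffff:a.b.c.d). The helper name classify_gid is made up for illustration and is not part of any rdma tool.

```python
import ipaddress


def classify_gid(gid: str) -> str:
    """Describe one GID table entry (hypothetical helper, for illustration only)."""
    addr = ipaddress.IPv6Address(gid)
    if addr.is_unspecified:
        # All-zero entries are unused slots in the table.
        return "empty"
    if addr.ipv4_mapped is not None:
        # ::ffff:a.b.c.d is the IPv4-mapped form of an IPv4 address.
        return f"IPv4 {addr.ipv4_mapped}"
    if addr.is_link_local:
        return f"IPv6 link-local {addr.compressed}"
    return f"IPv6 {addr.compressed}"


# The four entries from the output above; the list index is the GID index.
gids = [
    "fe80:0000:0000:0000:76e6:e2ff:fe05:0c78",
    "0000:0000:0000:0000:0000:ffff:0101:0102",
    "0000:0000:0000:0000:0000:0000:0000:0000",
    "0000:0000:0000:0000:0000:0000:0000:0000",
]

for index, gid in enumerate(gids):
    print(index, classify_gid(gid))
```

If GID 0 decodes to a link-local IPv6 address that is no longer configured on the interface, that matches the situation described above; GID 1 is the IPv4-backed entry.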
Thank you @mohamedheib, that solved my issue. I had disabled IPv6 because I thought rxe relied on IPv4 only. With your insights I got it working :)
I set up rxe on two virtual machines to play around a little bit. I used the distribution packages of CentOS 8.2, and everything seems to work as it should.
Even ib_send_bw -d rxe0 seems to work as intended.
But ibv_rc_pingpong fails like this:
Also, if I try to run an MPI hello world with Open MPI, I get the following error:
Am I running into a bug, or am I just not good enough to set rxe up right?