SoftRoCE / rxe-dev

Development Repository for RXE

ibv_rc_pingpong not working on centos 8 #79

Closed ffly90 closed 4 years ago

ffly90 commented 4 years ago

I set up rxe on two virtual machines to play around with it a little. I used the distribution packages of CentOS 8.2, and everything seems to work as it should.
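For reference, a typical way to create such a device on CentOS 8 looks like the sketch below (eth0 is a placeholder for the real interface name, and older rdma-core versions ship an rxe_cfg script instead of the rdma tool):

[user@node1 ~]$ sudo modprobe rdma_rxe                        # load the soft-RoCE driver
[user@node1 ~]$ sudo rdma link add rxe0 type rxe netdev eth0  # bind rxe0 to the NIC
[user@node1 ~]$ rdma link show                                # confirm the link is present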

[user@node1 ~]$  ibv_devices
    device                 node GUID
    ------              ----------------
    rxe0                505400fffe96b707
[user@node1 ~]$ ibv_devinfo -d rxe0
hca_id: rxe0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      5054:00ff:fe96:b707
        sys_image_guid:                 0000:0000:0000:0000
        vendor_id:                      0x0000
        vendor_part_id:                 0
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

Even ib_send_bw -d rxe0 seems to work as intended.

But ibv_rc_pingpong fails like this:

[user@node1 ~]$ ibv_rc_pingpong -g 0 -d rxe0 -i 1
  local address:  LID 0x0000, QPN 0x000027, PSN 0x3c15d0, GID fe80::5054:ff:fe96:b707
Failed to modify QP to RTR
Couldn't connect to remote QP
[user@node2 ~]$ ibv_rc_pingpong -g 0 -d rxe0 -i 1 10.0.0.1
  local address:  LID 0x0000, QPN 0x00004e, PSN 0x3243f9, GID fe80::5054:ff:fe44:db57
client read/write: Protocol not supported
Couldn't read/write remote address
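For context, the -g flag of ibv_rc_pingpong selects a GID index, and the modify-to-RTR step needs a valid source GID at that index. A quick sanity check, sketched here with the device name from above, is to read that entry from sysfs; an all-zero value means the GID is not populated:

[user@node1 ~]$ cat /sys/class/infiniband/rxe0/ports/1/gids/0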

Also, if I try to run an MPI hello world with Open MPI, I get the following error:

[user@node1 ~]$ mpirun --mca pml ob1 --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 -np 8 --hostfile ~/openmpi_proj/my_hostfile ~/openmpi_proj/hello_c            
--------------------------------------------------------------------------                                                                                                                                         
No OpenFabrics connection schemes reported that they were able to be                                                                                                                                               
used on a specific port.  As such, the openib BTL (OpenFabrics                                                                                                                                                     
support) will be disabled for this port.                                                                                                                                                                           

  Local host:           node1
  Local device:         rxe0
  Local port:           1                                                                                
  CPCs attempted:       rdmacm                                                                           
--------------------------------------------------------------------------
Hello, world, I am 0 of 8, (Open MPI v4.0.2, package: Open MPI mockbuild@x86-02.mbox.centos.org Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 127)
Hello, world, I am 1 of 8, (Open MPI v4.0.2, package: Open MPI mockbuild@x86-02.mbox.centos.org Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 127)
Hello, world, I am 2 of 8, (Open MPI v4.0.2, package: Open MPI mockbuild@x86-02.mbox.centos.org Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 127)
Hello, world, I am 4 of 8, (Open MPI v4.0.2, package: Open MPI mockbuild@x86-02.mbox.centos.org Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 127)
Hello, world, I am 5 of 8, (Open MPI v4.0.2, package: Open MPI mockbuild@x86-02.mbox.centos.org Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 127)
Hello, world, I am 6 of 8, (Open MPI v4.0.2, package: Open MPI mockbuild@x86-02.mbox.centos.org Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 127)
Hello, world, I am 7 of 8, (Open MPI v4.0.2, package: Open MPI mockbuild@x86-02.mbox.centos.org Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 127)
Hello, world, I am 3 of 8, (Open MPI v4.0.2, package: Open MPI mockbuild@x86-02.mbox.centos.org Distribution, ident: 4.0.2, repo rev: v4.0.2, Oct 07, 2019, 127)
[node1:02111] 7 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[node1:02111] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
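As an aside, the rdmacm CPC also depends on a usable GID/IP pair on the RoCE port, and once the GID table is in order the index can be chosen explicitly. A sketch, assuming the btl_openib_gid_index parameter is available in this Open MPI build:

[user@node1 ~]$ mpirun --mca pml ob1 --mca btl openib,self,vader \
    --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 \
    --mca btl_openib_gid_index 1 \
    -np 8 --hostfile ~/openmpi_proj/my_hostfile ~/openmpi_proj/hello_c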

Am I running into a bug, or am I just not setting rxe up correctly?

mohammadheib commented 4 years ago

Hi @firefly-serenity, I ran into the same problem recently on Fedora 32, and I believe the issue on your OS has the same cause. On my system I ran into the following scenario:

  1. Before creating the rxe device:
     a. add an IPv4 address to the net device interface;
     b. add an IPv6 address to the net device interface (this will probably be added automatically as a link-local IP).

  2. Create the rxe device on top of the configured interface.

Afterward, print the GID table of your rxe device:

cat /sys/class/infiniband/*/ports/1/gids/* | head -n 4
fe80:0000:0000:0000:76e6:e2ff:fe05:0c78
0000:0000:0000:0000:0000:ffff:0101:0102
0000:0000:0000:0000:0000:0000:0000:0000
0000:0000:0000:0000:0000:0000:0000:0000

You can see that GID 0 represents the IPv6 address of the interface, but for some unknown reason on my system that IPv6 address disappeared, either after creating the rxe device or shortly before, so there was no such IP on my client side or on the server side.

Therefore, to resolve the issue on my system I configured the IPv6 address again after creating the rxe device, on both the client and the server side; the address to use can be found in the GID table.

If you don't care about the IP version, you can instead use GID #1 (the IPv4-mapped entry) and everything will work as expected; see the sketch below.
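A minimal sketch of that fix (assuming the underlying interface is eth0 and that IPv6 had been disabled via sysctl, both of which are assumptions here; the interface may need to be bounced before the link-local address reappears):

# re-enable IPv6 so the link-local address (and with it GID 0) comes back
sudo sysctl net.ipv6.conf.eth0.disable_ipv6=0
# verify that GID 0 is populated again
cat /sys/class/infiniband/rxe0/ports/1/gids/0
# or simply select the IPv4-mapped GID at index 1
ibv_rc_pingpong -g 1 -d rxe0 -i 1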

ffly90 commented 4 years ago

Thank you @mohammadheib, that solved my issue. I had disabled IPv6 because I thought rxe relied on IPv4 only, but with your insights I got it working :)