SoftRoCE / rxe-dev

Development Repository for RXE

rxe_mem_init_user() results in NULL pointer value during scatter/gather walk #77

Closed by fwmiller 5 years ago

fwmiller commented 5 years ago

I'm having a nasty little problem. I'm building rdma-core and the rxe driver into kernel 4.17 for an Altera Arria10 SoCFPGA that contains a dual-core Cortex-A9 processor. Everything builds, and the rxe device comes up fine:

root@arria10:~# rxe_cfg
rdma_rxe module not loaded
  Name  Link  Driver   Speed  NMTU  IPv4_addr  RDEV  RMTU
  eth0  yes   st_gmac         1500  10.0.1.30
root@arria10:~# rxe_cfg start
[   36.309689] rdma_rxe: loaded
[   36.342256] rdma_rxe: set rxe0 active
[   36.345932] rdma_rxe: added rxe0 to eth0
  Name  Link  Driver   Speed  NMTU  IPv4_addr  RDEV  RMTU
  eth0  yes   st_gmac         1500  10.0.1.30  rxe0  1024  (3)
root@arria10:~# rxe_cfg add eth0
root@arria10:~# rxe_cfg
  Name  Link  Driver   Speed  NMTU  IPv4_addr  RDEV  RMTU
  eth0  yes   st_gmac         1500  10.0.1.30  rxe0  1024  (3)
root@arria10:~# ibv_devices
    device                 node GUID
    ------              ----------------
    rxe0                541305fffe6da822
root@arria10:~# ibv_devinfo rxe0
hca_id: rxe0
        transport:                      InfiniBand (0)
        fw_ver:                         0.0.0
        node_guid:                      5413:05ff:fe6d:a822
        sys_image_guid:                 0000:0000:0000:0000
        vendor_id:                      0x0000
        vendor_part_id:                 0
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

This all looks good to me, but maybe someone else will see something wrong here. Now, when I try to run a ping test between this machine and a PC-based VM running the rdma-core software, I get a strange error:

root@arria10:~# udaddy -s 10.0.1.16
udaddy: starting client[  210.963782] rdma_rxe: null vaddr

udaddy: connecting
failed to reg MR
udaddy: failed to create messages: -1
test complete
Segmentation fault

I traced this issue to the routine rxe_mem_init_user() in the file rxe_mr.c. The call to ib_umem_get() returns a umem without reporting an error, but later in the code an iterator walks the list of scatter/gather entries, and page_address() returns NULL for one of the pages. That NULL triggers the "null vaddr" warning and the registration failure.
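
The failing walk can be sketched in plain userspace C. This is only a model, not the driver code: struct model_page, model_page_address() and model_walk() are hypothetical names invented for illustration, and a NULL linear_addr stands in for a page that has no kernel virtual address.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: it only records whether the
 * page has a kernel virtual address. NULL models a page without one. */
struct model_page {
    void *linear_addr;
};

/* Models page_address(): return the existing mapping, or NULL. */
static void *model_page_address(const struct model_page *pg)
{
    return pg->linear_addr;
}

/* Models the loop in rxe_mem_init_user(): store one address per page
 * and fail with -ENOMEM on the first page that has no address. */
static int model_walk(const struct model_page *pages, size_t n, void **out)
{
    assert(pages != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        void *vaddr = model_page_address(&pages[i]);

        if (!vaddr)
            return -ENOMEM;   /* the "null vaddr" case */
        out[i] = vaddr;
    }
    return 0;
}
```

In this model a single unmapped page anywhere in the list is enough to fail the whole registration, which matches the "failed to reg MR" message above.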

I wonder if anyone could comment or advise me on this error?

Thanks, FM

fwmiller commented 5 years ago

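This turned out to be a nasty little bug. On the ARM, a page has to come from the kernel's low memory pool for page_address() to return a usable kernel address. The user pages pinned here can live in high memory, which has no permanent kernel mapping, so they need to be mapped explicitly with kmap() and unmapped again with kunmap() on cleanup. Here's a patch: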

2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 5c2684b..f2dc5a7 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -31,6 +31,7 @@
  * SOFTWARE.
  */

+#include <linux/highmem.h>
 #include "rxe.h"
 #include "rxe_loc.h"

@@ -94,7 +95,15 @@ static void rxe_mem_init(int access, struct rxe_mem *mem)
 void rxe_mem_cleanup(struct rxe_pool_entry *arg)
 {
        struct rxe_mem *mem = container_of(arg, typeof(*mem), pelem);
-       int i;
+       int i, entry;
+       struct scatterlist *sg;
+
+       if (mem->kmap_occurred) {
+               for_each_sg(mem->umem->sg_head.sgl, sg,
+                           mem->umem->nmap, entry) {
+                       kunmap(sg_page(sg));
+               }
+       }

        if (mem->umem)
                ib_umem_release(mem->umem);
@@ -200,12 +209,14 @@ int rxe_mem_init_user(struct rxe_dev *rxe, struct rxe_pd *pd, u64 start,
                buf = map[0]->buf;

                for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
-                       vaddr = page_address(sg_page(sg));
+                       // vaddr = page_address(sg_page(sg));
+                       vaddr = kmap(sg_page(sg));
                        if (!vaddr) {
                                pr_warn("null vaddr\n");
                                err = -ENOMEM;
                                goto err1;
                        }
+                       mem->kmap_occurred = 1;

                        buf->addr = (uintptr_t)vaddr;
                        buf->size = BIT(umem->page_shift);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index af1470d..9bd7eac 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -343,6 +343,8 @@ struct rxe_mem {
        u32                     num_map;

        struct rxe_map          **map;
+
+       int                     kmap_occurred;
 };

 struct rxe_mc_grp {
--
2.7.4
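
The patch relies on each kmap() being undone by a matching kunmap(). A minimal userspace sketch of that pairing follows, under stated assumptions: struct hm_page, hm_kmap(), hm_kunmap(), hm_map_all() and hm_unmap_all() are hypothetical names, not kernel APIs, and the mappable flag exists only so the model can exercise an error path. The sketch also unwinds partial mappings on failure, which is one possible design and not something the patch above does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a highmem page. map_count tracks the
 * outstanding temporary mappings. */
struct hm_page {
    char data[16];
    int map_count;
    int mappable;   /* 0 forces hm_kmap() to fail (model only) */
};

/* Models kmap(): create a mapping and return its address. */
static void *hm_kmap(struct hm_page *pg)
{
    if (!pg->mappable)
        return NULL;
    pg->map_count++;
    return pg->data;
}

/* Models kunmap(): drop one mapping; unbalanced calls trip the assert. */
static void hm_kunmap(struct hm_page *pg)
{
    assert(pg->map_count > 0);
    pg->map_count--;
}

/* Map every page. On failure, undo only the mappings already made. */
static int hm_map_all(struct hm_page *pages, size_t n, void **out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = hm_kmap(&pages[i]);
        if (!out[i]) {
            while (i-- > 0)
                hm_kunmap(&pages[i]);
            return -ENOMEM;
        }
    }
    return 0;
}

/* Counterpart of the cleanup loop: one unmap per mapped page. */
static void hm_unmap_all(struct hm_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        hm_kunmap(&pages[i]);
}
```

Tracking a count per page, rather than one flag for the whole region as the patch does, makes it possible to check that no mapping is left behind after either the success path or the failure path.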