Closed by fwmiller 5 years ago
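This turned out to be a nasty little bug. On 32-bit ARM, the user pages can come from high memory, which has no permanent kernel mapping, so page_address() returns NULL for them. The pages have to be mapped into kernel space with kmap() before the driver can use them, and unmapped again with kunmap() on cleanup. Here's a patch: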
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 5c2684b..f2dc5a7 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -31,6 +31,7 @@
* SOFTWARE.
*/
+#include <linux/highmem.h>
#include "rxe.h"
#include "rxe_loc.h"
@@ -94,7 +95,15 @@ static void rxe_mem_init(int access, struct rxe_mem *mem)
void rxe_mem_cleanup(struct rxe_pool_entry *arg)
{
struct rxe_mem *mem = container_of(arg, typeof(*mem), pelem);
- int i;
+ int i, entry;
+ struct scatterlist *sg;
+
+ if (mem->kmap_occurred) {
+ for_each_sg(mem->umem->sg_head.sgl, sg,
+ mem->umem->nmap, entry) {
+ kunmap(sg_page(sg));
+ }
+ }
if (mem->umem)
ib_umem_release(mem->umem);
@@ -200,12 +209,14 @@ int rxe_mem_init_user(struct rxe_dev *rxe, struct rxe_pd *pd, u64 start,
buf = map[0]->buf;
for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
- vaddr = page_address(sg_page(sg));
+ // vaddr = page_address(sg_page(sg));
+ vaddr = kmap(sg_page(sg));
if (!vaddr) {
pr_warn("null vaddr\n");
err = -ENOMEM;
goto err1;
}
+ mem->kmap_occurred = 1;
buf->addr = (uintptr_t)vaddr;
buf->size = BIT(umem->page_shift);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index af1470d..9bd7eac 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -343,6 +343,8 @@ struct rxe_mem {
u32 num_map;
struct rxe_map **map;
+
+ int kmap_occurred;
};
struct rxe_mc_grp {
--
2.7.4
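For anyone trying to picture why page_address() came back NULL while kmap() works, here is a minimal userspace sketch of the difference. It is only a model under stated assumptions: `struct model_page`, `model_page_init()`, `model_page_address()`, `model_kmap()` and `model_kunmap()` are hypothetical names invented for illustration, and `malloc()` merely stands in for the kernel setting up a temporary mapping. None of this is the kernel's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; these are NOT the kernel's definitions. */
#define MODEL_PAGE_SIZE 4096

struct model_page {
    int   highmem;   /* 1 if the page lives in high memory */
    void *vaddr;     /* current kernel-side address, NULL if unmapped */
    int   map_count; /* outstanding model_kmap() calls on this page */
};

/* Backing store standing in for the permanent low-memory mapping. */
static unsigned char model_lowmem[MODEL_PAGE_SIZE];

static void model_page_init(struct model_page *p, int highmem)
{
    p->highmem = highmem;
    p->vaddr = highmem ? NULL : model_lowmem; /* lowmem is always mapped */
    p->map_count = 0;
}

/* Models page_address(): reports an existing mapping, never creates one. */
static void *model_page_address(const struct model_page *p)
{
    return p->vaddr;
}

/* Models kmap(): creates a mapping on demand for a highmem page. */
static void *model_kmap(struct model_page *p)
{
    if (!p->highmem)
        return p->vaddr;
    if (p->map_count == 0) {
        p->vaddr = malloc(MODEL_PAGE_SIZE); /* stand-in for a new mapping */
        if (!p->vaddr)
            return NULL;
    }
    p->map_count++;
    return p->vaddr;
}

/* Models kunmap(): drops the mapping when the last user is gone. */
static void model_kunmap(struct model_page *p)
{
    if (!p->highmem)
        return;
    assert(p->map_count > 0); /* every unmap must pair with a map */
    if (--p->map_count == 0) {
        free(p->vaddr);
        p->vaddr = NULL;
    }
}
```

In this model, an unmapped highmem page gives NULL from `model_page_address()`, which corresponds to the "null vaddr" warning, while `model_kmap()` returns a usable address that stays valid until the matching `model_kunmap()`. That pairing is why the patch records `kmap_occurred` and unmaps in rxe_mem_cleanup().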
I'm having a nasty little problem. I'm building rdma-core and the rxe driver into kernel 4.17 for an Altera Arria10 SoCFPGA that contains a dual-core Cortex-A9 processor. I've got everything to build, and the rxe device comes up fine:
This all looks good to me, but maybe someone else will see something wrong here. Now, when I try to do a ping between this machine and a PC-based VM running the rdma-core software, I get a strange error:
I traced this issue to the routine rxe_mem_init_user() in the file rxe_mr.c. It appears that the call to ib_umem_get() returns a value for the variable umem without reporting an error, but later in the code an iterator walks down a list of scatter/gather entries and the address for one of them comes up NULL, which causes the error.
I wonder if anyone could comment or advise me on this error?
Thanks, FM