erpc-io / eRPC

Efficient RPCs for datacenter networks

help me~😭, "Failed to create self AH." #86

Closed gaishun closed 2 years ago

gaishun commented 2 years ago

Hello,

I am having trouble getting eRPC to work. Could you please help me out? It would be much appreciated.

I cloned the repo and compiled the master branch with cmake -DPERF=OFF -DTRANSPORT=infiniband -DROCE=on; make -j16. (I use the latest rdma-core, 41.0, and my Mellanox NIC runs with the Ethernet link layer.)
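Since -DROCE=on only makes sense when the port really runs over Ethernet, it may be worth confirming what the kernel reports for each port. The following is a minimal sketch, assuming the usual Linux sysfs layout (/sys/class/infiniband/<dev>/ports/<port>/link_layer); print_link_layers is a hypothetical helper name and not part of eRPC or rdma-core.

```shell
#!/bin/bash
# Print the link layer reported for every RDMA device port.
# Assumption: the kernel exposes
#   /sys/class/infiniband/<dev>/ports/<port>/link_layer
# print_link_layers is a hypothetical helper, not an eRPC or rdma-core tool.
print_link_layers() {
  local root="${1:-/sys/class/infiniband}"
  local f dev port
  for f in "$root"/*/ports/*/link_layer; do
    [ -r "$f" ] || continue   # pattern did not match: no RDMA devices visible
    port="$(basename "$(dirname "$f")")"
    dev="$(basename "$(dirname "$(dirname "$(dirname "$f")")")")"
    echo "$dev port $port: $(cat "$f")"
  done
}
```

Calling print_link_layers with no argument reads the live sysfs tree. A RoCE port should show Ethernet; an empty result suggests that no RDMA device is visible to the kernel at all.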

All the tests (run with sudo ctest) failed. I then tried to run the hello world example, which I compiled with make infiniband. When I started the server with sudo ./server, it reported: Failed to create self AH.

I have been trying to solve this problem for 3 days. GDB reports EINVAL, which comes from inside rdma-core (rdma-core-41.0/libibverbs/cmd_fallback.c:246):

    req->command = write_method; // write_method = 0 
    req->in_words = __check_divide(req_size, 4);
    req->out_words = __check_divide(resp_size, 4);

    if (write(ctx->cmd_fd, req, req_size) != req_size) // cmd_fd = 19 
        return errno; // errno = 26

I have no idea what is wrong. Maybe it is caused by the latest version of rdma-core? (Just a guess.)

By the way, could you please suggest some resources (websites or other material) for learning RDMA programming? Thank you for all your assistance.

Best wishes

ankalia commented 2 years ago

Hi Gaishun. Could you please try with rdma-core v30.0? I don't have access to servers with the latest rdma-core, so I won't be able to fully reproduce this.

You can install rdma-core v30.0 using git checkout v30.0; cmake .; make -j; sudo make install

Here's an example script to uninstall the current rdma-core if you installed it from source:

    #!/bin/bash
    #
    # Usage: ./rdma-core-uninstall.sh
    #
    # Uninstall a source install of rdma-core on an Ubuntu system. Mostly tested
    # on Ubuntu 18.04 and rdma-core versions 20--36 ish.

    echo "Deleting libib* libmlx* libefa.so librdmacm.so from /usr/local/lib"
    cd /usr/local/lib
    sudo rm -rf libib* libmlx* libefa.so librdmacm.so

    echo "Deleting rdma/ and infiniband/ from /usr/local/include"
    cd /usr/local/include
    sudo rm -rf rdma infiniband
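
Because the script runs sudo rm -rf on glob patterns, it can be useful to preview what would be removed first. This is only a sketch under assumptions: preview_rdma_core_uninstall is a hypothetical helper, and the optional prefix argument is an addition for installs that did not go to /usr/local. It prints matching paths and deletes nothing.

```shell
#!/bin/bash
# List the paths the uninstall script would delete, without deleting anything.
# preview_rdma_core_uninstall is a hypothetical helper; the prefix argument is
# an assumption that allows checking a non-default install prefix.
preview_rdma_core_uninstall() {
  local prefix="${1:-/usr/local}"
  local path
  for path in "$prefix"/lib/libib* "$prefix"/lib/libmlx* \
              "$prefix"/lib/libefa.so "$prefix"/lib/librdmacm.so \
              "$prefix"/include/rdma "$prefix"/include/infiniband; do
    # Skip patterns that matched nothing (dangling symlinks are still listed).
    [ -e "$path" ] || [ -L "$path" ] || continue
    echo "$path"
  done
}
```

Running preview_rdma_core_uninstall with no argument checks /usr/local. If it prints nothing, rdma-core was probably not installed from source under that prefix.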

gaishun commented 2 years ago

Thank you sincerely for your help. I tried v30.0, but it doesn't work either (I still cannot create the AH).

On my server, I aggregate two NICs (Mellanox Technologies MT27710 Family [ConnectX-4 Lx]) into one. Could that be the reason? According to the docs, it does not seem to be related.
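
To see how the aggregated NICs appear to the RDMA stack, one option is to map each RDMA device to its network interface and check whether that interface belongs to a bond. This is a rough sketch, assuming the sysfs layout /sys/class/infiniband/<dev>/device/net/<ifname> and that an enslaved interface has a master link in its sysfs directory; list_rdma_netdevs is a hypothetical helper name.

```shell
#!/bin/bash
# Map each RDMA device to its network interfaces and flag enslaved ones.
# Assumptions: sysfs exposes /sys/class/infiniband/<dev>/device/net/<ifname>,
# and an interface that is part of a bond has a "master" link in that directory.
# list_rdma_netdevs is a hypothetical helper name.
list_rdma_netdevs() {
  local root="${1:-/sys/class/infiniband}"
  local d dev ifname
  for d in "$root"/*/device/net/*; do
    [ -e "$d" ] || continue   # no RDMA devices or no netdevs found
    ifname="$(basename "$d")"
    dev="$(basename "$(dirname "$(dirname "$(dirname "$d")")")")"
    if [ -e "$d/master" ]; then
      echo "$dev -> $ifname (enslaved to $(basename "$(readlink -f "$d/master")"))"
    else
      echo "$dev -> $ifname"
    fi
  done
}
```

With no argument the function reads the live sysfs tree. Whether bonding actually affects AH creation is not established in this thread; the output only shows how the devices are wired together.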

ankalia commented 2 years ago

Thanks for trying it out. I will try to reproduce it.

gaishun commented 2 years ago

Thanks for your attention and help. I have solved this problem by changing the kernel version (3.10.0). (I had done a lot of work on the previous kernel.)

Thanks for your code, and thanks for your help. Thanks again.