Open shenben opened 1 year ago
The command to launch the server, ./src/server, is failing in the log that you provided, hence the line:
E0220 13:28:51.620015625 109190 completion_queue.cc:254] assertion failed: queue.num_items() == 0
This is causing all of the tests that use RDMA to fail, because the server needs to run concurrently with them to perform tasks like registering memory and setting up connected QPs. What type of NIC do you have on the machine where you are running the command?
I would check that rping works well, and that you have sudo permission.
(DGL) emc_admin@emcsvr01:~$ sudo ethtool eno3
Settings for eno3:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Link partner advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Link partner advertised pause frame use: No
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: on
Supports Wake-on: g
Wake-on: d
Current message level: 0x000000ff (255)
drv probe link timer ifdown ifup rx_err tx_err
Link detected: yes
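The ethtool output above describes the Ethernet link only; it does not show whether the kernel has registered any RDMA device for ibverbs to use. As a rough cross-check, assuming the usual Linux sysfs layout in which RDMA devices (InfiniBand and RoCE alike) appear under /sys/class/infiniband, a sketch like the following lists them. Note that list_rdma_devices is a hypothetical helper written for this illustration, not part of FAM.

```cpp
#include <dirent.h>

#include <string>
#include <vector>

// Hypothetical helper (not part of FAM): returns the names of the entries
// under `sysfs_dir`, skipping "." and "..". The default path is an assumption
// based on the standard Linux sysfs layout for RDMA devices.
std::vector<std::string> list_rdma_devices(
    const std::string& sysfs_dir = "/sys/class/infiniband") {
  std::vector<std::string> names;
  DIR* dir = opendir(sysfs_dir.c_str());
  if (dir == nullptr) {
    // Directory missing or unreadable: report no devices.
    return names;
  }
  while (dirent* entry = readdir(dir)) {
    const std::string name = entry->d_name;
    if (name != "." && name != "..") {
      names.push_back(name);
    }
  }
  closedir(dir);
  return names;
}
```

If list_rdma_devices() returns an empty vector on the machine running ./src/server, no RDMA device is visible to the kernel there, which would be worth resolving before looking further at the gRPC assertion.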
The code has only ever been run on InfiniBand HCAs, so I'm not sure if there are any portability issues with ibverbs when running on RoCE. From googling the error message in your output:
E0220 13:28:51.620015625 109190 completion_queue.cc:254] assertion failed: queue.num_items() == 0
It appears this is an error from gRPC.
Maybe one of these lines from src/FAM/server.cpp is failing:
ServerBuilder builder;
builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
builder.RegisterService(&service_);
cq_ = builder.AddCompletionQueue();
server_ = builder.BuildAndStart();
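One possibility worth ruling out for the lines above is that the listening address cannot be bound at all, for example because another process already holds the port. The sketch below is a gRPC-independent check using plain POSIX sockets. It is only a hint: gRPC may set socket options that this sketch does not, and can_bind_tcp_port is a hypothetical helper, not part of FAM or gRPC.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Hypothetical helper: returns true if a TCP socket can be bound to
// 0.0.0.0:<port>. No socket options are set, so a port that is already
// bound by another socket is reported as unavailable.
bool can_bind_tcp_port(std::uint16_t port) {
  const int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return false;  // Could not even create a socket.
  }
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  const int rc =
      bind(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr));
  close(fd);  // The socket is only a probe; release it immediately.
  return rc == 0;
}
```

Calling can_bind_tcp_port with the port taken from server_address, just before launching ./src/server, would show whether the port is free at that moment. A false result suggests looking for a stale server process before digging into the completion queue assertion.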
In danieldzahka/FAM at commit 98c0045 (https://github.com/danieldzahka/FAM/commit/98c0045315be1f101fa11d0fc518a8f39deb6d59), after successfully running cmake and make, some errors occurred when executing the commands ./src/server and make test. The information is as follows:
My installation commands are:
_tbb 2021.8.0 gcc 7.5.0 cmake 3.25.2 ubuntu 20.04 x86_64
Looking forward to your solution. Thank you @danieldzahka in advance!