danieldzahka / FAM

BSD 3-Clause "New" or "Revised" License

running error #23

Open shenben opened 1 year ago

shenben commented 1 year ago

In danieldzahka/FAM [98c0045] (https://github.com/danieldzahka/FAM/commit/98c0045315be1f101fa11d0fc518a8f39deb6d59), after successfully running cmake and make, errors occurred when executing ./src/server and make test. The output is as follows:

(DGL) emc_admin@emcsvr01:~/FAM/build$ ./src/server
[2023-02-20 13:28:51.618] [info] Starting Server
[2023-02-20 13:28:51.619] [info] Server listening on 0.0.0.0:50051
E0220 13:28:51.620015625  109190 completion_queue.cc:254]    assertion failed: queue.num_items() == 0
Aborted (core dumped)
(DGL) emc_admin@emcsvr01:~/FAM/build$ make test
Running tests...
Test project /home/emc_admin/FAM/build
      Start  1: unittests.RPC Consruction
 1/59 Test  #1: unittests.RPC Consruction ........................................................................***Failed    0.02 sec
      Start  2: unittests.RPC Consruction Multi-Channel
 2/59 Test  #2: unittests.RPC Consruction Multi-Channel ..........................................................***Failed    0.02 sec
      Start  3: unittests.RPC PING
 3/59 Test  #3: unittests.RPC PING ...............................................................................***Failed    0.01 sec
      Start  4: unittests.RPC Allocate Region
 4/59 Test  #4: unittests.RPC Allocate Region ....................................................................***Failed    0.02 sec
      Start  5: unittests.Client Create rdma Buffer
 5/59 Test  #5: unittests.Client Create rdma Buffer ..............................................................***Failed    0.01 sec
      Start  6: unittests.rdma Write
 6/59 Test  #6: unittests.rdma Write .............................................................................***Failed    0.01 sec
      Start  7: unittests.rdma mmap
 7/59 Test  #7: unittests.rdma mmap ..............................................................................***Failed    0.01 sec
      Start  8: unittests.rdma mmap multi-channel
 8/59 Test  #8: unittests.rdma mmap multi-channel ................................................................***Failed    0.01 sec
      Start  9: unittests.Vector Read Upper Limit of WRs
 9/59 Test  #9: unittests.Vector Read Upper Limit of WRs .........................................................***Failed    0.01 sec
      Start 10: unittests.Test .idx reading
10/59 Test #10: unittests.Test .idx reading ......................................................................   Passed    0.00 sec
      Start 11: unittests.LocalGraph Construction - NopDecompressor, 0
11/59 Test #11: unittests.LocalGraph Construction - NopDecompressor, 0 ...........................................   Passed    0.02 sec
      Start 12: unittests.LocalGraph Construction - famgraph::tools::DeltaDecompressor, 1
12/59 Test #12: unittests.LocalGraph Construction - famgraph::tools::DeltaDecompressor, 1 ........................   Passed    0.01 sec
      Start 13: unittests.RemoteGraph Construction - NopDecompressor, 0
13/59 Test #13: unittests.RemoteGraph Construction - NopDecompressor, 0 ..........................................***Failed    0.01 sec
      Start 14: unittests.RemoteGraph Construction - famgraph::tools::DeltaDecompressor, 1
14/59 Test #14: unittests.RemoteGraph Construction - famgraph::tools::DeltaDecompressor, 1 .......................***Failed    0.02 sec
      Start 15: unittests.LocalGraph Vertex Table - NopDecompressor, 0
15/59 Test #15: unittests.LocalGraph Vertex Table - NopDecompressor, 0 ...........................................   Passed    0.01 sec
      Start 16: unittests.LocalGraph Vertex Table - famgraph::tools::DeltaDecompressor, 1
16/59 Test #16: unittests.LocalGraph Vertex Table - famgraph::tools::DeltaDecompressor, 1 ........................   Passed    0.01 sec
      Start 17: unittests.Vertex Filter
17/59 Test #17: unittests.Vertex Filter ..........................................................................   Passed    0.06 sec
      Start 18: unittests.Local Filter Edgemap - NopDecompressor, 0
18/59 Test #18: unittests.Local Filter Edgemap - NopDecompressor, 0 ..............................................   Passed    0.03 sec
      Start 19: unittests.Local Filter Edgemap - famgraph::tools::DeltaDecompressor, 1
19/59 Test #19: unittests.Local Filter Edgemap - famgraph::tools::DeltaDecompressor, 1 ...........................   Passed    0.03 sec
      Start 20: unittests.Local Filter Edgemap with Ranges - NopDecompressor, 0
20/59 Test #20: unittests.Local Filter Edgemap with Ranges - NopDecompressor, 0 ..................................   Passed    0.02 sec
      Start 21: unittests.Local Filter Edgemap with Ranges - famgraph::tools::DeltaDecompressor, 1
21/59 Test #21: unittests.Local Filter Edgemap with Ranges - famgraph::tools::DeltaDecompressor, 1 ...............   Passed    0.03 sec
      Start 22: unittests.Remote Filter Edgemap - NopDecompressor, 0
22/59 Test #22: unittests.Remote Filter Edgemap - NopDecompressor, 0 .............................................***Failed    0.01 sec
      Start 23: unittests.Remote Filter Edgemap - famgraph::tools::DeltaDecompressor, 1
23/59 Test #23: unittests.Remote Filter Edgemap - famgraph::tools::DeltaDecompressor, 1 ..........................***Failed    0.02 sec
      Start 24: unittests.Remote Filter Edgemap with Ranges - NopDecompressor, 0
24/59 Test #24: unittests.Remote Filter Edgemap with Ranges - NopDecompressor, 0 .................................***Failed    0.01 sec
      Start 25: unittests.Remote Filter Edgemap with Ranges - famgraph::tools::DeltaDecompressor, 1
25/59 Test #25: unittests.Remote Filter Edgemap with Ranges - famgraph::tools::DeltaDecompressor, 1 ..............***Failed    0.02 sec
      Start 26: unittests.LocalGraph Breadth First Search - NopDecompressor, 0
26/59 Test #26: unittests.LocalGraph Breadth First Search - NopDecompressor, 0 ...................................   Passed    0.02 sec
      Start 27: unittests.LocalGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1
27/59 Test #27: unittests.LocalGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1 ................   Passed    0.02 sec
      Start 28: unittests.RemoteGraph Breadth First Search - NopDecompressor, 0
28/59 Test #28: unittests.RemoteGraph Breadth First Search - NopDecompressor, 0 ..................................***Failed    0.02 sec
      Start 29: unittests.RemoteGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1
29/59 Test #29: unittests.RemoteGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1 ...............***Failed    0.02 sec
      Start 30: unittests.LocalGraph Kcore Decomposition - NopDecompressor, 0
30/59 Test #30: unittests.LocalGraph Kcore Decomposition - NopDecompressor, 0 ....................................   Passed    0.03 sec
      Start 31: unittests.LocalGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1
31/59 Test #31: unittests.LocalGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1 .................   Passed    0.02 sec
      Start 32: unittests.RemoteGraph Kcore Decomposition - NopDecompressor, 0
32/59 Test #32: unittests.RemoteGraph Kcore Decomposition - NopDecompressor, 0 ...................................***Failed    0.02 sec
      Start 33: unittests.RemoteGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1
33/59 Test #33: unittests.RemoteGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1 ................***Failed    0.01 sec
      Start 34: unittests.LocalGraph ConnectedComponents - NopDecompressor, 0
34/59 Test #34: unittests.LocalGraph ConnectedComponents - NopDecompressor, 0 ....................................   Passed    0.01 sec
      Start 35: unittests.LocalGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1
35/59 Test #35: unittests.LocalGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1 .................   Passed    0.01 sec
      Start 36: unittests.RemoteGraph ConnectedComponents - NopDecompressor, 0
36/59 Test #36: unittests.RemoteGraph ConnectedComponents - NopDecompressor, 0 ...................................***Failed    0.02 sec
      Start 37: unittests.RemoteGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1
37/59 Test #37: unittests.RemoteGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1 ................***Failed    0.02 sec
      Start 38: unittests.LocalGraph PageRank - NopDecompressor, 0
38/59 Test #38: unittests.LocalGraph PageRank - NopDecompressor, 0 ...............................................   Passed    0.02 sec
      Start 39: unittests.LocalGraph PageRank - famgraph::tools::DeltaDecompressor, 1
39/59 Test #39: unittests.LocalGraph PageRank - famgraph::tools::DeltaDecompressor, 1 ............................   Passed    0.02 sec
      Start 40: unittests.RemoteGraph PageRank - NopDecompressor, 0
40/59 Test #40: unittests.RemoteGraph PageRank - NopDecompressor, 0 ..............................................***Failed    0.01 sec
      Start 41: unittests.RemoteGraph PageRank - famgraph::tools::DeltaDecompressor, 1
41/59 Test #41: unittests.RemoteGraph PageRank - famgraph::tools::DeltaDecompressor, 1 ...........................***Failed    0.02 sec
      Start 42: unittests.Large Graph LocalGraph Breadth First Search - NopDecompressor, 0
42/59 Test #42: unittests.Large Graph LocalGraph Breadth First Search - NopDecompressor, 0 .......................***Failed    0.01 sec
      Start 43: unittests.Large Graph LocalGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1
43/59 Test #43: unittests.Large Graph LocalGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1 ....***Failed    0.01 sec
      Start 44: unittests.Large Graph RemoteGraph Breadth First Search - NopDecompressor, 0
44/59 Test #44: unittests.Large Graph RemoteGraph Breadth First Search - NopDecompressor, 0 ......................***Failed    0.02 sec
      Start 45: unittests.Large Graph RemoteGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1
45/59 Test #45: unittests.Large Graph RemoteGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1 ...***Failed    0.01 sec
      Start 46: unittests.Large Graph LocalGraph Kcore Decomposition - NopDecompressor, 0
46/59 Test #46: unittests.Large Graph LocalGraph Kcore Decomposition - NopDecompressor, 0 ........................***Failed    0.01 sec
      Start 47: unittests.Large Graph LocalGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1
47/59 Test #47: unittests.Large Graph LocalGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1 .....***Failed    0.01 sec
      Start 48: unittests.Large Graph RemoteGraph Kcore Decomposition - NopDecompressor, 0
48/59 Test #48: unittests.Large Graph RemoteGraph Kcore Decomposition - NopDecompressor, 0 .......................***Failed    0.01 sec
      Start 49: unittests.Large Graph RemoteGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1
49/59 Test #49: unittests.Large Graph RemoteGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1 ....***Failed    0.01 sec
      Start 50: unittests.Large Graph LocalGraph ConnectedComponents - NopDecompressor, 0
50/59 Test #50: unittests.Large Graph LocalGraph ConnectedComponents - NopDecompressor, 0 ........................***Failed    0.01 sec
      Start 51: unittests.Large Graph LocalGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1
51/59 Test #51: unittests.Large Graph LocalGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1 .....***Failed    0.01 sec
      Start 52: unittests.Large Graph RemoteGraph ConnectedComponents - NopDecompressor, 0
52/59 Test #52: unittests.Large Graph RemoteGraph ConnectedComponents - NopDecompressor, 0 .......................***Failed    0.01 sec
      Start 53: unittests.Large Graph RemoteGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1
53/59 Test #53: unittests.Large Graph RemoteGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1 ....***Failed    0.01 sec
      Start 54: unittests.Large Graph LocalGraph PageRank - NopDecompressor, 0
54/59 Test #54: unittests.Large Graph LocalGraph PageRank - NopDecompressor, 0 ...................................***Failed    0.01 sec
      Start 55: unittests.Large Graph LocalGraph PageRank - famgraph::tools::DeltaDecompressor, 1
55/59 Test #55: unittests.Large Graph LocalGraph PageRank - famgraph::tools::DeltaDecompressor, 1 ................***Failed    0.01 sec
      Start 56: unittests.Large Graph RemoteGraph PageRank - NopDecompressor, 0
56/59 Test #56: unittests.Large Graph RemoteGraph PageRank - NopDecompressor, 0 ..................................***Failed    0.01 sec
      Start 57: unittests.Large Graph RemoteGraph PageRank - famgraph::tools::DeltaDecompressor, 1
57/59 Test #57: unittests.Large Graph RemoteGraph PageRank - famgraph::tools::DeltaDecompressor, 1 ...............***Failed    0.01 sec
      Start 58: unittests.Pack and Unpack
58/59 Test #58: unittests.Pack and Unpack ........................................................................   Passed    0.02 sec
      Start 59: unittests.Compress Decompress
59/59 Test #59: unittests.Compress Decompress ....................................................................   Passed    0.26 sec

34% tests passed, 39 tests failed out of 59

Total Test time (real) =   1.17 sec

The following tests FAILED:
      1 - unittests.RPC Consruction (Failed)
      2 - unittests.RPC Consruction Multi-Channel (Failed)
      3 - unittests.RPC PING (Failed)
      4 - unittests.RPC Allocate Region (Failed)
      5 - unittests.Client Create rdma Buffer (Failed)
      6 - unittests.rdma Write (Failed)
      7 - unittests.rdma mmap (Failed)
      8 - unittests.rdma mmap multi-channel (Failed)
      9 - unittests.Vector Read Upper Limit of WRs (Failed)
     13 - unittests.RemoteGraph Construction - NopDecompressor, 0 (Failed)
     14 - unittests.RemoteGraph Construction - famgraph::tools::DeltaDecompressor, 1 (Failed)
     22 - unittests.Remote Filter Edgemap - NopDecompressor, 0 (Failed)
     23 - unittests.Remote Filter Edgemap - famgraph::tools::DeltaDecompressor, 1 (Failed)
     24 - unittests.Remote Filter Edgemap with Ranges - NopDecompressor, 0 (Failed)
     25 - unittests.Remote Filter Edgemap with Ranges - famgraph::tools::DeltaDecompressor, 1 (Failed)
     28 - unittests.RemoteGraph Breadth First Search - NopDecompressor, 0 (Failed)
     29 - unittests.RemoteGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1 (Failed)
     32 - unittests.RemoteGraph Kcore Decomposition - NopDecompressor, 0 (Failed)
     33 - unittests.RemoteGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1 (Failed)
     36 - unittests.RemoteGraph ConnectedComponents - NopDecompressor, 0 (Failed)
     37 - unittests.RemoteGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1 (Failed)
     40 - unittests.RemoteGraph PageRank - NopDecompressor, 0 (Failed)
     41 - unittests.RemoteGraph PageRank - famgraph::tools::DeltaDecompressor, 1 (Failed)
     42 - unittests.Large Graph LocalGraph Breadth First Search - NopDecompressor, 0 (Failed)
     43 - unittests.Large Graph LocalGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1 (Failed)
     44 - unittests.Large Graph RemoteGraph Breadth First Search - NopDecompressor, 0 (Failed)
     45 - unittests.Large Graph RemoteGraph Breadth First Search - famgraph::tools::DeltaDecompressor, 1 (Failed)
     46 - unittests.Large Graph LocalGraph Kcore Decomposition - NopDecompressor, 0 (Failed)
     47 - unittests.Large Graph LocalGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1 (Failed)
     48 - unittests.Large Graph RemoteGraph Kcore Decomposition - NopDecompressor, 0 (Failed)
     49 - unittests.Large Graph RemoteGraph Kcore Decomposition - famgraph::tools::DeltaDecompressor, 1 (Failed)
     50 - unittests.Large Graph LocalGraph ConnectedComponents - NopDecompressor, 0 (Failed)
     51 - unittests.Large Graph LocalGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1 (Failed)
     52 - unittests.Large Graph RemoteGraph ConnectedComponents - NopDecompressor, 0 (Failed)
     53 - unittests.Large Graph RemoteGraph ConnectedComponents - famgraph::tools::DeltaDecompressor, 1 (Failed)
     54 - unittests.Large Graph LocalGraph PageRank - NopDecompressor, 0 (Failed)
     55 - unittests.Large Graph LocalGraph PageRank - famgraph::tools::DeltaDecompressor, 1 (Failed)
     56 - unittests.Large Graph RemoteGraph PageRank - NopDecompressor, 0 (Failed)
     57 - unittests.Large Graph RemoteGraph PageRank - famgraph::tools::DeltaDecompressor, 1 (Failed)
Errors while running CTest
Output from these tests are in: /home/emc_admin/FAM/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [Makefile:71: test] Error 8

My installation commands are:

mkdir build
cd build
cmake ..
make

Environment: tbb 2021.8.0, gcc 7.5.0, cmake 3.25.2, Ubuntu 20.04, x86_64

Looking forward to your solution. Thank you @danieldzahka in advance!

danieldzahka commented 1 year ago

The command that launches the server, ./src/server, is failing in the log you provided, as shown by the line: E0220 13:28:51.620015625 109190 completion_queue.cc:254] assertion failed: queue.num_items() == 0. This causes all of the tests that use RDMA to fail, because the server must run concurrently with them to perform tasks such as registering memory and setting up connected QPs. What type of NIC do you have on the machine where you are running the command?
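Because the RDMA tests depend on a live server, it can help to confirm that ./src/server is still listening on port 50051 before running make test. The sketch below is a hypothetical helper, not part of FAM: port_accepts_connections simply attempts a plain TCP connect with POSIX sockets, so it only shows that some process is accepting connections on that port, not that the gRPC service or the RDMA setup behind it is healthy.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of FAM): returns true if some process is
// accepting TCP connections on ipv4_host:port. It says nothing about whether
// that process is a healthy FAM server.
bool port_accepts_connections(const char* ipv4_host, std::uint16_t port) {
  assert(ipv4_host != nullptr);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  // Reject anything that is not a dotted-quad IPv4 address.
  if (::inet_pton(AF_INET, ipv4_host, &addr.sin_addr) != 1) return false;

  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return false;
  // Plain blocking connect: succeeds only if a listener completes the handshake.
  bool connected =
      ::connect(fd, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr)) == 0;
  ::close(fd);
  return connected;
}
```

In the situation from this log, port_accepts_connections("127.0.0.1", 50051) would be expected to return false shortly after launch, since the server aborts right after printing "Server listening".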

I would check:

  1. That you have an InfiniBand HCA that supports RC transport.
  2. That you have permission to set up RDMA connections.
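One way to check the first point is to look at the link layer each RDMA device reports. On Linux the kernel exposes it in sysfs as /sys/class/infiniband/<device>/ports/<port>/link_layer, which reads InfiniBand for an IB HCA and Ethernet for RoCE devices, including Soft-RoCE (rxe); ibv_devinfo from rdma-core prints the same field. The sketch below reads port 1 of every device. The function name is hypothetical, and it only inspects sysfs, so it says nothing about RC support or permissions.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns (device, link layer of port 1) for every RDMA
// device under sysfs_root, sorted by device name. An empty result means the
// kernel exposes no RDMA devices at all.
std::vector<std::pair<std::string, std::string>> rdma_link_layers(
    const fs::path& sysfs_root = "/sys/class/infiniband") {
  std::vector<std::pair<std::string, std::string>> result;
  std::error_code ec;
  if (!fs::is_directory(sysfs_root, ec)) return result;
  for (const fs::directory_entry& dev : fs::directory_iterator(sysfs_root, ec)) {
    // Each entry is a (symlinked) device directory such as mlx5_0 or rxe0.
    std::ifstream in(dev.path() / "ports" / "1" / "link_layer");
    std::string layer;
    if (in >> layer) result.emplace_back(dev.path().filename().string(), layer);
  }
  std::sort(result.begin(), result.end());
  return result;
}
```

On a Soft-RoCE host like the one described below, one would presumably see an rxe device reporting Ethernet rather than any device reporting InfiniBand.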
shenben commented 1 year ago
  1. My machines are connected over Ethernet with Soft-RoCE configured, and rping works well.
  2. I have sudo permissions.
  3. My NIC information:
    
    (DGL) emc_admin@emcsvr01:~$ sudo ethtool eno3
    Settings for eno3:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Half 1000baseT/Full 
    Supported pause frame use: No
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Half 1000baseT/Full 
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Link partner advertised link modes:  10baseT/Half 10baseT/Full 
                                         100baseT/Half 100baseT/Full 
                                         1000baseT/Full 
    Link partner advertised pause frame use: No
    Link partner advertised auto-negotiation: Yes
    Link partner advertised FEC modes: Not reported
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: on
    Supports Wake-on: g
    Wake-on: d
    Current message level: 0x000000ff (255)
                   drv probe link timer ifdown ifup rx_err tx_err
    Link detected: yes
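Since these hosts use Soft-RoCE rather than an InfiniBand HCA, one common IB-only assumption in verbs code is worth keeping in mind: addressing the remote side of an RC QP by LID alone when moving it to RTR. RoCE has no LIDs, so the address handle must carry a GRH with the peer's GID (is_global = 1, grh.dgid, grh.sgid_index). Whether FAM fills in the GRH has not been checked here. The sketch below only illustrates the decision, using a hypothetical RemoteAddress struct that mirrors the relevant ibv_ah_attr fields rather than the real libibverbs type; the hop_limit of 1 follows rdma-core's ibv_rc_pingpong example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

using Gid = std::array<std::uint8_t, 16>;

// Hypothetical stand-in for the ibv_ah_attr fields that matter here; real code
// would fill ibv_qp_attr::ah_attr before the ibv_modify_qp call that moves
// the QP to IBV_QPS_RTR.
struct RemoteAddress {
  bool is_global = false;       // ibv_ah_attr::is_global
  std::uint16_t dlid = 0;       // ibv_ah_attr::dlid
  Gid dgid{};                   // ibv_ah_attr::grh.dgid
  std::uint8_t sgid_index = 0;  // ibv_ah_attr::grh.sgid_index
  std::uint8_t hop_limit = 0;   // ibv_ah_attr::grh.hop_limit
};

// link_layer is the string sysfs / ibv_devinfo report: "InfiniBand" or "Ethernet".
RemoteAddress make_remote_address(const std::string& link_layer,
                                  std::uint16_t peer_lid, const Gid& peer_gid,
                                  std::uint8_t local_gid_index) {
  RemoteAddress addr;
  if (link_layer == "Ethernet") {
    // RoCE: there are no LIDs, so route with a GRH built from GIDs.
    addr.is_global = true;
    addr.dgid = peer_gid;
    addr.sgid_index = local_gid_index;
    addr.hop_limit = 1;
  } else {
    // InfiniBand within one subnet: the destination LID is enough.
    addr.dlid = peer_lid;
  }
  return addr;
}
```

Code that takes only the else branch on every device would connect fine on an IB fabric but not over RoCE, which is the kind of portability gap that could surface on this setup.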
danieldzahka commented 1 year ago

The code has only ever been run on InfiniBand HCAs, so I'm not sure if there are any portability issues with ibverbs when running on RoCE. From googling the error message in your output:

E0220 13:28:51.620015625  109190 completion_queue.cc:254]    assertion failed: queue.num_items() == 0

It appears this is an error from gRPC.

Maybe one of these lines in src/FAM/server.cpp is failing:

    ServerBuilder builder;
    builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
    builder.RegisterService(&service_);
    cq_ = builder.AddCompletionQueue();

    server_ = builder.BuildAndStart();
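For context, gRPC requires a completion queue to be shut down and fully drained before it is destroyed, and this assertion in completion_queue.cc fires when a queue is destroyed while it still holds undelivered events. For an async server the documented order is server_->Shutdown(), then cq_->Shutdown(), then calling cq_->Next(&tag, &ok) until it returns false. One plausible but unconfirmed reading of the log is that the server process starts exiting right after BuildAndStart() (for example, if later RDMA setup fails on Soft-RoCE) and cq_ is destroyed without that drain. The block below is a toy model of the invariant, not the gRPC API; ToyCompletionQueue and shutdown_and_drain are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model (NOT the gRPC API) of the invariant gRPC enforces in
// completion_queue.cc: a completion queue may only be destroyed empty.
class ToyCompletionQueue {
 public:
  ~ToyCompletionQueue() { assert(events_.empty()); }  // analogue of queue.num_items() == 0

  void Push(void* tag) { events_.push_back(tag); }  // an operation completes
  void Shutdown() { shutdown_ = true; }
  bool Empty() const { return events_.empty(); }

  // Analogue of grpc::CompletionQueue::Next: hands out queued events and
  // returns false once the queue is shut down and empty. (The real Next
  // blocks while the queue is empty but not shut down; the toy instead
  // requires Shutdown() to have been called before it runs dry.)
  bool Next(void** tag) {
    assert(shutdown_ || !events_.empty());
    if (events_.empty()) return false;
    *tag = events_.front();
    events_.pop_front();
    return true;
  }

 private:
  std::deque<void*> events_;
  bool shutdown_ = false;
};

// Shut down, then drain until Next() reports false; returns how many events
// were drained. Mirrors the real sequence (after server_->Shutdown()):
//   cq_->Shutdown(); void* tag; bool ok; while (cq_->Next(&tag, &ok)) {}
std::size_t shutdown_and_drain(ToyCompletionQueue& cq) {
  cq.Shutdown();
  void* tag = nullptr;
  std::size_t drained = 0;
  while (cq.Next(&tag)) ++drained;
  return drained;
}
```

Letting a ToyCompletionQueue with pending events go out of scope without shutdown_and_drain trips the destructor assertion, the same shape of failure as the abort in the log.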