fpgasystems / Coyote

Framework providing operating system abstractions and a range of shared networking (RDMA, TCP/IP) and memory services to common modern heterogeneous platforms.
MIT License

Coyote v2 RDMA fails under certain benchmark #71

Open zhenhaohe opened 3 months ago

zhenhaohe commented 3 months ago

I was testing Coyote v2 with the RDMA perf hardware design and the RDMA service software application.

  1. The RDMA read benchmark is unstable and fails with the default number of repetitions specified in the software. The experiment below does not return.

./bin/test -d 0 -i 0 -t 10.1.212.177 -x 2048
Queue pair:
Local : QPN 0x000000, PSN 0x22b267, VADDR 00007fe912200000, SIZE 00010000, IP 0x0afd4a60
Remote: QPN 0x000000, PSN 0x30c5c7, VADDR 00007feefbc00000, SIZE 00010000, IP 0x0afd4a5c
Client registered
Sent payload

RDMA BENCHMARK
1024 [bytes], thoughput: 19.94 [MB/s], latency: 33100.42 [ns]
2048 [bytes], thoughput: 2124.81 [MB/s], latency: 8167.80 [ns]

  2. The RDMA write benchmark does not scale beyond a 4K message size; it stalls after the 4096-byte result until interrupted:

./bin/test -d 0 -i 0 -t 10.1.212.175 -x 1024 -r 10 -l 10 -w 1
Queue pair:
Local : QPN 0x000000, PSN 0x9bd652, VADDR 00007fbc23e00000, SIZE 00010000, IP 0x0afd4a58
Remote: QPN 0x000000, PSN 0xa03ec3, VADDR 00007fe9b5400000, SIZE 00010000, IP 0x0afd4a54
Client registered
Sent payload

RDMA BENCHMARK
1024 [bytes], thoughput: 870.19 [MB/s], latency: 5824.05 [ns]
2048 [bytes], thoughput: 1976.83 [MB/s], latency: 6007.90 [ns]
4096 [bytes], thoughput: 3813.60 [MB/s], latency: 6559.50 [ns]
^Cterminate called after throwing an instance of 'std::runtime_error'
  what(): Stalled, SIGINT caught
Aborted
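
For anyone trying to reproduce this without babysitting the terminal, here is a minimal sketch of a wrapper that runs the benchmark under a timeout and reports the last message size that produced a result line. It is not part of Coyote: the helper names `parse_results` and `run_with_timeout` and the 60-second default are hypothetical, and the regular expression only assumes the result-line format shown in the output above (including the program's own spelling "thoughput"). If the benchmark buffers stdout when it is piped, the partial output after a timeout may be empty.

```python
import re
import subprocess

# Matches result lines such as:
#   "4096 [bytes], thoughput: 3813.60 [MB/s], latency: 6559.50 [ns]"
# The spelling "thoughput" is copied from the benchmark output in this issue.
RESULT_RE = re.compile(
    r"(\d+) \[bytes\], thoughput: ([\d.]+) \[MB/s\], latency: ([\d.]+) \[ns\]"
)


def parse_results(output):
    """Return a list of (size_bytes, throughput_mb_s, latency_ns) tuples."""
    return [
        (int(size), float(tput), float(lat))
        for size, tput, lat in RESULT_RE.findall(output)
    ]


def run_with_timeout(cmd, timeout_s=60.0):
    """Run cmd (a list of arguments) and return (results, timed_out).

    Hypothetical helper: on timeout the child process is killed and
    whatever output was captured so far is parsed.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return parse_results(proc.stdout), False
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        # Depending on the Python version, the partial output may be bytes
        # even when text=True was requested.
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return parse_results(partial), True
```

Called as `run_with_timeout(["./bin/test", "-d", "0", "-i", "0", "-t", "10.1.212.175", "-x", "1024", "-r", "10", "-l", "10", "-w", "1"])`, a `timed_out` value of `True` together with a last entry of 4096 bytes would correspond to the stall reported in item 2.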

JonasDann commented 2 months ago

Yes, this is a known issue with RDMA at the moment. @maximilianheer is working on a fix that is hopefully coming soon.