Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

ETH messages received by rank 1 although not destination of message #172

Closed Mellich closed 10 months ago

Mellich commented 10 months ago

This is reproducible for me on the current dev branch. Rank 1 receives messages on the Ethernet layer even though it is not their destination. With many ranks and messages this leads to deadlocks, possibly because we run out of RX buffers(?). I have also observed the behavior with 16 ranks, although it does not cause a deadlock in that case. You may want to use my improved-logging changes to see which ETH and DMA log lines belong to which rank: https://github.com/Mellich/ACCL/tree/improve-emulator-logging

Run the sendrcv_basic test for 16 ranks with the emulator:

mpirun --tag-output -n 16 bin/test --gtest_filter=ACCLTest.test_sendrcv_basic -d
python run.py -n 16 -l 4

The test should succeed, but in the logs we can see something like this:

[Rank  11: VERBOSE  11:51:58] ETH Send 128 bytes to 10
[Rank   1: VERBOSE  11:51:58] ETH Receive 128 bytes from 11
[Rank  10: VERBOSE  11:51:58] ETH Receive 128 bytes from 11

In the excerpt above, rank 11 sends 128 bytes to rank 10, but rank 1 also logs a receive from rank 11 even though it is not the destination. In this scenario only rank 1 seems to be affected.
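
To find such stray receives in a longer log, a small script along the following lines might help. It is only a sketch and not part of ACCL. It assumes every relevant line follows the format of the three lines in the excerpt above (`[Rank N: LEVEL HH:MM:SS] ETH Send|Receive B bytes to|from M`). The name `find_stray_receives` is made up for illustration. Because log lines from different ranks can interleave, it collects all sends first and then matches each receive against them.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt in this issue:
#   [Rank  11: VERBOSE  11:51:58] ETH Send 128 bytes to 10
#   [Rank   1: VERBOSE  11:51:58] ETH Receive 128 bytes from 11
LINE_RE = re.compile(
    r"\[Rank\s+(?P<rank>\d+):\s+\w+\s+[\d:]+\]\s+ETH\s+"
    r"(?P<op>Send|Receive)\s+(?P<size>\d+)\s+bytes\s+(?:to|from)\s+(?P<peer>\d+)"
)


def find_stray_receives(lines):
    """Return (receiver, sender, size) for each receive without a matching send."""
    sends = Counter()   # (sender, destination, size) -> number of unmatched sends
    receives = []       # (receiver, sender, size) in log order

    # First pass: collect all sends and receives, ignoring unrelated lines.
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        rank, peer, size = int(m["rank"]), int(m["peer"]), int(m["size"])
        if m["op"] == "Send":
            sends[(rank, peer, size)] += 1
        else:
            receives.append((rank, peer, size))

    # Second pass: pair each receive with one send addressed to that receiver.
    stray = []
    for receiver, sender, size in receives:
        key = (sender, receiver, size)
        if sends[key] > 0:
            sends[key] -= 1
        else:
            stray.append((receiver, sender, size))
    return stray
```

Fed the three log lines above, this reports the receive on rank 1 from rank 11 as stray and matches the receive on rank 10 to the send. Passing `sys.stdin` (or an open log file) as `lines` would scan a full captured log.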

quetric commented 10 months ago

Fixed in #174