This is reproducible for me using the current dev branch.
Rank 1 receives messages on the Ethernet layer even though it is not their destination. With many ranks and messages this leads to deadlocks, possibly because we run out of RX buffers(?)
I have also observed this behavior with 16 ranks, although it does not cause deadlocks in that case.
You may want to use my changes for improved logging to see which ETH and DMA logs belong to which rank: https://github.com/Mellich/ACCL/tree/improve-emulator-logging
Run the sendrcv_basic test for 16 ranks with the emulator:
The test should succeed, but in the logs we can see something like this:
[Rank 11: VERBOSE 11:51:58] ETH Send 128 bytes to 10
[Rank 1: VERBOSE 11:51:58] ETH Receive 128 bytes from 11
[Rank 10: VERBOSE 11:51:58] ETH Receive 128 bytes from 11
In this excerpt, Rank 1 receives the message that Rank 11 sent to Rank 10, although Rank 1 is not the destination. In this scenario, only Rank 1 seems to be affected.
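To spot these stray receives in longer logs, something like the following sketch might help. It is not part of ACCL; the function name `find_stray_receives` and the regular expression are my own assumptions, and the pattern is derived only from the three log lines in the excerpt above, so it may need adjusting for other log formats. It pairs every `ETH Receive` with an `ETH Send` of the same size from the claimed sender to that receiver and reports the receives that have no matching send.

```python
import re
from collections import Counter

# Assumed line format, derived only from the log excerpt in this report, e.g.
# "[Rank 11: VERBOSE 11:51:58] ETH Send 128 bytes to 10"
LINE_RE = re.compile(
    r"\[Rank (?P<rank>\d+): \w+ [\d:]+\] ETH (?P<op>Send|Receive) "
    r"(?P<size>\d+) bytes (?:to|from) (?P<peer>\d+)"
)

def find_stray_receives(lines):
    """Return (receiver, sender, size) for each receive without a matching send."""
    sends = Counter()   # (sender, destination, size) -> count
    receives = []       # (receiver, sender, size)
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an ETH send/receive line
        rank, peer, size = int(m["rank"]), int(m["peer"]), int(m["size"])
        if m["op"] == "Send":
            sends[(rank, peer, size)] += 1
        else:
            receives.append((rank, peer, size))

    stray = []
    for receiver, sender, size in receives:
        key = (sender, receiver, size)
        if sends[key] > 0:
            sends[key] -= 1  # consume one matching send
        else:
            stray.append((receiver, sender, size))
    return stray

sample = [
    "[Rank 11: VERBOSE 11:51:58] ETH Send 128 bytes to 10",
    "[Rank 1: VERBOSE 11:51:58] ETH Receive 128 bytes from 11",
    "[Rank 10: VERBOSE 11:51:58] ETH Receive 128 bytes from 11",
]
print(find_stray_receives(sample))  # [(1, 11, 128)]
```

Sends are collected before any receive is checked, so the order of the lines in the merged log does not matter. Messages are matched only by sender, receiver and size, so this is a rough filter rather than an exact trace.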