Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

Deadlocks with "complex" communication patterns #160

Closed: Mellich closed this issue 10 months ago

Mellich commented 10 months ago

For certain P2P communication patterns, ACCL seems to lose data once messages exceed a certain size, which leads to deadlocks. This is reproducible with the ACCL emulator using both the UDP and the TCP backend. I created a minimal example using 3 ranks, based on the ACCL unit tests: https://github.com/Mellich/ACCL/blob/3b2c02440282fc2cecb70f7325572e490e4cd984/test/host/xrt/src/test.cpp#L76C29-L76C29
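For context, the pattern is roughly of the following shape (an illustrative sketch only, not the actual test body linked above; the send/recv calls mirror ACCL's MPI-like host API, but the exact method signatures, the buffer setup, and the rank roles shown here are assumptions):

// Illustrative sketch of a 3-rank P2P pattern of the kind that triggers the issue.
// Signatures are approximate; see the linked test.cpp for the real test.
#include <accl.hpp>

void complex_comm_pattern(ACCL::ACCL &accl, int rank, unsigned int count) {
  auto buf = accl.create_buffer<float>(count, ACCL::dataType::float32);
  const unsigned int tag = 0;

  if (rank == 0) {
    // rank 0 sends one message each to ranks 1 and 2
    accl.send(*buf, count, /*dst=*/1, tag);
    accl.send(*buf, count, /*dst=*/2, tag);
  } else if (rank == 1) {
    // rank 1 receives from rank 0 and forwards to rank 2
    accl.recv(*buf, count, /*src=*/0, tag);
    accl.send(*buf, count, /*dst=*/2, tag);
  } else {
    // rank 2 receives two messages whose packets may interleave on the network
    accl.recv(*buf, count, /*src=*/0, tag);
    accl.recv(*buf, count, /*src=*/1, tag);
  }
}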

Execution with up to 368 values per message succeeds:

mpirun --tag-output -n 3 bin/test --gtest_filter=ACCLTest.test_complex_comm --rxbuf-size 4 -s 368

However, executions with 384 values or more deadlock. An interesting observation is that the first failing value count (384) corresponds to 1536 bytes, which in turn equals the maximum packet size defined in the CCLO firmware header.
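For reference, the arithmetic behind that observation (assuming 4-byte elements, which the 384-to-1536-byte conversion implies; the constant names below are placeholders, not the actual identifiers from the firmware header):

// Boundary check: 384 elements x 4 B = 1536 B, the quoted maximum packet size;
// 368 elements x 4 B = 1472 B still fits into a single packet.
#include <cstddef>

constexpr std::size_t elem_bytes      = 4;                 // e.g. sizeof(float)
constexpr std::size_t max_packet_size = 1536;              // value quoted above
constexpr std::size_t last_working    = 368 * elem_bytes;  // 1472 B, succeeds
constexpr std::size_t first_failing   = 384 * elem_bytes;  // 1536 B, deadlocks
static_assert(first_failing == max_packet_size,
              "the first failing message size hits the packet-size limit");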

quetric commented 10 months ago

Hi @Mellich thanks for taking the time to report this.

I've reproduced your issue exactly for UDP; for TCP, however, the deadlocks start at 386 elements.

Please check if this is the case for you as well.

quetric commented 10 months ago

On further investigation, this appears to be a problem with sendrecv in general, as the following also deadlocks:

mpirun --tag-output -n 2 bin/test --gtest_filter=ACCLTest.test_sendrcv --rxbuf-size 4 -s 384 -u

Mellich commented 10 months ago

For TCP, the deadlock already happens at 384 elements for me. Maybe it is not a hard bound? I tried TCP with even fewer elements and it still deadlocked. Sometimes the first receive on rank 2 even completes, but validation of the data fails.

I can also reproduce the deadlock for test_sendrcv and 384 elements.

quetric commented 10 months ago

These might be separate problems for TCP and UDP.

For UDP I've found the problem and a fix; see the associated branch. I'm still looking into the TCP deadlock.

quetric commented 10 months ago

@Mellich the TCP problem seems to be a race condition; the same test fails intermittently. Is this what you see as well? Or is there a specific message size that always fails the test?

Mellich commented 10 months ago

Yes, a race condition seems quite likely for TCP. As mentioned above, I get different behavior for the same message size. In some cases the first receive even succeeds, but with the wrong data. So maybe something is off with the RX buffer bookkeeping?

I tried the fix for UDP and it also works for me in emulation. I will also validate the fix in hardware.

quetric commented 10 months ago

@Mellich I've pushed a fix for TCP as well. Indeed, this was a problem with fragmentation handling: when messages were interleaved in a specific way (hence the intermittent failures), a fragment was assigned to a different message. That corrupted the mis-targeted message (hence the data mismatch) and caused the deadlock, because the target message was never completely filled.
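The failure mode can be illustrated with a small software model (this is not the CCLO depacketizer code, which is HLS firmware; the structs and the keying scheme below are made up purely to show the effect of matching fragments too loosely):

// Simplified model of the bug: if reassembly state is matched by source rank
// only, interleaved fragments from two in-flight messages of the same source
// end up in the same pending message. One message "completes" with foreign
// data (the data mismatch), the other never fills up (the deadlock).
#include <cstdint>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Fragment {
  uint32_t src;                  // sending rank
  uint32_t seq;                  // sequence number of the message it belongs to
  std::vector<uint8_t> payload;
};

struct PendingMsg {
  std::size_t expected_bytes = 0;
  std::vector<uint8_t> data;
};

// Buggy variant: fragments are matched by source rank only, f.seq is ignored.
void deliver_fragment_buggy(std::map<uint32_t, PendingMsg> &pending_by_src,
                            const Fragment &f) {
  auto &msg = pending_by_src[f.src];
  msg.data.insert(msg.data.end(), f.payload.begin(), f.payload.end());
}

// Fixed variant: fragments are matched by (source rank, sequence number), so
// each fragment is appended to the message it actually belongs to.
void deliver_fragment_fixed(
    std::map<std::pair<uint32_t, uint32_t>, PendingMsg> &pending,
    const Fragment &f) {
  auto &msg = pending[{f.src, f.seq}];
  msg.data.insert(msg.data.end(), f.payload.begin(), f.payload.end());
}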

Please test before I merge this into dev.

quetric commented 10 months ago

Closing, as the issues are fixed in the emulator, where they were initially reported.