Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

Allreduce hangs after AlltoAll #200

Closed. lawirz closed this issue 2 months ago

lawirz commented 2 months ago

If you run alltoall and then allreduce, the allreduce hangs, both in the simulator and on RDMA. Running them in the opposite order does not trigger the bug.

It can be reproduced by enabling only these two test cases in test/host/xrt/src/test.cpp:

TEST_P(ACCLFuncTest, test_allreduce)

TEST_F(ACCLTest, test_alltoall)

Setting up TestEnvironment
[----------] 1 test from ACCLTest
[ RUN      ] ACCLTest.test_alltoall
[       OK ] ACCLTest.test_alltoall (22 ms)
[       OK ] ACCLTest.test_alltoall (65 ms)
[----------] 1 test from ACCLTest (22 ms total)

[----------] 1 test from ACCLTest (65 ms total)

[----------] 2 tests from reduction_tests/ACCLFuncTest
[----------] 2 tests from reduction_tests/ACCLFuncTest
[ RUN      ] reduction_tests/ACCLFuncTest.test_allreduce/0
[ RUN      ] reduction_tests/ACCLFuncTest.test_allreduce/0

If I enable other tests as well, it hangs even earlier.

I ran it on the merged 196 error (one commit behind dev).

quetric commented 2 months ago

Thanks @lawirz for bringing this to my attention. Can you please provide the full command line you used to replicate this bug in simulation/emulation? (commands to start emulator/simulator, and commands to start test)

I've been able to replicate this for the specific case where an Eager protocol operation runs after the All2All (which always runs with Rendezvous). The bug does not seem to be particular to allreduce.

lawirz commented 2 months ago

I started the emulator with python run.py -n 2 -c cyt_rdma, and ran the test from test/host/xrt with mpirun -np 2 bin/test --cyt_rdma. The size should therefore default to 16 in both cases. I just commented out all the other test cases.
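
For anyone reproducing this, a hang is easier to catch if the test command is wrapped in a timeout rather than left to block the terminal. Below is a minimal Python sketch, not part of ACCL. It assumes the emulator is already running and that the script is started from test/host/xrt. The helper name run_with_timeout, the constant REPRO_CMD and the 120-second limit are illustrative choices only.

```python
import subprocess

# Test command as quoted in this thread (run from test/host/xrt).
REPRO_CMD = ["mpirun", "-np", "2", "bin/test", "--cyt_rdma"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it did not finish in time."""
    try:
        # On timeout, subprocess.run kills the child process it started
        # and then raises TimeoutExpired.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


def report(code):
    """Turn the result of run_with_timeout into a short status string."""
    if code is None:
        return "timed out (possible hang)"
    return "passed" if code == 0 else f"failed with exit code {code}"
```

Calling report(run_with_timeout(REPRO_CMD, timeout_s=120)) then gives a one-line status for the run. Only the mpirun process itself is killed on timeout, so depending on the MPI implementation the ranks may need to be cleaned up separately.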

quetric commented 2 months ago

@lawirz I pushed a fix for this issue to the bugfix branch. Please test on your side. Testing on HW will require rebuilding your bitstream.

lawirz commented 2 months ago

I was able to test in the simulator, and the issue is fixed there. I will test on HW too.

lawirz commented 2 months ago

It works on hardware, too.