Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

Gather wrong order #198

Open lawirz opened 3 months ago

lawirz commented 3 months ago

This issue concerns the branch to resolve issue 196: https://github.com/Xilinx/ACCL/tree/196-reduceallreduce-issues-on-cyt_rdma

On two-node setups running on cyt_rdma, gather sometimes swaps the output blocks of the first and second rank. The error is not observed in the emulator setup. In hardware, it occurs in roughly 50% of runs.

Allgather, on the other hand, doesn't produce erroneous behaviour.

It only occurred after recompiling test/host/Coyote/test.cpp. The binary compiled from the previous version still worked when run with a new bitstream.

Rank 0

stdout ``` Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '7' '-c' '24' '-l' './accl_log/fpga' '-p' '1' '-n' '1' Running ACCL test in coyote... Initializing MPI... Reading MPI rank and size values... Parsing options Hardware rdma mode count:24 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2 Getting MPI Processor name... [process 0] rank 0 size 2 alveo-u55c-04.inf.ethz.ch Testing ACCL base functionality... 10.253.74.80 10.253.74.92 Initializing QP connections... Exchanging QP... Local rank 0 sending local QP to remote rank 1 Local rank 0 receiving remote QP from remote rank 1 Queue Pair: id: 1 Local Queue: local: QPN 0x000002, PSN 0x2aec2a, VADDR 00007f1980200000, SIZE 00200000, IP 0x0afd4a50, Remote Queue: remote: QPN 0x000001, PSN 0x6f7034, VADDR 00007f431ce00000, SIZE 00200000, IP 0x0afd4a5c, rank: 0 FPGA IP: afd4a50 Rendezvous Protocol sw nop time [us]:92.656 hw nop time [ns]:940 Start gather test with root 0... Repetition 0 Pass accl barrier host measured durationUs:42.371 1th item is incorrect! (24.000000 != 0.000000) 2th item is incorrect! (25.000000 != 1.000000) 3th item is incorrect! (26.000000 != 2.000000) 4th item is incorrect! (27.000000 != 3.000000) 5th item is incorrect! (28.000000 != 4.000000) 6th item is incorrect! (29.000000 != 5.000000) 7th item is incorrect! (30.000000 != 6.000000) 8th item is incorrect! (31.000000 != 7.000000) 9th item is incorrect! (32.000000 != 8.000000) 10th item is incorrect! (33.000000 != 9.000000) 11th item is incorrect! (34.000000 != 10.000000) 12th item is incorrect! (35.000000 != 11.000000) 13th item is incorrect! (36.000000 != 12.000000) 14th item is incorrect! (37.000000 != 13.000000) 15th item is incorrect! (38.000000 != 14.000000) 16th item is incorrect! (39.000000 != 15.000000) 17th item is incorrect! (40.000000 != 16.000000) 18th item is incorrect! (41.000000 != 17.000000) 19th item is incorrect! (42.000000 != 18.000000) 20th item is incorrect! (43.000000 != 19.000000) 21th item is incorrect! 
(44.000000 != 20.000000) 22th item is incorrect! (45.000000 != 21.000000) 23th item is incorrect! (46.000000 != 22.000000) 24th item is incorrect! (47.000000 != 23.000000) 1th item is incorrect! (0.000000 != 24.000000) 2th item is incorrect! (1.000000 != 25.000000) 3th item is incorrect! (2.000000 != 26.000000) 4th item is incorrect! (3.000000 != 27.000000) 5th item is incorrect! (4.000000 != 28.000000) 6th item is incorrect! (5.000000 != 29.000000) 7th item is incorrect! (6.000000 != 30.000000) 8th item is incorrect! (7.000000 != 31.000000) 9th item is incorrect! (8.000000 != 32.000000) 10th item is incorrect! (9.000000 != 33.000000) 11th item is incorrect! (10.000000 != 34.000000) 12th item is incorrect! (11.000000 != 35.000000) 13th item is incorrect! (12.000000 != 36.000000) 14th item is incorrect! (13.000000 != 37.000000) 15th item is incorrect! (14.000000 != 38.000000) 16th item is incorrect! (15.000000 != 39.000000) 17th item is incorrect! (16.000000 != 40.000000) 18th item is incorrect! (17.000000 != 41.000000) 19th item is incorrect! (18.000000 != 42.000000) 20th item is incorrect! (19.000000 != 43.000000) 21th item is incorrect! (20.000000 != 44.000000) 22th item is incorrect! (21.000000 != 45.000000) 23th item is incorrect! (22.000000 != 46.000000) 24th item is incorrect! (23.000000 != 47.000000) 48 errors! ERROR: ACCL base functionality test failed! STATISTICS - ID: 0 ----------------------------------------------- Read command FIFO used: 0 Write command FIFO used: 0 Host reads sent: 1 Host writes sent: 2 Card reads sent: 1 Card writes sent: 1 Sync reads sent: 5 Sync writes sent: 0 Page faults: 0 NET STATS QSFP0 RX pkgs: 50 TX pkgs: 5 ARP RX pkgs: 2 ARP TX pkgs: 2 ICMP RX pkgs: 0 ICMP TX pkgs: 0 TCP RX pkgs: 0 TCP TX pkgs: 0 ROCE RX pkgs: 3 ROCE TX pkgs: 3 IBV RX pkgs: 6 IBV TX pkgs: 4 PSN drop cnt: 0 Retrans cnt: 0 TCP session cnt: 0 STRM down: 0 Finalizing MPI... Done. Terminating... ```
stderr ``` XRT build version: 2.13.466 Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776 Build date: 2022-04-14 17:43:11 Git branch: 2022.1 PID: 256891 UID: 500207 [Wed May 29 21:24:18 2024 GMT] HOST: alveo-u55c-04.inf.ethz.ch EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote [XRT] ERROR: No devices found [XRT] ERROR: No devices found [XRT] ERROR: No devices found ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0 ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1 ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2 CCLO HWID: 4147289406 at 0x0 CCLO source commit (first 24b): f7329d CCLO Capabilities: Stack type: RDMA Internal DMA:True External DMA:False Reduction:True Compression:True Kernel Streams:True Debug:False Doing a soft reset Configuring Eager RX Buffers get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1 Allocation successful! Allocated buffer: 7f197f600000, Size: 64 calling offload: 7f197f600000, size: 64 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1 Allocation successful! Allocated buffer: 7f197f400000, Size: 64 calling offload: 7f197f400000, size: 64 Configuring Rendezvous Spare Buffers get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7f197f000000, Size: 4194304 calling offload: 7f197f000000, size: 4194304 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7f197ec00000, Size: 4194304 calling offload: 7f197ec00000, size: 4194304 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! 
page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7f197e800000, Size: 4194304 calling offload: 7f197e800000, size: 4194304 Configuring a communicator Configuring arithmetic Configuring collective tuning parameters CCLO configured Set timeout Set max eager size: 64 Set max rendezvous reduce size: 4194304 Accelerator ready! Communicator 0 (0x40): local rank: 0 number of ranks: 2 > rank 0 (ip 10.253.74.80:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 > rank 1 (ip 10.253.74.92:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 Rank 0 passed last barrier before test! CCLO address: 0 rx address: 4 Spare RX Buffer 0: address: 0x7f197f600000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 Spare RX Buffer 1: address: 0x7f197f400000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 CoyoteBuffer contructor called! page_size:2097152, buffer_size:96,n_pages:1 Allocation successful! Allocated buffer: 7f197e600000, Size: 96 CoyoteBuffer contructor called! page_size:2097152, buffer_size:192,n_pages:1 Allocation successful! Allocated buffer: 7f197e400000, Size: 192 Gather data from 0... 
Free user buffer from cProc cPid:0, buffer_size:96,7f197e600000 Free user buffer from cProc cPid:0, buffer_size:192,7f197e400000 Communicator 0 (0x40): local rank: 0 number of ranks: 2 > rank 0 (ip 10.253.74.80:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 > rank 1 (ip 10.253.74.92:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 1, -> outbound seq number 0 CCLO address: 0 rx address: 4 Spare RX Buffer 0: address: 0x7f197f600000 status: ENQUEUED occupancy: 96/64 MPI tag: ffffffff seq: 0 src: 1 Spare RX Buffer 1: address: 0x7f197f400000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 Removing CCLO object at 0 Doing a soft reset Free user buffer from cProc cPid:0, buffer_size:64,7f197f600000 Free user buffer from cProc cPid:0, buffer_size:64,7f197f400000 Free user buffer from cProc cPid:0, buffer_size:4194304,7f197f000000 Free user buffer from cProc cPid:0, buffer_size:4194304,7f197ec00000 Free user buffer from cProc cPid:0, buffer_size:4194304,7f197e80000 ```

Rank 1

stdout ``` Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '7' '-c' '24' '-l' './accl_log/fpga' '-p' '1' '-n' '1' Running ACCL test in coyote... Initializing MPI... Reading MPI rank and size values... Parsing options Hardware rdma mode count:24 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2 Getting MPI Processor name... [process 1] rank 1 size 2 alveo-u55c-07.inf.ethz.ch Testing ACCL base functionality... 10.253.74.80 10.253.74.92 Initializing QP connections... Exchanging QP... Local rank 1 receiving remote QP from remote rank 0 Local rank 1 sending local QP to remote rank 0 Queue Pair: id: 0 Local Queue: local: QPN 0x000001, PSN 0x6f7034, VADDR 00007f431ce00000, SIZE 00200000, IP 0x0afd4a5c, Remote Queue: remote: QPN 0x000002, PSN 0x2aec2a, VADDR 00007f1980200000, SIZE 00200000, IP 0x0afd4a50, rank: 1 FPGA IP: afd4a5c Rendezvous Protocol sw nop time [us]:73.61 hw nop time [ns]:940 Start gather test with root 0... Repetition 0 Pass accl barrier host measured durationUs:91.063 ACCL base functionality test completed successfully! -- STATISTICS - ID: 0 ----------------------------------------------- Read command FIFO used: 0 Write command FIFO used: 0 Host reads sent: 1 Host writes sent: 0 Card reads sent: 0 Card writes sent: 0 Sync reads sent: 5 Sync writes sent: 0 Page faults: 0 -- NET STATS QSFP0 RX pkgs: 48 TX pkgs: 5 ARP RX pkgs: 2 ARP TX pkgs: 2 ICMP RX pkgs: 0 ICMP TX pkgs: 0 TCP RX pkgs: 0 TCP TX pkgs: 0 ROCE RX pkgs: 3 ROCE TX pkgs: 3 IBV RX pkgs: 4 IBV TX pkgs: 6 PSN drop cnt: 0 Retrans cnt: 0 TCP session cnt: 0 STRM down: 0 Finalizing MPI... Done. Terminating... ```
stderr ``` XRT build version: 2.13.466 Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776 Build date: 2022-04-14 17:43:11 Git branch: 2022.1 PID: 286334 UID: 500207 [Wed May 29 21:24:18 2024 GMT] HOST: alveo-u55c-07.inf.ethz.ch EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote [XRT] ERROR: No devices found [XRT] ERROR: No devices found [XRT] ERROR: No devices found ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0 ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1 ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2 CCLO HWID: 4147289406 at 0x0 CCLO source commit (first 24b): f7329d CCLO Capabilities: Stack type: RDMA Internal DMA:True External DMA:False Reduction:True Compression:True Kernel Streams:True Debug:False Doing a soft reset Configuring Eager RX Buffers get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1 Allocation successful! Allocated buffer: 7f431c000000, Size: 64 calling offload: 7f431c000000, size: 64 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1 Allocation successful! Allocated buffer: 7f4317e00000, Size: 64 calling offload: 7f4317e00000, size: 64 Configuring Rendezvous Spare Buffers get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7f4317a00000, Size: 4194304 calling offload: 7f4317a00000, size: 4194304 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7f4317600000, Size: 4194304 calling offload: 7f4317600000, size: 4194304 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! 
page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7f4317200000, Size: 4194304 calling offload: 7f4317200000, size: 4194304 Configuring a communicator Configuring arithmetic Configuring collective tuning parameters CCLO configured Set timeout Set max eager size: 64 Set max rendezvous reduce size: 4194304 Accelerator ready! Communicator 0 (0x40): local rank: 1 number of ranks: 2 > rank 0 (ip 10.253.74.80:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 > rank 1 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 Rank 1 passed last barrier before test! CCLO address: 0 rx address: 4 Spare RX Buffer 0: address: 0x7f431c000000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 Spare RX Buffer 1: address: 0x7f4317e00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 CoyoteBuffer contructor called! page_size:2097152, buffer_size:96,n_pages:1 Allocation successful! Allocated buffer: 7f4317000000, Size: 96 Gather data from 1... 
Free user buffer from cProc cPid:0, buffer_size:96,7f4317000000 Communicator 0 (0x40): local rank: 1 number of ranks: 2 > rank 0 (ip 10.253.74.80:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 1 > rank 1 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 CCLO address: 0 rx address: 4 Spare RX Buffer 0: address: 0x7f431c000000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 Spare RX Buffer 1: address: 0x7f4317e00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 Removing CCLO object at 0 Doing a soft reset Free user buffer from cProc cPid:0, buffer_size:64,7f431c000000 Free user buffer from cProc cPid:0, buffer_size:64,7f4317e00000 Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317a00000 Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317600000 Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317200000 ```
quetric commented 3 months ago

I've merged the 196 dev branch into dev since the bugs solved there were quite severe. So I'll be handling this as a new bug on dev.

lawirz commented 2 months ago

I verified my statement about recompiling test.cpp.

When running on the driver and the bitstream generated by the dev branch and only recompiling test.cpp:

quetric commented 2 months ago

@lawirz can you confirm what count you were using for the above failed test?

lawirz commented 2 months ago

The default count of 16

Count of 24

quetric commented 2 months ago

I see you set max eager size to 64B so this 24-float (192B) gather executes with rendezvous. Can you please increase the max eager size to something larger than 192B, and rerun the test? Let me know if the problem persists.

lawirz commented 2 months ago

I now initialize using `accl.get()->initialize(ranks, mpi_rank, mpi_size, 64, 1024, options.seg_size);` and I still get the error. Maybe it's sheer chance, but I had to repeat the test 7 times to reproduce the error. Previously I typically hit it on the first run, so it might depend on other factors. I only ran the test on the "hls code compatibility with Vitis 2023+" commit around 5 times, so I'm not 100% sure it always works there. Should I run it a few more times there to make sure?

bo3z commented 2 months ago

Can you run each 10 times and report how many fail? Lucian and I had a look at the two versions you are pointing to, and nothing seems to have changed between them (I fixed the TCP session handler) that could cause this breakage.

lawirz commented 2 months ago

Results (1 means the test succeeded):

accl.get()->initialize(ranks, mpi_rank, mpi_size, 64, 1024, options.seg_size);

The result seems to depend on cluster utilization. I am currently almost alone on the cluster, and this is the first time I'm getting this many successes.

The behaviour I initially observed might just have been due to this effect.

The count was still 24.

I tried to avoid false negatives due to filesystem errors.

The script I used:

```
for i in {1..10}; do
    # Run the test non-interactively and discard its console output.
    echo "6 7" | ./run.sh &> /dev/null
    sleep 20
    # Count success lines in the rank 0 log, then flag a failure if one is logged.
    grep ".*ACCL base functionality test completed successfully.*" accl_log/rank_0_M_7_N_24_H_1_P_1_stdout | wc
    if grep -q ".*ERROR: ACCL base functionality test failed.*" accl_log/rank_0_M_7_N_24_H_1_P_1_stdout; then
        echo "ERROR found"
    fi
done
```

bo3z commented 2 months ago

Thanks for running these - I don't think this is a utilisation / congestion issue though. The networking stacks ACCL uses (RDMA from Coyote, TCP/IP from EasyNet) both have retransmission if I am not mistaken, so any congestion in the cluster should not cause these issues. Also, such issues are quite low level, so I would expect them to also have an impact on other collectives, not just gather. Could this be some race condition?