Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

Broadcast hangs on cyt_rdma #202

Open lawirz opened 1 month ago

lawirz commented 1 month ago

I have observed similar behaviour with other collectives, but so far I have only reproduced it with broadcast, so the title may be misleading. I will add comments about similar behaviour with other collectives here later.

Calling Broadcast with 4 MB (1048576 elements) hangs on the second rank.

Rank 0

stdout

```
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1'
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:93.336
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
host measured durationUs:252146
ACCL base functionality test completed successfully!
-- STATISTICS - ID: 0
-----------------------------------------------
Read command FIFO used: 0
Write command FIFO used: 0
Host reads sent: 1
Host writes sent: 0
Card reads sent: 0
Card writes sent: 0
Sync reads sent: 5
Sync writes sent: 0
Page faults: 0
-- NET STATS QSFP0
RX pkgs: 738
TX pkgs: 1030
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 654
ROCE TX pkgs: 1028
IBV RX pkgs: 646
IBV TX pkgs: 66566
PSN drop cnt: 0
Retrans cnt: 384
TCP session cnt: 0
STRM down: 0
```
stderr

```
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 92386
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fe00000, Size: 64
calling offload: 7fc95fe00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fc00000, Size: 64
calling offload: 7fc95fc00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f800000, Size: 4194304
calling offload: 7fc95f800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f400000, Size: 4194304
calling offload: 7fc95f400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f000000, Size: 4194304
calling offload: 7fc95f000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
Rank 0 passed last barrier before test!
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7fc95fe00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7fc95fc00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95ec00000, Size: 4194304
Broadcasting data from 0...
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95ec00000
Communicator 0 (0x40):
local rank: 0 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7fc95fe00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7fc95fc00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fe00000
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fc00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f800000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f400000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f000000
```
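
The rank 0 counters can be pulled out of the log text mechanically, which makes it easier to compare runs. The following is a minimal sketch, not part of ACCL or Coyote: `parse_counters` is a hypothetical helper, the excerpt is copied from the NET STATS block above, and treating `PSN drop cnt`, `Retrans cnt` and `STRM down` as the counters worth flagging is an assumption.

```python
import re

# Matches "name: integer" pairs such as "Retrans cnt: 384".
# Works whether the counters are space-separated or one per line.
COUNTER = re.compile(r"([A-Za-z][A-Za-z ]*?):\s*(\d+)")

def parse_counters(text):
    """Return a dict mapping counter name to its integer value."""
    return {name.strip(): int(value) for name, value in COUNTER.findall(text)}

# Excerpt copied from the rank 0 NET STATS output above.
net_stats = (
    "RX pkgs: 738 TX pkgs: 1030 ROCE RX pkgs: 654 ROCE TX pkgs: 1028 "
    "IBV RX pkgs: 646 IBV TX pkgs: 66566 PSN drop cnt: 0 Retrans cnt: 384 "
    "TCP session cnt: 0 STRM down: 0"
)

counters = parse_counters(net_stats)

# Assumption: these three counters are the ones that indicate trouble.
WATCHED = ("PSN drop cnt", "Retrans cnt", "STRM down")
nonzero = {name: counters[name] for name in WATCHED if counters.get(name, 0) > 0}

print(nonzero)  # {'Retrans cnt': 384}
```

In this run the only watched counter that is nonzero is the retransmission count. Whether that is related to the hang is not established by the log alone.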

Rank 1

stdout

```
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1'
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 1] rank 1 size 2 alveo-u55c-08.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 1 receiving remote QP from remote rank 0
Local rank 1 sending local QP to remote rank 0
Queue Pair: id: 0
Local Queue: local: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
Remote Queue: remote: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
rank: 1 FPGA IP: afd4a60
Rendezvous Protocol
sw nop time [us]:86.834
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
```
stderr

```
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 90744
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-08.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5e00000, Size: 64
calling offload: 7f2da5e00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5c00000, Size: 64
calling offload: 7f2da5c00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5800000, Size: 4194304
calling offload: 7f2da5800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5400000, Size: 4194304
calling offload: 7f2da5400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5000000, Size: 4194304
calling offload: 7f2da5000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 1 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
Rank 1 passed last barrier before test!
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7f2da5e00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7f2da5c00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da4c00000, Size: 4194304
Getting broadcast data from 0...
```
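
Putting the two stdout logs side by side shows where rank 1 stops: rank 0 prints its measured duration and finishes, while rank 1 never gets past the barrier message (its stderr ends at "Getting broadcast data from 0..."). A minimal sketch of that comparison follows. `normalize` and `first_missing_step` are hypothetical helpers written for this illustration, and the two excerpts are the tails of the stdout logs above.

```python
import re

# Any token containing a digit (decimal or hex-like) is rank-specific noise
# such as addresses, PIDs or timings, so it is replaced by '#'.
_VALUE = re.compile(r"\b[0-9a-fA-F]*\d[0-9a-fA-F]*\b")

def normalize(line):
    """Strip whitespace and mask numeric values so lines compare across ranks."""
    return _VALUE.sub("#", line.strip())

def first_missing_step(reference_log, stalled_log):
    """Return the first line of reference_log that never appears in stalled_log."""
    seen = {normalize(l) for l in stalled_log.splitlines() if l.strip()}
    for line in reference_log.splitlines():
        if line.strip() and normalize(line) not in seen:
            return line.strip()
    return None

rank0_tail = """Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
host measured durationUs:252146
ACCL base functionality test completed successfully!"""

rank1_tail = """Start bcast test with root 0 ...
Repetition 0
Pass accl barrier"""

print(first_missing_step(rank0_tail, rank1_tail))
# host measured durationUs:252146
```

The masking is deliberately crude. It is enough for progress messages like these, but lines that legitimately differ between ranks in their wording would be reported as missing.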

Running smaller Broadcast operations works, even above the rendezvous threshold. When I ran with 128 elements (which is above the threshold), I broke a machine, though (subsequent bitstream flashing failed), but this might just have been bad luck.
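
For reference, the sizes that appear in the logs can be related to each other with a little arithmetic. This is a sketch under stated assumptions, not ACCL code: the 4-byte element size is inferred from `count:1048576` producing `buffer_size:4194304`, the rule "rendezvous when the message exceeds the max eager size" is assumed from the logged `Set max eager size: 64`, and `classify` is a hypothetical helper.

```python
ELEMENT_BYTES = 4          # assumption: 1048576 elements -> 4194304 bytes in the log
MAX_EAGER_BYTES = 64       # from "Set max eager size: 64"
SEGMENT_BYTES = 4194304    # from "seg_size:4194304"
PAGE_BYTES = 2097152       # from "page_size:2097152"

def classify(count):
    """Describe a message of `count` elements in terms of the logged limits."""
    nbytes = count * ELEMENT_BYTES
    return {
        "bytes": nbytes,
        # Assumed rule: anything larger than the eager limit uses rendezvous.
        "protocol": "eager" if nbytes <= MAX_EAGER_BYTES else "rendezvous",
        # Ceiling division: number of 2 MiB pages needed for the buffer.
        "pages": -(-nbytes // PAGE_BYTES),
        "fills_segment": nbytes == SEGMENT_BYTES,
    }

print(classify(128))
# {'bytes': 512, 'protocol': 'rendezvous', 'pages': 1, 'fills_segment': False}
print(classify(1048576))
# {'bytes': 4194304, 'protocol': 'rendezvous', 'pages': 2, 'fills_segment': True}
```

Under these assumptions, both the 128-element run and the 1048576-element run are above the eager limit, and the hanging case is exactly one full segment spanning two pages. That is only an observation about the configuration, not a diagnosis.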

The other collective I experienced issues with is allreduce. I get hangs there too, but this might be completely unrelated.

Generally, the errors seem to occur at certain sizes or after a certain number of repetitions. It might just be a delay after which the machine hangs, as I got hangs in instances where there isn't even an ACCL collective running. This happened in conjunction with allreduce, and I have trouble reproducing it.

I'm running it on the 200-allreduce-hangs... branch, but I saw the same behaviour on the 196 merge commit. I'm fairly confident everything worked before the merge of the 196-fix, but I can try to verify it. I was certainly able to run almost all collectives on hardware at some point before I picked up the 196 issue merge.

Everything works in the simulator, in a variety of scenarios.

lawirz commented 1 month ago

I can confirm that I observe similar behaviour when running Allreduce in isolation. I tried to run Allreduce with a size of just 2. The first run succeeded. On the second run, the machine started hanging (I can't even reprogram it anymore).

lawirz commented 1 month ago

I can also confirm that the issues are not present on the commit before the 196 merge. Merge pull request

quetric commented 4 weeks ago

You linked to #194. Do you mean that, or the PR that closed issue #196?

lawirz commented 4 weeks ago

I mean to say they were probably introduced in the 196-fix. The commit right before it is the 194 merge (01f49d2), on which the issue is not present.

quetric commented 4 weeks ago

Can you attach your code here? This doesn't look like it's from any of our tests.

lawirz commented 4 weeks ago

It's the test/host/Coyote/runscripts/run.sh with

```
TEST_MODE=(5)
N_ELEMENTS=(1048576) # 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576
```

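
The commented-out sizes are the powers of two from 128 to 1048576, so a sweep over them with a per-run timeout is one way to narrow down the first size that hangs. The sketch below is generic and makes no claim about how `run.sh` accepts its parameters: `run_with_timeout` and `sweep` are hypothetical helpers, and `make_cmd` stands in for whatever command launches one test at a given size.

```python
import subprocess
import sys

# Powers of two from 128 to 1048576, matching the sizes in the run.sh comment.
N_ELEMENTS = [1 << k for k in range(7, 21)]

def run_with_timeout(cmd, timeout_s):
    """Run cmd once and return 'ok', 'failed' (non-zero exit) or 'hang' (timed out)."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return "hang"
    return "ok" if result.returncode == 0 else "failed"

def sweep(make_cmd, sizes, timeout_s):
    """Run each size in order and stop at the first hang."""
    results = {}
    for n in sizes:
        results[n] = run_with_timeout(make_cmd(n), timeout_s)
        if results[n] == "hang":
            break  # the board may need recovery before anything else can run
    return results
```

A stand-in command shows the intended use, for example `sweep(lambda n: [sys.executable, "-c", "pass"], N_ELEMENTS, 60)`. Note that the timeout only kills the host process. If the machine itself is wedged, as described above, it will not bring the FPGA back.
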
quetric commented 4 weeks ago

Does this same test work against the emulator?

lawirz commented 4 weeks ago

I didn't try the equivalent as an isolated test case. But the emulator works with the ProcessGroup at different sizes and repetitions, while in hardware it shows behaviour like this very quickly.