Open lawirz opened 5 months ago
Can confirm that I observe similar behaviour when running Allreduce in isolation. I tried to run Allreduce with a size of just 2. The first run succeeded. On the second run, the machine started hanging (I can't even reprogram it anymore).
I can also confirm that the issues are not present on the commit before the 196 merge. Merge pull request
You linked to #194; do you mean that, or the PR that closed issue #196?
I mean to say they were probably introduced in the 196 fix. The commit right before it is the 194 merge (01f49d2), on which the issue is not present.
Can you attach your code here? This doesn't look like it's from any of our tests.
It's the test/host/Coyote/runscripts/run.sh with
TEST_MODE=(5)
N_ELEMENTS=(1048576) # 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576
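For reference, a quick check of how that element count maps to buffer sizes. This is only a sketch: the 4-byte element size is an assumption (it is not stated in this thread), while the page and RX buffer sizes are copied from the log output posted further down.

```python
# Values copied from the run configuration and the log output in this thread.
N_ELEMENTS = 1_048_576   # N_ELEMENTS in run.sh, "count" in the log
PAGE_SIZE = 2_097_152    # "page_size" printed when a CoyoteBuffer is allocated
RXBUF_SIZE = 4_194_304   # "rxbuf_size" printed at startup

# Assumption: 4-byte elements (e.g. float32 or int32); not stated in the thread.
ELEMENT_BYTES = 4

payload_bytes = N_ELEMENTS * ELEMENT_BYTES
n_pages = -(-payload_bytes // PAGE_SIZE)  # ceiling division

print(payload_bytes, n_pages, payload_bytes == RXBUF_SIZE)
```

Under that assumption the payload is 4194304 bytes (4 MiB), which spans two 2 MiB pages and exactly fills one RX buffer.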
Does this same test work against the emulator?
I didn't try the equivalent as an isolated test case. But the emulator works with the ProcessGroup with different sizes and repetitions, while in hardware it shows behaviour like this very quickly.
I observed similar behaviour with other collectives, but thus far only reproduced it with broadcast, so the title may be misleading. I will add comments on similar behaviour with other collectives here later.
Calling Broadcast with 4MB hangs on the second rank.
Rank 0
stdout
```
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1'
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair:
id: 1
Local Queue:
local: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue:
remote: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:93.336
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
host measured durationUs:252146
ACCL base functionality test completed successfully!
-- STATISTICS - ID: 0
-----------------------------------------------
Read command FIFO used: 0
Write command FIFO used: 0
Host reads sent: 1
Host writes sent: 0
Card reads sent: 0
Card writes sent: 0
Sync reads sent: 5
Sync writes sent: 0
Page faults: 0
-- NET STATS QSFP0
RX pkgs: 738 TX pkgs: 1030
ARP RX pkgs: 2 ARP TX pkgs: 2
ICMP RX pkgs: 0 ICMP TX pkgs: 0
TCP RX pkgs: 0 TCP TX pkgs: 0
ROCE RX pkgs: 654 ROCE TX pkgs: 1028
IBV RX pkgs: 646 IBV TX pkgs: 66566
PSN drop cnt: 0
Retrans cnt: 384
TCP session cnt: 0
STRM down: 0
```
stderr
```
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 92386
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fe00000, Size: 64
calling offload: 7fc95fe00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fc00000, Size: 64
calling offload: 7fc95fc00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f800000, Size: 4194304
calling offload: 7fc95f800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f400000, Size: 4194304
calling offload: 7fc95f400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f000000, Size: 4194304
calling offload: 7fc95f000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40): local rank: 0 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
Rank 0 passed last barrier before test!
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7fc95fe00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7fc95fc00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95ec00000, Size: 4194304
Broadcasting data from 0...
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95ec00000
Communicator 0 (0x40): local rank: 0 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7fc95fe00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7fc95fc00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fe00000
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fc00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f800000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f400000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f000000
```
Rank 1
stdout
```
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1'
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 1] rank 1 size 2 alveo-u55c-08.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 1 receiving remote QP from remote rank 0
Local rank 1 sending local QP to remote rank 0
Queue Pair:
id: 0
Local Queue:
local: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
Remote Queue:
remote: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
rank: 1 FPGA IP: afd4a60
Rendezvous Protocol
sw nop time [us]:86.834
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
```
stderr
```
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 90744
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-08.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5e00000, Size: 64
calling offload: 7f2da5e00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5c00000, Size: 64
calling offload: 7f2da5c00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5800000, Size: 4194304
calling offload: 7f2da5800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5400000, Size: 4194304
calling offload: 7f2da5400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5000000, Size: 4194304
calling offload: 7f2da5000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40): local rank: 1 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
Rank 1 passed last barrier before test!
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7f2da5e00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7f2da5c00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da4c00000, Size: 4194304
Getting broadcast data from 0...
```
Running smaller Broadcast operations works, even when they are above the Rendezvous threshold. When I ran with 128 elements (which is above the threshold), I broke a machine, though (subsequent bitstream flashing failed), but this might just have been bad luck.
The other collective I experienced issues with is allreduce; there I get hangs too, but this might be completely unrelated.
Generally, the errors seem to occur at certain sizes or after a certain number of repetitions. It might just be a delay after which the machine hangs, as I got hangs in instances where there wasn't even an ACCL collective running. This happened in conjunction with allreduce, and I have trouble reproducing it.
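To narrow down which size or repetition first triggers a hang, a small timeout-based sweep might help. This is a rough sketch only: `run_with_timeout`, `sweep` and `make_cmd` are hypothetical helper names, and the stand-in command just sleeps so the example runs anywhere. The real invocation (via run.sh and MPI) would have to be substituted, and the sketch cannot help if the whole machine locks up.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return 'ok', 'fail' (non-zero exit) or 'hang' (timed out)."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return "hang"
    return "ok" if result.returncode == 0 else "fail"


def sweep(make_cmd, sizes, repetitions, timeout_s):
    """Try each (size, repetition) pair; stop at the first result that is not 'ok'."""
    for size in sizes:
        for rep in range(repetitions):
            status = run_with_timeout(make_cmd(size), timeout_s)
            if status != "ok":
                return size, rep, status
    return None


if __name__ == "__main__":
    # Stand-in for the real test command: pretends that sizes >= 1024 hang.
    def make_cmd(size):
        delay = 5 if size >= 1024 else 0
        return [sys.executable, "-c", f"import time; time.sleep({delay})"]

    print(sweep(make_cmd, sizes=[128, 256, 1024], repetitions=2, timeout_s=2))
```

Note that killing an MPI launcher on timeout may leave remote ranks running, so a real harness would likely need extra cleanup between runs.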
I'm running it on the 200-allreduce-hangs... branch, but I had the same behaviour on the 196 merge commit. I'm fairly confident everything worked before the merge of the 196 fix, but I can try to verify it. I certainly was able to run almost all collectives on HW at some point before I moved to the 196 issue merge.
Everything works in the simulator, in a variety of scenarios.