Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

Reduce/Allreduce issues on cyt_rdma #196

Closed: lawirz closed this issue 5 months ago

lawirz commented 6 months ago

I'm getting errors on repeated runs of Reduce and Allreduce:

...
Pass accl barrier
host measured durationUs:91.724
2th item is incorrect! (1.000000 != 2.000000)
3th item is incorrect! (2.000000 != 4.000000)
4th item is incorrect! (3.000000 != 6.000000)
5th item is incorrect! (4.000000 != 8.000000)
...

The first run succeeds.
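
In the excerpt above, every failing item holds exactly half of the expected value, which is what one would see if the result buffer contained only the local contribution. The following is a minimal sketch of that reading, not the actual test code. It assumes each rank fills its input with 0, 1, ..., count-1 and that the reduction is a sum over two ranks (both inferred from the log output, not confirmed against the test source); all function names are hypothetical.

```python
# Hypothetical model of the result check; names are illustrative, not ACCL's API.
COUNT = 512      # from the '-c' '512' argument in the logs
NUM_RANKS = 2    # from "rank 0 size 2" in the logs


def expected_allreduce_sum(count, num_ranks):
    # Assumption: every rank contributes [0, 1, ..., count-1] and the op is a sum.
    return [float(i * num_ranks) for i in range(count)]


def count_mismatches(result, expected):
    # Count positions where the observed value differs from the expected one.
    return sum(1 for r, e in zip(result, expected) if r != e)


local_input = [float(i) for i in range(COUNT)]
expected = expected_allreduce_sum(COUNT, NUM_RANKS)

# Failure mode suggested by the log: the result equals the local input,
# i.e. the remote rank's contribution was never added.
errors = count_mismatches(local_input, expected)
print(errors)
```

Under these assumptions the first element (0 == 0) passes trivially, so the sketch counts 511 mismatches for 512 elements, matching the "511 errors!" summary in the full second-run log.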

I'm using a slightly modified version of the script https://github.com/Xilinx/ACCL/blob/dev/test/host/Coyote/run_scripts/run.sh on commit https://github.com/Xilinx/ACCL/commit/a0ba7ea3040b026359d5f790acb7f83b67e29645.

Output of the first run (allreduce):

stdout
```
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '10' '-c' '512' '-l' './accl_log/fpga' '-p' '1' '-n' '1'
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:512 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000002, PSN 0x13da2b, VADDR 00007fe623800000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000001, PSN 0x46045d, VADDR 00007f4951e00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:106.15
hw nop time [ns]:940
Start allreduce test and reduce function 0...
Repetition 0
Pass accl barrier
host measured durationUs:410.125
Test is successful!
ACCL base functionality test completed successfully!
-- STATISTICS - ID: 0 -----------------------------------------------
Read command FIFO used: 0
Write command FIFO used: 0
Host reads sent: 96
Host writes sent: 64
Card reads sent: 64
Card writes sent: 64
Sync reads sent: 5
Sync writes sent: 0
Page faults: 0
-- NET STATS QSFP0
RX pkgs: 320
TX pkgs: 132
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 152
ROCE TX pkgs: 130
IBV RX pkgs: 195
IBV TX pkgs: 195
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0
Finalizing MPI...
Done. Terminating...
```
stderr
```
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 21032
UID: 500207
[Tue May 14 12:09:44 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 2696576574 at 0x0
CCLO source commit (first 24b): a0ba7e
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe622c00000, Size: 64
calling offload: 7fe622c00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe622a00000, Size: 64
calling offload: 7fe622a00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe622600000, Size: 4194304
calling offload: 7fe622600000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe622200000, Size: 4194304
calling offload: 7fe622200000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe621e00000, Size: 4194304
calling offload: 7fe621e00000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40): local rank: 0 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
Rank 0 passed last barrier before test!
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7fe622c00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7fe622a00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe621c00000, Size: 2048
CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe621a00000, Size: 2048
Reducing data...
Free user buffer from cProc cPid:0, buffer_size:2048,7fe621c00000
Free user buffer from cProc cPid:0, buffer_size:2048,7fe621a00000
Communicator 0 (0x40): local rank: 0 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 64, -> outbound seq number 64
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7fe622c00000 status: ENQUEUED occupancy: 32/64 MPI tag: ffffffff seq: 62 src: 1
Spare RX Buffer 1: address: 0x7fe622a00000 status: ENQUEUED occupancy: 32/64 MPI tag: ffffffff seq: 63 src: 1
Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fe622c00000
Free user buffer from cProc cPid:0, buffer_size:64,7fe622a00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe622600000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe622200000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe621e00000
```
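
As a sanity check on the allocation lines in the stderr log, the logged `n_pages` values are consistent with a ceiling division of `buffer_size` by `page_size`, and the 2048-byte user buffers match 512 elements of 4 bytes each. The sketch below only reproduces that arithmetic. The ceiling-division rule and the 4-byte element size are assumptions inferred from the logged numbers, not taken from the Coyote or ACCL source, and `pages_needed` is a hypothetical helper.

```python
PAGE_SIZE = 2097152  # page_size reported in the log (2 MiB)


def pages_needed(buffer_size, page_size=PAGE_SIZE):
    # Assumption: pages are allocated by ceiling division of the buffer size.
    return -(-buffer_size // page_size)


# (buffer_size, n_pages) pairs exactly as they appear in the stderr log.
logged = [(64, 1), (4194304, 2), (2048, 1)]
for buffer_size, n_pages in logged:
    assert pages_needed(buffer_size) == n_pages

# User buffers: 2048 bytes for count=512 elements implies 4 bytes per element,
# which would fit 32-bit floats (assumption; the log does not name the type).
bytes_per_element = 2048 // 512
print(bytes_per_element)
```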

Second run:

stdout
```
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '10' '-c' '512' '-l' './accl_log/fpga' '-p' '1' '-n' '1'
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:512 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000001, PSN 0x709eba, VADDR 00007fe2c6c00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000002, PSN 0xd26325, VADDR 00007f9402e00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:87.215
hw nop time [ns]:940
Start allreduce test and reduce function 0...
Repetition 0
Pass accl barrier
host measured durationUs:79.17
2th item is incorrect! (1.000000 != 2.000000)
3th item is incorrect! (2.000000 != 4.000000)
4th item is incorrect! (3.000000 != 6.000000)
5th item is incorrect! (4.000000 != 8.000000)
6th item is incorrect! (5.000000 != 10.000000)
7th item is incorrect! (6.000000 != 12.000000)
8th item is incorrect! (7.000000 != 14.000000)
9th item is incorrect! (8.000000 != 16.000000)
10th item is incorrect! (9.000000 != 18.000000)
11th item is incorrect! (10.000000 != 20.000000)
12th item is incorrect! (11.000000 != 22.000000)
13th item is incorrect! (12.000000 != 24.000000)
14th item is incorrect! (13.000000 != 26.000000)
15th item is incorrect! (14.000000 != 28.000000)
16th item is incorrect! (15.000000 != 30.000000)
17th item is incorrect! (16.000000 != 32.000000)
18th item is incorrect! (17.000000 != 34.000000)
19th item is incorrect! (18.000000 != 36.000000)
20th item is incorrect! (19.000000 != 38.000000)
21th item is incorrect!
(20.000000 != 40.000000) 22th item is incorrect! (21.000000 != 42.000000) 23th item is incorrect! (22.000000 != 44.000000) 24th item is incorrect! (23.000000 != 46.000000) 25th item is incorrect! (24.000000 != 48.000000) 26th item is incorrect! (25.000000 != 50.000000) 27th item is incorrect! (26.000000 != 52.000000) 28th item is incorrect! (27.000000 != 54.000000) 29th item is incorrect! (28.000000 != 56.000000) 30th item is incorrect! (29.000000 != 58.000000) 31th item is incorrect! (30.000000 != 60.000000) 32th item is incorrect! (31.000000 != 62.000000) 33th item is incorrect! (32.000000 != 64.000000) 34th item is incorrect! (33.000000 != 66.000000) 35th item is incorrect! (34.000000 != 68.000000) 36th item is incorrect! (35.000000 != 70.000000) 37th item is incorrect! (36.000000 != 72.000000) 38th item is incorrect! (37.000000 != 74.000000) 39th item is incorrect! (38.000000 != 76.000000) 40th item is incorrect! (39.000000 != 78.000000) 41th item is incorrect! (40.000000 != 80.000000) 42th item is incorrect! (41.000000 != 82.000000) 43th item is incorrect! (42.000000 != 84.000000) 44th item is incorrect! (43.000000 != 86.000000) 45th item is incorrect! (44.000000 != 88.000000) 46th item is incorrect! (45.000000 != 90.000000) 47th item is incorrect! (46.000000 != 92.000000) 48th item is incorrect! (47.000000 != 94.000000) 49th item is incorrect! (48.000000 != 96.000000) 50th item is incorrect! (49.000000 != 98.000000) 51th item is incorrect! (50.000000 != 100.000000) 52th item is incorrect! (51.000000 != 102.000000) 53th item is incorrect! (52.000000 != 104.000000) 54th item is incorrect! (53.000000 != 106.000000) 55th item is incorrect! (54.000000 != 108.000000) 56th item is incorrect! (55.000000 != 110.000000) 57th item is incorrect! (56.000000 != 112.000000) 58th item is incorrect! (57.000000 != 114.000000) 59th item is incorrect! (58.000000 != 116.000000) 60th item is incorrect! (59.000000 != 118.000000) 61th item is incorrect! 
(60.000000 != 120.000000) 62th item is incorrect! (61.000000 != 122.000000) 63th item is incorrect! (62.000000 != 124.000000) 64th item is incorrect! (63.000000 != 126.000000) 65th item is incorrect! (64.000000 != 128.000000) 66th item is incorrect! (65.000000 != 130.000000) 67th item is incorrect! (66.000000 != 132.000000) 68th item is incorrect! (67.000000 != 134.000000) 69th item is incorrect! (68.000000 != 136.000000) 70th item is incorrect! (69.000000 != 138.000000) 71th item is incorrect! (70.000000 != 140.000000) 72th item is incorrect! (71.000000 != 142.000000) 73th item is incorrect! (72.000000 != 144.000000) 74th item is incorrect! (73.000000 != 146.000000) 75th item is incorrect! (74.000000 != 148.000000) 76th item is incorrect! (75.000000 != 150.000000) 77th item is incorrect! (76.000000 != 152.000000) 78th item is incorrect! (77.000000 != 154.000000) 79th item is incorrect! (78.000000 != 156.000000) 80th item is incorrect! (79.000000 != 158.000000) 81th item is incorrect! (80.000000 != 160.000000) 82th item is incorrect! (81.000000 != 162.000000) 83th item is incorrect! (82.000000 != 164.000000) 84th item is incorrect! (83.000000 != 166.000000) 85th item is incorrect! (84.000000 != 168.000000) 86th item is incorrect! (85.000000 != 170.000000) 87th item is incorrect! (86.000000 != 172.000000) 88th item is incorrect! (87.000000 != 174.000000) 89th item is incorrect! (88.000000 != 176.000000) 90th item is incorrect! (89.000000 != 178.000000) 91th item is incorrect! (90.000000 != 180.000000) 92th item is incorrect! (91.000000 != 182.000000) 93th item is incorrect! (92.000000 != 184.000000) 94th item is incorrect! (93.000000 != 186.000000) 95th item is incorrect! (94.000000 != 188.000000) 96th item is incorrect! (95.000000 != 190.000000) 97th item is incorrect! (96.000000 != 192.000000) 98th item is incorrect! (97.000000 != 194.000000) 99th item is incorrect! (98.000000 != 196.000000) 100th item is incorrect! 
(99.000000 != 198.000000) 101th item is incorrect! (100.000000 != 200.000000) 102th item is incorrect! (101.000000 != 202.000000) 103th item is incorrect! (102.000000 != 204.000000) 104th item is incorrect! (103.000000 != 206.000000) 105th item is incorrect! (104.000000 != 208.000000) 106th item is incorrect! (105.000000 != 210.000000) 107th item is incorrect! (106.000000 != 212.000000) 108th item is incorrect! (107.000000 != 214.000000) 109th item is incorrect! (108.000000 != 216.000000) 110th item is incorrect! (109.000000 != 218.000000) 111th item is incorrect! (110.000000 != 220.000000) 112th item is incorrect! (111.000000 != 222.000000) 113th item is incorrect! (112.000000 != 224.000000) 114th item is incorrect! (113.000000 != 226.000000) 115th item is incorrect! (114.000000 != 228.000000) 116th item is incorrect! (115.000000 != 230.000000) 117th item is incorrect! (116.000000 != 232.000000) 118th item is incorrect! (117.000000 != 234.000000) 119th item is incorrect! (118.000000 != 236.000000) 120th item is incorrect! (119.000000 != 238.000000) 121th item is incorrect! (120.000000 != 240.000000) 122th item is incorrect! (121.000000 != 242.000000) 123th item is incorrect! (122.000000 != 244.000000) 124th item is incorrect! (123.000000 != 246.000000) 125th item is incorrect! (124.000000 != 248.000000) 126th item is incorrect! (125.000000 != 250.000000) 127th item is incorrect! (126.000000 != 252.000000) 128th item is incorrect! (127.000000 != 254.000000) 129th item is incorrect! (128.000000 != 256.000000) 130th item is incorrect! (129.000000 != 258.000000) 131th item is incorrect! (130.000000 != 260.000000) 132th item is incorrect! (131.000000 != 262.000000) 133th item is incorrect! (132.000000 != 264.000000) 134th item is incorrect! (133.000000 != 266.000000) 135th item is incorrect! (134.000000 != 268.000000) 136th item is incorrect! (135.000000 != 270.000000) 137th item is incorrect! (136.000000 != 272.000000) 138th item is incorrect! 
(137.000000 != 274.000000) 139th item is incorrect! (138.000000 != 276.000000) 140th item is incorrect! (139.000000 != 278.000000) 141th item is incorrect! (140.000000 != 280.000000) 142th item is incorrect! (141.000000 != 282.000000) 143th item is incorrect! (142.000000 != 284.000000) 144th item is incorrect! (143.000000 != 286.000000) 145th item is incorrect! (144.000000 != 288.000000) 146th item is incorrect! (145.000000 != 290.000000) 147th item is incorrect! (146.000000 != 292.000000) 148th item is incorrect! (147.000000 != 294.000000) 149th item is incorrect! (148.000000 != 296.000000) 150th item is incorrect! (149.000000 != 298.000000) 151th item is incorrect! (150.000000 != 300.000000) 152th item is incorrect! (151.000000 != 302.000000) 153th item is incorrect! (152.000000 != 304.000000) 154th item is incorrect! (153.000000 != 306.000000) 155th item is incorrect! (154.000000 != 308.000000) 156th item is incorrect! (155.000000 != 310.000000) 157th item is incorrect! (156.000000 != 312.000000) 158th item is incorrect! (157.000000 != 314.000000) 159th item is incorrect! (158.000000 != 316.000000) 160th item is incorrect! (159.000000 != 318.000000) 161th item is incorrect! (160.000000 != 320.000000) 162th item is incorrect! (161.000000 != 322.000000) 163th item is incorrect! (162.000000 != 324.000000) 164th item is incorrect! (163.000000 != 326.000000) 165th item is incorrect! (164.000000 != 328.000000) 166th item is incorrect! (165.000000 != 330.000000) 167th item is incorrect! (166.000000 != 332.000000) 168th item is incorrect! (167.000000 != 334.000000) 169th item is incorrect! (168.000000 != 336.000000) 170th item is incorrect! (169.000000 != 338.000000) 171th item is incorrect! (170.000000 != 340.000000) 172th item is incorrect! (171.000000 != 342.000000) 173th item is incorrect! (172.000000 != 344.000000) 174th item is incorrect! (173.000000 != 346.000000) 175th item is incorrect! (174.000000 != 348.000000) 176th item is incorrect! 
(175.000000 != 350.000000) 177th item is incorrect! (176.000000 != 352.000000) 178th item is incorrect! (177.000000 != 354.000000) 179th item is incorrect! (178.000000 != 356.000000) 180th item is incorrect! (179.000000 != 358.000000) 181th item is incorrect! (180.000000 != 360.000000) 182th item is incorrect! (181.000000 != 362.000000) 183th item is incorrect! (182.000000 != 364.000000) 184th item is incorrect! (183.000000 != 366.000000) 185th item is incorrect! (184.000000 != 368.000000) 186th item is incorrect! (185.000000 != 370.000000) 187th item is incorrect! (186.000000 != 372.000000) 188th item is incorrect! (187.000000 != 374.000000) 189th item is incorrect! (188.000000 != 376.000000) 190th item is incorrect! (189.000000 != 378.000000) 191th item is incorrect! (190.000000 != 380.000000) 192th item is incorrect! (191.000000 != 382.000000) 193th item is incorrect! (192.000000 != 384.000000) 194th item is incorrect! (193.000000 != 386.000000) 195th item is incorrect! (194.000000 != 388.000000) 196th item is incorrect! (195.000000 != 390.000000) 197th item is incorrect! (196.000000 != 392.000000) 198th item is incorrect! (197.000000 != 394.000000) 199th item is incorrect! (198.000000 != 396.000000) 200th item is incorrect! (199.000000 != 398.000000) 201th item is incorrect! (200.000000 != 400.000000) 202th item is incorrect! (201.000000 != 402.000000) 203th item is incorrect! (202.000000 != 404.000000) 204th item is incorrect! (203.000000 != 406.000000) 205th item is incorrect! (204.000000 != 408.000000) 206th item is incorrect! (205.000000 != 410.000000) 207th item is incorrect! (206.000000 != 412.000000) 208th item is incorrect! (207.000000 != 414.000000) 209th item is incorrect! (208.000000 != 416.000000) 210th item is incorrect! (209.000000 != 418.000000) 211th item is incorrect! (210.000000 != 420.000000) 212th item is incorrect! (211.000000 != 422.000000) 213th item is incorrect! (212.000000 != 424.000000) 214th item is incorrect! 
(213.000000 != 426.000000) 215th item is incorrect! (214.000000 != 428.000000) 216th item is incorrect! (215.000000 != 430.000000) 217th item is incorrect! (216.000000 != 432.000000) 218th item is incorrect! (217.000000 != 434.000000) 219th item is incorrect! (218.000000 != 436.000000) 220th item is incorrect! (219.000000 != 438.000000) 221th item is incorrect! (220.000000 != 440.000000) 222th item is incorrect! (221.000000 != 442.000000) 223th item is incorrect! (222.000000 != 444.000000) 224th item is incorrect! (223.000000 != 446.000000) 225th item is incorrect! (224.000000 != 448.000000) 226th item is incorrect! (225.000000 != 450.000000) 227th item is incorrect! (226.000000 != 452.000000) 228th item is incorrect! (227.000000 != 454.000000) 229th item is incorrect! (228.000000 != 456.000000) 230th item is incorrect! (229.000000 != 458.000000) 231th item is incorrect! (230.000000 != 460.000000) 232th item is incorrect! (231.000000 != 462.000000) 233th item is incorrect! (232.000000 != 464.000000) 234th item is incorrect! (233.000000 != 466.000000) 235th item is incorrect! (234.000000 != 468.000000) 236th item is incorrect! (235.000000 != 470.000000) 237th item is incorrect! (236.000000 != 472.000000) 238th item is incorrect! (237.000000 != 474.000000) 239th item is incorrect! (238.000000 != 476.000000) 240th item is incorrect! (239.000000 != 478.000000) 241th item is incorrect! (240.000000 != 480.000000) 242th item is incorrect! (241.000000 != 482.000000) 243th item is incorrect! (242.000000 != 484.000000) 244th item is incorrect! (243.000000 != 486.000000) 245th item is incorrect! (244.000000 != 488.000000) 246th item is incorrect! (245.000000 != 490.000000) 247th item is incorrect! (246.000000 != 492.000000) 248th item is incorrect! (247.000000 != 494.000000) 249th item is incorrect! (248.000000 != 496.000000) 250th item is incorrect! (249.000000 != 498.000000) 251th item is incorrect! (250.000000 != 500.000000) 252th item is incorrect! 
(251.000000 != 502.000000) 253th item is incorrect! (252.000000 != 504.000000) 254th item is incorrect! (253.000000 != 506.000000) 255th item is incorrect! (254.000000 != 508.000000) 256th item is incorrect! (255.000000 != 510.000000) 257th item is incorrect! (256.000000 != 512.000000) 258th item is incorrect! (257.000000 != 514.000000) 259th item is incorrect! (258.000000 != 516.000000) 260th item is incorrect! (259.000000 != 518.000000) 261th item is incorrect! (260.000000 != 520.000000) 262th item is incorrect! (261.000000 != 522.000000) 263th item is incorrect! (262.000000 != 524.000000) 264th item is incorrect! (263.000000 != 526.000000) 265th item is incorrect! (264.000000 != 528.000000) 266th item is incorrect! (265.000000 != 530.000000) 267th item is incorrect! (266.000000 != 532.000000) 268th item is incorrect! (267.000000 != 534.000000) 269th item is incorrect! (268.000000 != 536.000000) 270th item is incorrect! (269.000000 != 538.000000) 271th item is incorrect! (270.000000 != 540.000000) 272th item is incorrect! (271.000000 != 542.000000) 273th item is incorrect! (272.000000 != 544.000000) 274th item is incorrect! (273.000000 != 546.000000) 275th item is incorrect! (274.000000 != 548.000000) 276th item is incorrect! (275.000000 != 550.000000) 277th item is incorrect! (276.000000 != 552.000000) 278th item is incorrect! (277.000000 != 554.000000) 279th item is incorrect! (278.000000 != 556.000000) 280th item is incorrect! (279.000000 != 558.000000) 281th item is incorrect! (280.000000 != 560.000000) 282th item is incorrect! (281.000000 != 562.000000) 283th item is incorrect! (282.000000 != 564.000000) 284th item is incorrect! (283.000000 != 566.000000) 285th item is incorrect! (284.000000 != 568.000000) 286th item is incorrect! (285.000000 != 570.000000) 287th item is incorrect! (286.000000 != 572.000000) 288th item is incorrect! (287.000000 != 574.000000) 289th item is incorrect! (288.000000 != 576.000000) 290th item is incorrect! 
(289.000000 != 578.000000) 291th item is incorrect! (290.000000 != 580.000000) 292th item is incorrect! (291.000000 != 582.000000) 293th item is incorrect! (292.000000 != 584.000000) 294th item is incorrect! (293.000000 != 586.000000) 295th item is incorrect! (294.000000 != 588.000000) 296th item is incorrect! (295.000000 != 590.000000) 297th item is incorrect! (296.000000 != 592.000000) 298th item is incorrect! (297.000000 != 594.000000) 299th item is incorrect! (298.000000 != 596.000000) 300th item is incorrect! (299.000000 != 598.000000) 301th item is incorrect! (300.000000 != 600.000000) 302th item is incorrect! (301.000000 != 602.000000) 303th item is incorrect! (302.000000 != 604.000000) 304th item is incorrect! (303.000000 != 606.000000) 305th item is incorrect! (304.000000 != 608.000000) 306th item is incorrect! (305.000000 != 610.000000) 307th item is incorrect! (306.000000 != 612.000000) 308th item is incorrect! (307.000000 != 614.000000) 309th item is incorrect! (308.000000 != 616.000000) 310th item is incorrect! (309.000000 != 618.000000) 311th item is incorrect! (310.000000 != 620.000000) 312th item is incorrect! (311.000000 != 622.000000) 313th item is incorrect! (312.000000 != 624.000000) 314th item is incorrect! (313.000000 != 626.000000) 315th item is incorrect! (314.000000 != 628.000000) 316th item is incorrect! (315.000000 != 630.000000) 317th item is incorrect! (316.000000 != 632.000000) 318th item is incorrect! (317.000000 != 634.000000) 319th item is incorrect! (318.000000 != 636.000000) 320th item is incorrect! (319.000000 != 638.000000) 321th item is incorrect! (320.000000 != 640.000000) 322th item is incorrect! (321.000000 != 642.000000) 323th item is incorrect! (322.000000 != 644.000000) 324th item is incorrect! (323.000000 != 646.000000) 325th item is incorrect! (324.000000 != 648.000000) 326th item is incorrect! (325.000000 != 650.000000) 327th item is incorrect! (326.000000 != 652.000000) 328th item is incorrect! 
(327.000000 != 654.000000) 329th item is incorrect! (328.000000 != 656.000000) 330th item is incorrect! (329.000000 != 658.000000) 331th item is incorrect! (330.000000 != 660.000000) 332th item is incorrect! (331.000000 != 662.000000) 333th item is incorrect! (332.000000 != 664.000000) 334th item is incorrect! (333.000000 != 666.000000) 335th item is incorrect! (334.000000 != 668.000000) 336th item is incorrect! (335.000000 != 670.000000) 337th item is incorrect! (336.000000 != 672.000000) 338th item is incorrect! (337.000000 != 674.000000) 339th item is incorrect! (338.000000 != 676.000000) 340th item is incorrect! (339.000000 != 678.000000) 341th item is incorrect! (340.000000 != 680.000000) 342th item is incorrect! (341.000000 != 682.000000) 343th item is incorrect! (342.000000 != 684.000000) 344th item is incorrect! (343.000000 != 686.000000) 345th item is incorrect! (344.000000 != 688.000000) 346th item is incorrect! (345.000000 != 690.000000) 347th item is incorrect! (346.000000 != 692.000000) 348th item is incorrect! (347.000000 != 694.000000) 349th item is incorrect! (348.000000 != 696.000000) 350th item is incorrect! (349.000000 != 698.000000) 351th item is incorrect! (350.000000 != 700.000000) 352th item is incorrect! (351.000000 != 702.000000) 353th item is incorrect! (352.000000 != 704.000000) 354th item is incorrect! (353.000000 != 706.000000) 355th item is incorrect! (354.000000 != 708.000000) 356th item is incorrect! (355.000000 != 710.000000) 357th item is incorrect! (356.000000 != 712.000000) 358th item is incorrect! (357.000000 != 714.000000) 359th item is incorrect! (358.000000 != 716.000000) 360th item is incorrect! (359.000000 != 718.000000) 361th item is incorrect! (360.000000 != 720.000000) 362th item is incorrect! (361.000000 != 722.000000) 363th item is incorrect! (362.000000 != 724.000000) 364th item is incorrect! (363.000000 != 726.000000) 365th item is incorrect! (364.000000 != 728.000000) 366th item is incorrect! 
(365.000000 != 730.000000) 367th item is incorrect! (366.000000 != 732.000000) 368th item is incorrect! (367.000000 != 734.000000) 369th item is incorrect! (368.000000 != 736.000000) 370th item is incorrect! (369.000000 != 738.000000) 371th item is incorrect! (370.000000 != 740.000000) 372th item is incorrect! (371.000000 != 742.000000) 373th item is incorrect! (372.000000 != 744.000000) 374th item is incorrect! (373.000000 != 746.000000) 375th item is incorrect! (374.000000 != 748.000000) 376th item is incorrect! (375.000000 != 750.000000) 377th item is incorrect! (376.000000 != 752.000000) 378th item is incorrect! (377.000000 != 754.000000) 379th item is incorrect! (378.000000 != 756.000000) 380th item is incorrect! (379.000000 != 758.000000) 381th item is incorrect! (380.000000 != 760.000000) 382th item is incorrect! (381.000000 != 762.000000) 383th item is incorrect! (382.000000 != 764.000000) 384th item is incorrect! (383.000000 != 766.000000) 385th item is incorrect! (384.000000 != 768.000000) 386th item is incorrect! (385.000000 != 770.000000) 387th item is incorrect! (386.000000 != 772.000000) 388th item is incorrect! (387.000000 != 774.000000) 389th item is incorrect! (388.000000 != 776.000000) 390th item is incorrect! (389.000000 != 778.000000) 391th item is incorrect! (390.000000 != 780.000000) 392th item is incorrect! (391.000000 != 782.000000) 393th item is incorrect! (392.000000 != 784.000000) 394th item is incorrect! (393.000000 != 786.000000) 395th item is incorrect! (394.000000 != 788.000000) 396th item is incorrect! (395.000000 != 790.000000) 397th item is incorrect! (396.000000 != 792.000000) 398th item is incorrect! (397.000000 != 794.000000) 399th item is incorrect! (398.000000 != 796.000000) 400th item is incorrect! (399.000000 != 798.000000) 401th item is incorrect! (400.000000 != 800.000000) 402th item is incorrect! (401.000000 != 802.000000) 403th item is incorrect! (402.000000 != 804.000000) 404th item is incorrect! 
(403.000000 != 806.000000) 405th item is incorrect! (404.000000 != 808.000000) 406th item is incorrect! (405.000000 != 810.000000) 407th item is incorrect! (406.000000 != 812.000000) 408th item is incorrect! (407.000000 != 814.000000) 409th item is incorrect! (408.000000 != 816.000000) 410th item is incorrect! (409.000000 != 818.000000) 411th item is incorrect! (410.000000 != 820.000000) 412th item is incorrect! (411.000000 != 822.000000) 413th item is incorrect! (412.000000 != 824.000000) 414th item is incorrect! (413.000000 != 826.000000) 415th item is incorrect! (414.000000 != 828.000000) 416th item is incorrect! (415.000000 != 830.000000) 417th item is incorrect! (416.000000 != 832.000000) 418th item is incorrect! (417.000000 != 834.000000) 419th item is incorrect! (418.000000 != 836.000000) 420th item is incorrect! (419.000000 != 838.000000) 421th item is incorrect! (420.000000 != 840.000000) 422th item is incorrect! (421.000000 != 842.000000) 423th item is incorrect! (422.000000 != 844.000000) 424th item is incorrect! (423.000000 != 846.000000) 425th item is incorrect! (424.000000 != 848.000000) 426th item is incorrect! (425.000000 != 850.000000) 427th item is incorrect! (426.000000 != 852.000000) 428th item is incorrect! (427.000000 != 854.000000) 429th item is incorrect! (428.000000 != 856.000000) 430th item is incorrect! (429.000000 != 858.000000) 431th item is incorrect! (430.000000 != 860.000000) 432th item is incorrect! (431.000000 != 862.000000) 433th item is incorrect! (432.000000 != 864.000000) 434th item is incorrect! (433.000000 != 866.000000) 435th item is incorrect! (434.000000 != 868.000000) 436th item is incorrect! (435.000000 != 870.000000) 437th item is incorrect! (436.000000 != 872.000000) 438th item is incorrect! (437.000000 != 874.000000) 439th item is incorrect! (438.000000 != 876.000000) 440th item is incorrect! (439.000000 != 878.000000) 441th item is incorrect! (440.000000 != 880.000000) 442th item is incorrect! 
(441.000000 != 882.000000) 443th item is incorrect! (442.000000 != 884.000000) 444th item is incorrect! (443.000000 != 886.000000) 445th item is incorrect! (444.000000 != 888.000000) 446th item is incorrect! (445.000000 != 890.000000) 447th item is incorrect! (446.000000 != 892.000000) 448th item is incorrect! (447.000000 != 894.000000) 449th item is incorrect! (448.000000 != 896.000000) 450th item is incorrect! (449.000000 != 898.000000) 451th item is incorrect! (450.000000 != 900.000000) 452th item is incorrect! (451.000000 != 902.000000) 453th item is incorrect! (452.000000 != 904.000000) 454th item is incorrect! (453.000000 != 906.000000) 455th item is incorrect! (454.000000 != 908.000000) 456th item is incorrect! (455.000000 != 910.000000) 457th item is incorrect! (456.000000 != 912.000000) 458th item is incorrect! (457.000000 != 914.000000) 459th item is incorrect! (458.000000 != 916.000000) 460th item is incorrect! (459.000000 != 918.000000) 461th item is incorrect! (460.000000 != 920.000000) 462th item is incorrect! (461.000000 != 922.000000) 463th item is incorrect! (462.000000 != 924.000000) 464th item is incorrect! (463.000000 != 926.000000) 465th item is incorrect! (464.000000 != 928.000000) 466th item is incorrect! (465.000000 != 930.000000) 467th item is incorrect! (466.000000 != 932.000000) 468th item is incorrect! (467.000000 != 934.000000) 469th item is incorrect! (468.000000 != 936.000000) 470th item is incorrect! (469.000000 != 938.000000) 471th item is incorrect! (470.000000 != 940.000000) 472th item is incorrect! (471.000000 != 942.000000) 473th item is incorrect! (472.000000 != 944.000000) 474th item is incorrect! (473.000000 != 946.000000) 475th item is incorrect! (474.000000 != 948.000000) 476th item is incorrect! (475.000000 != 950.000000) 477th item is incorrect! (476.000000 != 952.000000) 478th item is incorrect! (477.000000 != 954.000000) 479th item is incorrect! (478.000000 != 956.000000) 480th item is incorrect! 
(479.000000 != 958.000000) 481th item is incorrect! (480.000000 != 960.000000) 482th item is incorrect! (481.000000 != 962.000000) 483th item is incorrect! (482.000000 != 964.000000) 484th item is incorrect! (483.000000 != 966.000000) 485th item is incorrect! (484.000000 != 968.000000) 486th item is incorrect! (485.000000 != 970.000000) 487th item is incorrect! (486.000000 != 972.000000) 488th item is incorrect! (487.000000 != 974.000000) 489th item is incorrect! (488.000000 != 976.000000) 490th item is incorrect! (489.000000 != 978.000000) 491th item is incorrect! (490.000000 != 980.000000) 492th item is incorrect! (491.000000 != 982.000000) 493th item is incorrect! (492.000000 != 984.000000) 494th item is incorrect! (493.000000 != 986.000000) 495th item is incorrect! (494.000000 != 988.000000) 496th item is incorrect! (495.000000 != 990.000000) 497th item is incorrect! (496.000000 != 992.000000) 498th item is incorrect! (497.000000 != 994.000000) 499th item is incorrect! (498.000000 != 996.000000) 500th item is incorrect! (499.000000 != 998.000000) 501th item is incorrect! (500.000000 != 1000.000000) 502th item is incorrect! (501.000000 != 1002.000000) 503th item is incorrect! (502.000000 != 1004.000000) 504th item is incorrect! (503.000000 != 1006.000000) 505th item is incorrect! (504.000000 != 1008.000000) 506th item is incorrect! (505.000000 != 1010.000000) 507th item is incorrect! (506.000000 != 1012.000000) 508th item is incorrect! (507.000000 != 1014.000000) 509th item is incorrect! (508.000000 != 1016.000000) 510th item is incorrect! (509.000000 != 1018.000000) 511th item is incorrect! (510.000000 != 1020.000000) 512th item is incorrect! (511.000000 != 1022.000000) 511 errors! ERROR: ACCL base functionality test failed! 
-- STATISTICS - ID: 0 -----------------------------------------------
Read command FIFO used: 0
Write command FIFO used: 0
Host reads sent: 98
Host writes sent: 66
Card reads sent: 65
Card writes sent: 64
Sync reads sent: 10
Sync writes sent: 0
Page faults: 0
-- NET STATS QSFP0
RX pkgs: 510 TX pkgs: 138
ARP RX pkgs: 4 ARP TX pkgs: 2
ICMP RX pkgs: 0 ICMP TX pkgs: 0
TCP RX pkgs: 0 TCP TX pkgs: 0
ROCE RX pkgs: 180 ROCE TX pkgs: 136
IBV RX pkgs: 235 IBV TX pkgs: 236
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0
Finalizing MPI...
Done. Terminating...
```
stderr
```
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 21422
UID: 500207
[Tue May 14 12:11:34 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
CCLO HWID: 2696576574 at 0x0
CCLO source commit (first 24b): a0ba7e
CCLO Capabilities: Stack type: RDMA Internal DMA:True External DMA:False Reduction:True Compression:True Kernel Streams:True Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe2c6000000, Size: 64
calling offload: 7fe2c6000000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe2c5e00000, Size: 64
calling offload: 7fe2c5e00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe2c5a00000, Size: 4194304
calling offload: 7fe2c5a00000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe2c5600000, Size: 4194304
calling offload: 7fe2c5600000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe2c5200000, Size: 4194304
calling offload: 7fe2c5200000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40): local rank: 0 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
Rank 0 passed last barrier before test!
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7fe2c6000000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7fe2c5e00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe2c5000000, Size: 2048
CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe2c4e00000, Size: 2048
Reducing data...
Free user buffer from cProc cPid:0, buffer_size:2048,7fe2c5000000
Free user buffer from cProc cPid:0, buffer_size:2048,7fe2c4e00000
Communicator 0 (0x40): local rank: 0 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 1
CCLO address: 0 rx address: 4
Spare RX Buffer 0: address: 0x7fe2c6000000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7fe2c5e00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fe2c6000000
Free user buffer from cProc cPid:0, buffer_size:64,7fe2c5e00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5a00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5600000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5200000
```
quetric commented 6 months ago

@lawirz can you specify the threshold for Eager transfers in your ACCL initialization?

lawirz commented 5 months ago

The errors are fixed on the issue branch