Closed lawirz closed 5 months ago
I'm getting errors on repeated runs of Reduce and Allreduce:
... Pass accl barrier host measured durationUs:91.724 2th item is incorrect! (1.000000 != 2.000000) 3th item is incorrect! (2.000000 != 4.000000) 4th item is incorrect! (3.000000 != 6.000000) 5th item is incorrect! (4.000000 != 8.000000) ...
The first run succeeds.
I'm using a slightly modified version of the script https://github.com/Xilinx/ACCL/blob/dev/test/host/Coyote/run_scripts/run.sh on commit https://github.com/Xilinx/ACCL/commit/a0ba7ea3040b026359d5f790acb7f83b67e29645
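A note on the mismatch pattern in the excerpt above: every reported pair has the form (i != 2*i), and the numbering starts at the 2nd item. That would be consistent with a two-rank sum check in which each rank contributes the value i at index i and the result buffer still holds a single, unreduced contribution. Index 0 passes trivially because 0 == 2*0. The sketch below only reproduces that arithmetic. It is not the ACCL test code: `check_allreduce`, the operand pattern, and reading the left value as the observed result are assumptions made for illustration.

```python
def check_allreduce(result, operand, num_ranks):
    """Return (1-based position, observed, expected) for every mismatch.

    Assumption: the reference for a sum-allreduce is operand[i] * num_ranks,
    i.e. every rank contributes the same operand buffer.
    """
    mismatches = []
    for i, (res, op) in enumerate(zip(result, operand)):
        ref = op * num_ranks
        if res != ref:
            mismatches.append((i + 1, res, ref))
    return mismatches


count = 512      # matches '-c' '512' in the reported arguments
num_ranks = 2    # matches "size 2" in the reported log
operand = [float(i) for i in range(count)]

# Hypothetical failure mode: the result buffer holds only one contribution.
unreduced = list(operand)

errors = check_allreduce(unreduced, operand, num_ranks)
print(len(errors))   # number of mismatching items
print(errors[0])     # first mismatch
print(errors[-1])    # last mismatch
```

Under these assumptions the sketch yields 511 mismatches out of 512 items, the first at position 2 and the last at position 512, which lines up with the "511 errors!" summary in the second run.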
@lawirz, can you specify the threshold for Eager transfers in your ACCL initialization?
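For context on this question, here is a rough way to relate the message size to an eager threshold. The sketch assumes 4-byte elements, that the threshold is expressed in bytes, and that a message takes the eager path when its size is at most the configured maximum and the rendezvous path otherwise. That selection rule and the helper name `choose_protocol` are illustrative assumptions, not taken from the ACCL sources. The input values come from the first run's log (`count:512` and `Set max eager size: 64`).

```python
def choose_protocol(count, max_eager_bytes, element_bytes=4):
    """Pick a transfer protocol from the message size.

    Assumption: sizes up to and including max_eager_bytes go eager,
    anything larger goes rendezvous.
    """
    size_bytes = count * element_bytes
    protocol = "eager" if size_bytes <= max_eager_bytes else "rendezvous"
    return protocol, size_bytes


# Values as reported in the log: 512 elements, max eager size 64.
protocol, size_bytes = choose_protocol(count=512, max_eager_bytes=64)
print(protocol, size_bytes)
```

With these inputs the message is 2048 bytes, which matches the `buffer_size:2048` user buffers in the log, and the sketch selects rendezvous, in line with the `Rendezvous Protocol` message printed by the test.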
The errors are fixed on the issue branch
Output of the first run (allreduce):
stdout
```
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '10' '-c' '512' '-l' './accl_log/fpga' '-p' '1' '-n' '1'
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:512 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair:
id: 1
Local Queue:
local: QPN 0x000002, PSN 0x13da2b, VADDR 00007fe623800000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue:
remote: QPN 0x000001, PSN 0x46045d, VADDR 00007f4951e00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:106.15
hw nop time [ns]:940
Start allreduce test and reduce function 0...
Repetition 0
Pass accl barrier
host measured durationUs:410.125
Test is successful!
ACCL base functionality test completed successfully!
-- STATISTICS - ID: 0 -----------------------------------------------
Read command FIFO used: 0
Write command FIFO used: 0
Host reads sent: 96
Host writes sent: 64
Card reads sent: 64
Card writes sent: 64
Sync reads sent: 5
Sync writes sent: 0
Page faults: 0
-- NET STATS QSFP0
RX pkgs: 320
TX pkgs: 132
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 152
ROCE TX pkgs: 130
IBV RX pkgs: 195
IBV TX pkgs: 195
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0
Finalizing MPI...
Done. Terminating...
```
stderr
```
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 21032
UID: 500207
[Tue May 14 12:09:44 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 2696576574 at 0x0
CCLO source commit (first 24b): a0ba7e
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe622c00000, Size: 64
calling offload: 7fe622c00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fe622a00000, Size: 64
calling offload: 7fe622a00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe622600000, Size: 4194304
calling offload: 7fe622600000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe622200000, Size: 4194304
calling offload: 7fe622200000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fe621e00000, Size: 4194304
calling offload: 7fe621e00000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0
number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
Rank 0 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0: address: 0x7fe622c00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
Spare RX Buffer 1: address: 0x7fe622a00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0
CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe621c00000, Size: 2048
CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1
Allocation successful! Allocated buffer: 7fe621a00000, Size: 2048
Reducing data...
Free user buffer from cProc cPid:0, buffer_size:2048,7fe621c00000
Free user buffer from cProc cPid:0, buffer_size:2048,7fe621a00000
Communicator 0 (0x40):
local rank: 0
number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 64, -> outbound seq number 64
CCLO address: 0
rx address: 4
Spare RX Buffer 0: address: 0x7fe622c00000 status: ENQUEUED occupancy: 32/64 MPI tag: ffffffff seq: 62 src: 1
Spare RX Buffer 1: address: 0x7fe622a00000 status: ENQUEUED occupancy: 32/64 MPI tag: ffffffff seq: 63 src: 1
Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fe622c00000
Free user buffer from cProc cPid:0, buffer_size:64,7fe622a00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe622600000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe622200000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fe621e00000
```
Second run:
stdout
```
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '10' '-c' '512' '-l' './accl_log/fpga' '-p' '1' '-n' '1'
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:512 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair:
id: 1
Local Queue:
local: QPN 0x000001, PSN 0x709eba, VADDR 00007fe2c6c00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue:
remote: QPN 0x000002, PSN 0xd26325, VADDR 00007f9402e00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:87.215
hw nop time [ns]:940
Start allreduce test and reduce function 0...
Repetition 0
Pass accl barrier
host measured durationUs:79.17
2th item is incorrect! (1.000000 != 2.000000) 3th item is incorrect! (2.000000 != 4.000000) 4th item is incorrect! (3.000000 != 6.000000) 5th item is incorrect! (4.000000 != 8.000000) 6th item is incorrect! (5.000000 != 10.000000) 7th item is incorrect! (6.000000 != 12.000000) 8th item is incorrect! (7.000000 != 14.000000) 9th item is incorrect! (8.000000 != 16.000000) 10th item is incorrect! (9.000000 != 18.000000) 11th item is incorrect! (10.000000 != 20.000000) 12th item is incorrect! (11.000000 != 22.000000) 13th item is incorrect! (12.000000 != 24.000000) 14th item is incorrect! (13.000000 != 26.000000) 15th item is incorrect! (14.000000 != 28.000000) 16th item is incorrect! (15.000000 != 30.000000) 17th item is incorrect! (16.000000 != 32.000000) 18th item is incorrect! (17.000000 != 34.000000) 19th item is incorrect! (18.000000 != 36.000000) 20th item is incorrect! (19.000000 != 38.000000) 21th item is incorrect! (20.000000 != 40.000000) 22th item is incorrect! 
(21.000000 != 42.000000) 23th item is incorrect! (22.000000 != 44.000000) 24th item is incorrect! (23.000000 != 46.000000) 25th item is incorrect! (24.000000 != 48.000000) 26th item is incorrect! (25.000000 != 50.000000) 27th item is incorrect! (26.000000 != 52.000000) 28th item is incorrect! (27.000000 != 54.000000) 29th item is incorrect! (28.000000 != 56.000000) 30th item is incorrect! (29.000000 != 58.000000) 31th item is incorrect! (30.000000 != 60.000000) 32th item is incorrect! (31.000000 != 62.000000) 33th item is incorrect! (32.000000 != 64.000000) 34th item is incorrect! (33.000000 != 66.000000) 35th item is incorrect! (34.000000 != 68.000000) 36th item is incorrect! (35.000000 != 70.000000) 37th item is incorrect! (36.000000 != 72.000000) 38th item is incorrect! (37.000000 != 74.000000) 39th item is incorrect! (38.000000 != 76.000000) 40th item is incorrect! (39.000000 != 78.000000) 41th item is incorrect! (40.000000 != 80.000000) 42th item is incorrect! (41.000000 != 82.000000) 43th item is incorrect! (42.000000 != 84.000000) 44th item is incorrect! (43.000000 != 86.000000) 45th item is incorrect! (44.000000 != 88.000000) 46th item is incorrect! (45.000000 != 90.000000) 47th item is incorrect! (46.000000 != 92.000000) 48th item is incorrect! (47.000000 != 94.000000) 49th item is incorrect! (48.000000 != 96.000000) 50th item is incorrect! (49.000000 != 98.000000) 51th item is incorrect! (50.000000 != 100.000000) 52th item is incorrect! (51.000000 != 102.000000) 53th item is incorrect! (52.000000 != 104.000000) 54th item is incorrect! (53.000000 != 106.000000) 55th item is incorrect! (54.000000 != 108.000000) 56th item is incorrect! (55.000000 != 110.000000) 57th item is incorrect! (56.000000 != 112.000000) 58th item is incorrect! (57.000000 != 114.000000) 59th item is incorrect! (58.000000 != 116.000000) 60th item is incorrect! (59.000000 != 118.000000) 61th item is incorrect! (60.000000 != 120.000000) 62th item is incorrect! 
(61.000000 != 122.000000) 63th item is incorrect! (62.000000 != 124.000000) 64th item is incorrect! (63.000000 != 126.000000) 65th item is incorrect! (64.000000 != 128.000000) 66th item is incorrect! (65.000000 != 130.000000) 67th item is incorrect! (66.000000 != 132.000000) 68th item is incorrect! (67.000000 != 134.000000) 69th item is incorrect! (68.000000 != 136.000000) 70th item is incorrect! (69.000000 != 138.000000) 71th item is incorrect! (70.000000 != 140.000000) 72th item is incorrect! (71.000000 != 142.000000) 73th item is incorrect! (72.000000 != 144.000000) 74th item is incorrect! (73.000000 != 146.000000) 75th item is incorrect! (74.000000 != 148.000000) 76th item is incorrect! (75.000000 != 150.000000) 77th item is incorrect! (76.000000 != 152.000000) 78th item is incorrect! (77.000000 != 154.000000) 79th item is incorrect! (78.000000 != 156.000000) 80th item is incorrect! (79.000000 != 158.000000) 81th item is incorrect! (80.000000 != 160.000000) 82th item is incorrect! (81.000000 != 162.000000) 83th item is incorrect! (82.000000 != 164.000000) 84th item is incorrect! (83.000000 != 166.000000) 85th item is incorrect! (84.000000 != 168.000000) 86th item is incorrect! (85.000000 != 170.000000) 87th item is incorrect! (86.000000 != 172.000000) 88th item is incorrect! (87.000000 != 174.000000) 89th item is incorrect! (88.000000 != 176.000000) 90th item is incorrect! (89.000000 != 178.000000) 91th item is incorrect! (90.000000 != 180.000000) 92th item is incorrect! (91.000000 != 182.000000) 93th item is incorrect! (92.000000 != 184.000000) 94th item is incorrect! (93.000000 != 186.000000) 95th item is incorrect! (94.000000 != 188.000000) 96th item is incorrect! (95.000000 != 190.000000) 97th item is incorrect! (96.000000 != 192.000000) 98th item is incorrect! (97.000000 != 194.000000) 99th item is incorrect! (98.000000 != 196.000000) 100th item is incorrect! (99.000000 != 198.000000) 101th item is incorrect! 
(100.000000 != 200.000000) 102th item is incorrect! (101.000000 != 202.000000) 103th item is incorrect! (102.000000 != 204.000000) 104th item is incorrect! (103.000000 != 206.000000) 105th item is incorrect! (104.000000 != 208.000000) 106th item is incorrect! (105.000000 != 210.000000) 107th item is incorrect! (106.000000 != 212.000000) 108th item is incorrect! (107.000000 != 214.000000) 109th item is incorrect! (108.000000 != 216.000000) 110th item is incorrect! (109.000000 != 218.000000) 111th item is incorrect! (110.000000 != 220.000000) 112th item is incorrect! (111.000000 != 222.000000) 113th item is incorrect! (112.000000 != 224.000000) 114th item is incorrect! (113.000000 != 226.000000) 115th item is incorrect! (114.000000 != 228.000000) 116th item is incorrect! (115.000000 != 230.000000) 117th item is incorrect! (116.000000 != 232.000000) 118th item is incorrect! (117.000000 != 234.000000) 119th item is incorrect! (118.000000 != 236.000000) 120th item is incorrect! (119.000000 != 238.000000) 121th item is incorrect! (120.000000 != 240.000000) 122th item is incorrect! (121.000000 != 242.000000) 123th item is incorrect! (122.000000 != 244.000000) 124th item is incorrect! (123.000000 != 246.000000) 125th item is incorrect! (124.000000 != 248.000000) 126th item is incorrect! (125.000000 != 250.000000) 127th item is incorrect! (126.000000 != 252.000000) 128th item is incorrect! (127.000000 != 254.000000) 129th item is incorrect! (128.000000 != 256.000000) 130th item is incorrect! (129.000000 != 258.000000) 131th item is incorrect! (130.000000 != 260.000000) 132th item is incorrect! (131.000000 != 262.000000) 133th item is incorrect! (132.000000 != 264.000000) 134th item is incorrect! (133.000000 != 266.000000) 135th item is incorrect! (134.000000 != 268.000000) 136th item is incorrect! (135.000000 != 270.000000) 137th item is incorrect! (136.000000 != 272.000000) 138th item is incorrect! (137.000000 != 274.000000) 139th item is incorrect! 
(138.000000 != 276.000000) 140th item is incorrect! (139.000000 != 278.000000) 141th item is incorrect! (140.000000 != 280.000000) 142th item is incorrect! (141.000000 != 282.000000) 143th item is incorrect! (142.000000 != 284.000000) 144th item is incorrect! (143.000000 != 286.000000) 145th item is incorrect! (144.000000 != 288.000000) 146th item is incorrect! (145.000000 != 290.000000) 147th item is incorrect! (146.000000 != 292.000000) 148th item is incorrect! (147.000000 != 294.000000) 149th item is incorrect! (148.000000 != 296.000000) 150th item is incorrect! (149.000000 != 298.000000) 151th item is incorrect! (150.000000 != 300.000000) 152th item is incorrect! (151.000000 != 302.000000) 153th item is incorrect! (152.000000 != 304.000000) 154th item is incorrect! (153.000000 != 306.000000) 155th item is incorrect! (154.000000 != 308.000000) 156th item is incorrect! (155.000000 != 310.000000) 157th item is incorrect! (156.000000 != 312.000000) 158th item is incorrect! (157.000000 != 314.000000) 159th item is incorrect! (158.000000 != 316.000000) 160th item is incorrect! (159.000000 != 318.000000) 161th item is incorrect! (160.000000 != 320.000000) 162th item is incorrect! (161.000000 != 322.000000) 163th item is incorrect! (162.000000 != 324.000000) 164th item is incorrect! (163.000000 != 326.000000) 165th item is incorrect! (164.000000 != 328.000000) 166th item is incorrect! (165.000000 != 330.000000) 167th item is incorrect! (166.000000 != 332.000000) 168th item is incorrect! (167.000000 != 334.000000) 169th item is incorrect! (168.000000 != 336.000000) 170th item is incorrect! (169.000000 != 338.000000) 171th item is incorrect! (170.000000 != 340.000000) 172th item is incorrect! (171.000000 != 342.000000) 173th item is incorrect! (172.000000 != 344.000000) 174th item is incorrect! (173.000000 != 346.000000) 175th item is incorrect! (174.000000 != 348.000000) 176th item is incorrect! (175.000000 != 350.000000) 177th item is incorrect! 
(176.000000 != 352.000000) 178th item is incorrect! (177.000000 != 354.000000) 179th item is incorrect! (178.000000 != 356.000000) 180th item is incorrect! (179.000000 != 358.000000) 181th item is incorrect! (180.000000 != 360.000000) 182th item is incorrect! (181.000000 != 362.000000) 183th item is incorrect! (182.000000 != 364.000000) 184th item is incorrect! (183.000000 != 366.000000) 185th item is incorrect! (184.000000 != 368.000000) 186th item is incorrect! (185.000000 != 370.000000) 187th item is incorrect! (186.000000 != 372.000000) 188th item is incorrect! (187.000000 != 374.000000) 189th item is incorrect! (188.000000 != 376.000000) 190th item is incorrect! (189.000000 != 378.000000) 191th item is incorrect! (190.000000 != 380.000000) 192th item is incorrect! (191.000000 != 382.000000) 193th item is incorrect! (192.000000 != 384.000000) 194th item is incorrect! (193.000000 != 386.000000) 195th item is incorrect! (194.000000 != 388.000000) 196th item is incorrect! (195.000000 != 390.000000) 197th item is incorrect! (196.000000 != 392.000000) 198th item is incorrect! (197.000000 != 394.000000) 199th item is incorrect! (198.000000 != 396.000000) 200th item is incorrect! (199.000000 != 398.000000) 201th item is incorrect! (200.000000 != 400.000000) 202th item is incorrect! (201.000000 != 402.000000) 203th item is incorrect! (202.000000 != 404.000000) 204th item is incorrect! (203.000000 != 406.000000) 205th item is incorrect! (204.000000 != 408.000000) 206th item is incorrect! (205.000000 != 410.000000) 207th item is incorrect! (206.000000 != 412.000000) 208th item is incorrect! (207.000000 != 414.000000) 209th item is incorrect! (208.000000 != 416.000000) 210th item is incorrect! (209.000000 != 418.000000) 211th item is incorrect! (210.000000 != 420.000000) 212th item is incorrect! (211.000000 != 422.000000) 213th item is incorrect! (212.000000 != 424.000000) 214th item is incorrect! (213.000000 != 426.000000) 215th item is incorrect! 
(214.000000 != 428.000000) 216th item is incorrect! (215.000000 != 430.000000) 217th item is incorrect! (216.000000 != 432.000000) 218th item is incorrect! (217.000000 != 434.000000) 219th item is incorrect! (218.000000 != 436.000000) 220th item is incorrect! (219.000000 != 438.000000) 221th item is incorrect! (220.000000 != 440.000000) 222th item is incorrect! (221.000000 != 442.000000) 223th item is incorrect! (222.000000 != 444.000000) 224th item is incorrect! (223.000000 != 446.000000) 225th item is incorrect! (224.000000 != 448.000000) 226th item is incorrect! (225.000000 != 450.000000) 227th item is incorrect! (226.000000 != 452.000000) 228th item is incorrect! (227.000000 != 454.000000) 229th item is incorrect! (228.000000 != 456.000000) 230th item is incorrect! (229.000000 != 458.000000) 231th item is incorrect! (230.000000 != 460.000000) 232th item is incorrect! (231.000000 != 462.000000) 233th item is incorrect! (232.000000 != 464.000000) 234th item is incorrect! (233.000000 != 466.000000) 235th item is incorrect! (234.000000 != 468.000000) 236th item is incorrect! (235.000000 != 470.000000) 237th item is incorrect! (236.000000 != 472.000000) 238th item is incorrect! (237.000000 != 474.000000) 239th item is incorrect! (238.000000 != 476.000000) 240th item is incorrect! (239.000000 != 478.000000) 241th item is incorrect! (240.000000 != 480.000000) 242th item is incorrect! (241.000000 != 482.000000) 243th item is incorrect! (242.000000 != 484.000000) 244th item is incorrect! (243.000000 != 486.000000) 245th item is incorrect! (244.000000 != 488.000000) 246th item is incorrect! (245.000000 != 490.000000) 247th item is incorrect! (246.000000 != 492.000000) 248th item is incorrect! (247.000000 != 494.000000) 249th item is incorrect! (248.000000 != 496.000000) 250th item is incorrect! (249.000000 != 498.000000) 251th item is incorrect! (250.000000 != 500.000000) 252th item is incorrect! (251.000000 != 502.000000) 253th item is incorrect! 
(252.000000 != 504.000000) 254th item is incorrect! (253.000000 != 506.000000) 255th item is incorrect! (254.000000 != 508.000000) 256th item is incorrect! (255.000000 != 510.000000) 257th item is incorrect! (256.000000 != 512.000000) 258th item is incorrect! (257.000000 != 514.000000) 259th item is incorrect! (258.000000 != 516.000000) 260th item is incorrect! (259.000000 != 518.000000) 261th item is incorrect! (260.000000 != 520.000000) 262th item is incorrect! (261.000000 != 522.000000) 263th item is incorrect! (262.000000 != 524.000000) 264th item is incorrect! (263.000000 != 526.000000) 265th item is incorrect! (264.000000 != 528.000000) 266th item is incorrect! (265.000000 != 530.000000) 267th item is incorrect! (266.000000 != 532.000000) 268th item is incorrect! (267.000000 != 534.000000) 269th item is incorrect! (268.000000 != 536.000000) 270th item is incorrect! (269.000000 != 538.000000) 271th item is incorrect! (270.000000 != 540.000000) 272th item is incorrect! (271.000000 != 542.000000) 273th item is incorrect! (272.000000 != 544.000000) 274th item is incorrect! (273.000000 != 546.000000) 275th item is incorrect! (274.000000 != 548.000000) 276th item is incorrect! (275.000000 != 550.000000) 277th item is incorrect! (276.000000 != 552.000000) 278th item is incorrect! (277.000000 != 554.000000) 279th item is incorrect! (278.000000 != 556.000000) 280th item is incorrect! (279.000000 != 558.000000) 281th item is incorrect! (280.000000 != 560.000000) 282th item is incorrect! (281.000000 != 562.000000) 283th item is incorrect! (282.000000 != 564.000000) 284th item is incorrect! (283.000000 != 566.000000) 285th item is incorrect! (284.000000 != 568.000000) 286th item is incorrect! (285.000000 != 570.000000) 287th item is incorrect! (286.000000 != 572.000000) 288th item is incorrect! (287.000000 != 574.000000) 289th item is incorrect! (288.000000 != 576.000000) 290th item is incorrect! (289.000000 != 578.000000) 291th item is incorrect! 
(290.000000 != 580.000000) 292th item is incorrect! (291.000000 != 582.000000) 293th item is incorrect! (292.000000 != 584.000000) 294th item is incorrect! (293.000000 != 586.000000) 295th item is incorrect! (294.000000 != 588.000000) 296th item is incorrect! (295.000000 != 590.000000) 297th item is incorrect! (296.000000 != 592.000000) 298th item is incorrect! (297.000000 != 594.000000) 299th item is incorrect! (298.000000 != 596.000000) 300th item is incorrect! (299.000000 != 598.000000) 301th item is incorrect! (300.000000 != 600.000000) 302th item is incorrect! (301.000000 != 602.000000) 303th item is incorrect! (302.000000 != 604.000000) 304th item is incorrect! (303.000000 != 606.000000) 305th item is incorrect! (304.000000 != 608.000000) 306th item is incorrect! (305.000000 != 610.000000) 307th item is incorrect! (306.000000 != 612.000000) 308th item is incorrect! (307.000000 != 614.000000) 309th item is incorrect! (308.000000 != 616.000000) 310th item is incorrect! (309.000000 != 618.000000) 311th item is incorrect! (310.000000 != 620.000000) 312th item is incorrect! (311.000000 != 622.000000) 313th item is incorrect! (312.000000 != 624.000000) 314th item is incorrect! (313.000000 != 626.000000) 315th item is incorrect! (314.000000 != 628.000000) 316th item is incorrect! (315.000000 != 630.000000) 317th item is incorrect! (316.000000 != 632.000000) 318th item is incorrect! (317.000000 != 634.000000) 319th item is incorrect! (318.000000 != 636.000000) 320th item is incorrect! (319.000000 != 638.000000) 321th item is incorrect! (320.000000 != 640.000000) 322th item is incorrect! (321.000000 != 642.000000) 323th item is incorrect! (322.000000 != 644.000000) 324th item is incorrect! (323.000000 != 646.000000) 325th item is incorrect! (324.000000 != 648.000000) 326th item is incorrect! (325.000000 != 650.000000) 327th item is incorrect! (326.000000 != 652.000000) 328th item is incorrect! (327.000000 != 654.000000) 329th item is incorrect! 
(328.000000 != 656.000000) 330th item is incorrect! (329.000000 != 658.000000) 331th item is incorrect! (330.000000 != 660.000000) 332th item is incorrect! (331.000000 != 662.000000) 333th item is incorrect! (332.000000 != 664.000000) 334th item is incorrect! (333.000000 != 666.000000) 335th item is incorrect! (334.000000 != 668.000000) 336th item is incorrect! (335.000000 != 670.000000) 337th item is incorrect! (336.000000 != 672.000000) 338th item is incorrect! (337.000000 != 674.000000) 339th item is incorrect! (338.000000 != 676.000000) 340th item is incorrect! (339.000000 != 678.000000) 341th item is incorrect! (340.000000 != 680.000000) 342th item is incorrect! (341.000000 != 682.000000) 343th item is incorrect! (342.000000 != 684.000000) 344th item is incorrect! (343.000000 != 686.000000) 345th item is incorrect! (344.000000 != 688.000000) 346th item is incorrect! (345.000000 != 690.000000) 347th item is incorrect! (346.000000 != 692.000000) 348th item is incorrect! (347.000000 != 694.000000) 349th item is incorrect! (348.000000 != 696.000000) 350th item is incorrect! (349.000000 != 698.000000) 351th item is incorrect! (350.000000 != 700.000000) 352th item is incorrect! (351.000000 != 702.000000) 353th item is incorrect! (352.000000 != 704.000000) 354th item is incorrect! (353.000000 != 706.000000) 355th item is incorrect! (354.000000 != 708.000000) 356th item is incorrect! (355.000000 != 710.000000) 357th item is incorrect! (356.000000 != 712.000000) 358th item is incorrect! (357.000000 != 714.000000) 359th item is incorrect! (358.000000 != 716.000000) 360th item is incorrect! (359.000000 != 718.000000) 361th item is incorrect! (360.000000 != 720.000000) 362th item is incorrect! (361.000000 != 722.000000) 363th item is incorrect! (362.000000 != 724.000000) 364th item is incorrect! (363.000000 != 726.000000) 365th item is incorrect! (364.000000 != 728.000000) 366th item is incorrect! (365.000000 != 730.000000) 367th item is incorrect! 
(366.000000 != 732.000000) 368th item is incorrect! (367.000000 != 734.000000) 369th item is incorrect! (368.000000 != 736.000000) 370th item is incorrect! (369.000000 != 738.000000) 371th item is incorrect! (370.000000 != 740.000000) 372th item is incorrect! (371.000000 != 742.000000) 373th item is incorrect! (372.000000 != 744.000000) 374th item is incorrect! (373.000000 != 746.000000) 375th item is incorrect! (374.000000 != 748.000000) 376th item is incorrect! (375.000000 != 750.000000) 377th item is incorrect! (376.000000 != 752.000000) 378th item is incorrect! (377.000000 != 754.000000) 379th item is incorrect! (378.000000 != 756.000000) 380th item is incorrect! (379.000000 != 758.000000) 381th item is incorrect! (380.000000 != 760.000000) 382th item is incorrect! (381.000000 != 762.000000) 383th item is incorrect! (382.000000 != 764.000000) 384th item is incorrect! (383.000000 != 766.000000) 385th item is incorrect! (384.000000 != 768.000000) 386th item is incorrect! (385.000000 != 770.000000) 387th item is incorrect! (386.000000 != 772.000000) 388th item is incorrect! (387.000000 != 774.000000) 389th item is incorrect! (388.000000 != 776.000000) 390th item is incorrect! (389.000000 != 778.000000) 391th item is incorrect! (390.000000 != 780.000000) 392th item is incorrect! (391.000000 != 782.000000) 393th item is incorrect! (392.000000 != 784.000000) 394th item is incorrect! (393.000000 != 786.000000) 395th item is incorrect! (394.000000 != 788.000000) 396th item is incorrect! (395.000000 != 790.000000) 397th item is incorrect! (396.000000 != 792.000000) 398th item is incorrect! (397.000000 != 794.000000) 399th item is incorrect! (398.000000 != 796.000000) 400th item is incorrect! (399.000000 != 798.000000) 401th item is incorrect! (400.000000 != 800.000000) 402th item is incorrect! (401.000000 != 802.000000) 403th item is incorrect! (402.000000 != 804.000000) 404th item is incorrect! (403.000000 != 806.000000) 405th item is incorrect! 
(404.000000 != 808.000000) 406th item is incorrect! (405.000000 != 810.000000) 407th item is incorrect! (406.000000 != 812.000000) 408th item is incorrect! (407.000000 != 814.000000) 409th item is incorrect! (408.000000 != 816.000000) 410th item is incorrect! (409.000000 != 818.000000) 411th item is incorrect! (410.000000 != 820.000000) 412th item is incorrect! (411.000000 != 822.000000) 413th item is incorrect! (412.000000 != 824.000000) 414th item is incorrect! (413.000000 != 826.000000) 415th item is incorrect! (414.000000 != 828.000000) 416th item is incorrect! (415.000000 != 830.000000) 417th item is incorrect! (416.000000 != 832.000000) 418th item is incorrect! (417.000000 != 834.000000) 419th item is incorrect! (418.000000 != 836.000000) 420th item is incorrect! (419.000000 != 838.000000) 421th item is incorrect! (420.000000 != 840.000000) 422th item is incorrect! (421.000000 != 842.000000) 423th item is incorrect! (422.000000 != 844.000000) 424th item is incorrect! (423.000000 != 846.000000) 425th item is incorrect! (424.000000 != 848.000000) 426th item is incorrect! (425.000000 != 850.000000) 427th item is incorrect! (426.000000 != 852.000000) 428th item is incorrect! (427.000000 != 854.000000) 429th item is incorrect! (428.000000 != 856.000000) 430th item is incorrect! (429.000000 != 858.000000) 431th item is incorrect! (430.000000 != 860.000000) 432th item is incorrect! (431.000000 != 862.000000) 433th item is incorrect! (432.000000 != 864.000000) 434th item is incorrect! (433.000000 != 866.000000) 435th item is incorrect! (434.000000 != 868.000000) 436th item is incorrect! (435.000000 != 870.000000) 437th item is incorrect! (436.000000 != 872.000000) 438th item is incorrect! (437.000000 != 874.000000) 439th item is incorrect! (438.000000 != 876.000000) 440th item is incorrect! (439.000000 != 878.000000) 441th item is incorrect! (440.000000 != 880.000000) 442th item is incorrect! (441.000000 != 882.000000) 443th item is incorrect! 
(442.000000 != 884.000000) 444th item is incorrect! (443.000000 != 886.000000) 445th item is incorrect! (444.000000 != 888.000000) 446th item is incorrect! (445.000000 != 890.000000) 447th item is incorrect! (446.000000 != 892.000000) 448th item is incorrect! (447.000000 != 894.000000) 449th item is incorrect! (448.000000 != 896.000000) 450th item is incorrect! (449.000000 != 898.000000) 451th item is incorrect! (450.000000 != 900.000000) 452th item is incorrect! (451.000000 != 902.000000) 453th item is incorrect! (452.000000 != 904.000000) 454th item is incorrect! (453.000000 != 906.000000) 455th item is incorrect! (454.000000 != 908.000000) 456th item is incorrect! (455.000000 != 910.000000) 457th item is incorrect! (456.000000 != 912.000000) 458th item is incorrect! (457.000000 != 914.000000) 459th item is incorrect! (458.000000 != 916.000000) 460th item is incorrect! (459.000000 != 918.000000) 461th item is incorrect! (460.000000 != 920.000000) 462th item is incorrect! (461.000000 != 922.000000) 463th item is incorrect! (462.000000 != 924.000000) 464th item is incorrect! (463.000000 != 926.000000) 465th item is incorrect! (464.000000 != 928.000000) 466th item is incorrect! (465.000000 != 930.000000) 467th item is incorrect! (466.000000 != 932.000000) 468th item is incorrect! (467.000000 != 934.000000) 469th item is incorrect! (468.000000 != 936.000000) 470th item is incorrect! (469.000000 != 938.000000) 471th item is incorrect! (470.000000 != 940.000000) 472th item is incorrect! (471.000000 != 942.000000) 473th item is incorrect! (472.000000 != 944.000000) 474th item is incorrect! (473.000000 != 946.000000) 475th item is incorrect! (474.000000 != 948.000000) 476th item is incorrect! (475.000000 != 950.000000) 477th item is incorrect! (476.000000 != 952.000000) 478th item is incorrect! (477.000000 != 954.000000) 479th item is incorrect! (478.000000 != 956.000000) 480th item is incorrect! (479.000000 != 958.000000) 481th item is incorrect! 
(480.000000 != 960.000000) 482th item is incorrect! (481.000000 != 962.000000) 483th item is incorrect! (482.000000 != 964.000000) 484th item is incorrect! (483.000000 != 966.000000) 485th item is incorrect! (484.000000 != 968.000000) 486th item is incorrect! (485.000000 != 970.000000) 487th item is incorrect! (486.000000 != 972.000000) 488th item is incorrect! (487.000000 != 974.000000) 489th item is incorrect! (488.000000 != 976.000000) 490th item is incorrect! (489.000000 != 978.000000) 491th item is incorrect! (490.000000 != 980.000000) 492th item is incorrect! (491.000000 != 982.000000) 493th item is incorrect! (492.000000 != 984.000000) 494th item is incorrect! (493.000000 != 986.000000) 495th item is incorrect! (494.000000 != 988.000000) 496th item is incorrect! (495.000000 != 990.000000) 497th item is incorrect! (496.000000 != 992.000000) 498th item is incorrect! (497.000000 != 994.000000) 499th item is incorrect! (498.000000 != 996.000000) 500th item is incorrect! (499.000000 != 998.000000) 501th item is incorrect! (500.000000 != 1000.000000) 502th item is incorrect! (501.000000 != 1002.000000) 503th item is incorrect! (502.000000 != 1004.000000) 504th item is incorrect! (503.000000 != 1006.000000) 505th item is incorrect! (504.000000 != 1008.000000) 506th item is incorrect! (505.000000 != 1010.000000) 507th item is incorrect! (506.000000 != 1012.000000) 508th item is incorrect! (507.000000 != 1014.000000) 509th item is incorrect! (508.000000 != 1016.000000) 510th item is incorrect! (509.000000 != 1018.000000) 511th item is incorrect! (510.000000 != 1020.000000) 512th item is incorrect! (511.000000 != 1022.000000) 511 errors! ERROR: ACCL base functionality test failed! 
-- STATISTICS - ID: 0 ----------------------------------------------- Read command FIFO used: 0 Write command FIFO used: 0 Host reads sent: 98 Host writes sent: 66 Card reads sent: 65 Card writes sent: 64 Sync reads sent: 10 Sync writes sent: 0 Page faults: 0 -- NET STATS QSFP0 RX pkgs: 510 TX pkgs: 138 ARP RX pkgs: 4 ARP TX pkgs: 2 ICMP RX pkgs: 0 ICMP TX pkgs: 0 TCP RX pkgs: 0 TCP TX pkgs: 0 ROCE RX pkgs: 180 ROCE TX pkgs: 136 IBV RX pkgs: 235 IBV TX pkgs: 236 PSN drop cnt: 0 Retrans cnt: 0 TCP session cnt: 0 STRM down: 0 Finalizing MPI... Done. Terminating...
```
stderr
```
XRT build version: 2.13.466 Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776 Build date: 2022-04-14 17:43:11 Git branch: 2022.1 PID: 21422 UID: 500207 [Tue May 14 12:11:34 2024 GMT] HOST: alveo-u55c-07.inf.ethz.ch EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote [XRT] ERROR: No devices found [XRT] ERROR: No devices found [XRT] ERROR: No devices found ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0 ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2 ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1 CCLO HWID: 2696576574 at 0x0 CCLO source commit (first 24b): a0ba7e CCLO Capabilities: Stack type: RDMA Internal DMA:True External DMA:False Reduction:True Compression:True Kernel Streams:True Debug:False Doing a soft reset Configuring Eager RX Buffers get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1 Allocation successful! Allocated buffer: 7fe2c6000000, Size: 64 calling offload: 7fe2c6000000, size: 64 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1 Allocation successful! Allocated buffer: 7fe2c5e00000, Size: 64 calling offload: 7fe2c5e00000, size: 64 Configuring Rendezvous Spare Buffers get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7fe2c5a00000, Size: 4194304 calling offload: 7fe2c5a00000, size: 4194304 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7fe2c5600000, Size: 4194304 calling offload: 7fe2c5600000, size: 4194304 get_device_type: coyote_device get_device_type: coyote_device CoyoteBuffer contructor called!
page_size:2097152, buffer_size:4194304,n_pages:2 Allocation successful! Allocated buffer: 7fe2c5200000, Size: 4194304 calling offload: 7fe2c5200000, size: 4194304 Configuring a communicator Configuring arithmetic Configuring collective tuning parameters CCLO configured Set timeout Set max eager size: 64 Set max rendezvous reduce size: 4194304 Accelerator ready! Communicator 0 (0x40): local rank: 0 number of ranks: 2 > rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 > rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 Rank 0 passed last barrier before test! CCLO address: 0 rx address: 4 Spare RX Buffer 0: address: 0x7fe2c6000000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 Spare RX Buffer 1: address: 0x7fe2c5e00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1 Allocation successful! Allocated buffer: 7fe2c5000000, Size: 2048 CoyoteBuffer contructor called! page_size:2097152, buffer_size:2048,n_pages:1 Allocation successful! Allocated buffer: 7fe2c4e00000, Size: 2048 Reducing data... 
Free user buffer from cProc cPid:0, buffer_size:2048,7fe2c5000000 Free user buffer from cProc cPid:0, buffer_size:2048,7fe2c4e00000 Communicator 0 (0x40): local rank: 0 number of ranks: 2 > rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0 > rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 1 CCLO address: 0 rx address: 4 Spare RX Buffer 0: address: 0x7fe2c6000000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 Spare RX Buffer 1: address: 0x7fe2c5e00000 status: ENQUEUED occupancy: 0/64 MPI tag: 0 seq: 0 src: 0 Removing CCLO object at 0 Doing a soft reset Free user buffer from cProc cPid:0, buffer_size:64,7fe2c6000000 Free user buffer from cProc cPid:0, buffer_size:64,7fe2c5e00000 Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5a00000 Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5600000 Free user buffer from cProc cPid:0, buffer_size:4194304,7fe2c5200000
```
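For anyone triaging the failing run: in every mismatch visible in the stdout of this run, the right-hand value is exactly twice the left-hand value. Below is a small sketch for checking that over a saved log. It assumes the message format shown in the output (`Nth item is incorrect! (a != b)`); the names `parse_mismatches` and `all_doubled` are illustrative only and are not part of ACCL.

```python
import re

# Matches messages like "512th item is incorrect! (511.000000 != 1022.000000)".
# \s+ tolerates the line wraps that appear in the pasted log.
MISMATCH = re.compile(
    r"(\d+)th\s+item\s+is\s+incorrect!\s+\(([0-9.]+)\s+!=\s+([0-9.]+)\)"
)


def parse_mismatches(log_text):
    """Return a list of (item_number, left_value, right_value) tuples."""
    return [(int(n), float(a), float(b)) for n, a, b in MISMATCH.findall(log_text)]


def all_doubled(mismatches):
    """True if the right-hand value is exactly twice the left-hand one everywhere."""
    return all(b == 2 * a for _, a, b in mismatches)


# Three messages copied from the end of the failing run's stdout.
sample = (
    "510th item is incorrect! (509.000000 != 1018.000000) "
    "511th item is incorrect! (510.000000 != 1020.000000) "
    "512th item is incorrect! (511.000000 != 1022.000000) 511 errors!"
)
rows = parse_mismatches(sample)
print(len(rows), all_doubled(rows))
```

Running it over the full stdout instead of `sample` gives a one-line summary instead of several hundred lines to scan by eye. The log alone does not say which side of the comparison is the computed result and which is the reference.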