Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

Repeated send/recv gets stuck #139

Open · Mellich opened this issue 1 year ago

Mellich commented 1 year ago

Repeated calls of send/recv of the following form get stuck after several iterations on two ranks:

  for (int i = 0; i < 2000; i++) {
    // blocking send of `count` elements to the next rank in the ring
    accl.send(*op_buf, count, next_rank, 0, GLOBAL_COMM, true);
    // matching blocking receive from the previous rank
    accl.recv(*res_buf, count, prev_rank, 0, GLOBAL_COMM, true);
  }

The behavior seems to be non-deterministic and may only appear with large message sizes and high numbers of repetitions. It happens more frequently for FPGA-to-FPGA communication, as shown in the example, but I also observed it for CPU-to-CPU communication via ACCL. It occurs with both the UDP and the TCP stack.

Setting a sufficiently long sleep between the iterations seems to increase stability.
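
For reference, a minimal sketch of that workaround (the 100 ms delay is an arbitrary choice; longer sleeps appear to make the hang less likely but do not rule it out):

  #include <chrono>   // additionally needed for the delay
  #include <thread>

  for (int i = 0; i < 2000; i++) {
    accl.send(*op_buf, count, next_rank, 0, GLOBAL_COMM, true);
    accl.recv(*res_buf, count, prev_rank, 0, GLOBAL_COMM, true);
    // arbitrary pause between iterations; value chosen for illustration only
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }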

I modified the XRT tests to show the described behavior here: https://github.com/Mellich/ACCL/blob/f805e8f87a91878228173668553ce25f9b9eaa31/test/host/xrt/test.cpp#L347

Using the branch above, the test gets stuck reliably for me when executed with the following command:

mpirun -n 2 --tag-output  test -f -x bitstream.xclbin -s 262144 -u -b 1024 -d -n 2000

Example dump of CMAC and network layer status of the UDP version after the execution got stuck:

Message size: 1MB

Output of test:

Start send recv test...
Start send recv test...
Sending data on 0 to 1...
Receiving data on 0 from 1...
Sending data on 1 to 0...
Sending data on 0 to 1...
Receiving data on 1 from 0...
Receiving data on 0 from 1...
Sending data on 1 to 0...
Receiving data on 1 from 0...
Sending data on 0 to 1...
Receiving data on 0 from 1...

Rank 0:

{'tx': {'packets': 4610,
  'good_packets': 4610,
  'bytes': 6512924,
  'good_bytes': 6512924,
  'packets_64B': 512,
  'packets_65_127B': 0,
  'packets_128_255B': 0,
  'packets_256_511B': 0,
  'packets_512_1023B': 3,
  'packets_1024_1518B': 0,
  'packets_1519_1522B': 0,
  'packets_1523_1548B': 0,
  'packets_1549_2047B': 4095,
  'packets_2048_4095B': 0,
  'packets_4096_8191B': 0,
  'packets_8192_9215B': 0,
  'packets_large': 0,
  'packets_small': 0,
  'bad_fcs': 0,
  'pause': 0,
  'user_pause': 0},
 'rx': {'packets': 3244,
  'good_packets': 3244,
  'bytes': 4352872,
  'good_bytes': 4352872,
  'packets_64B': 512,
  'packets_65_127B': 0,
  'packets_128_255B': 0,
  'packets_256_511B': 0,
  'packets_512_1023B': 2,
  'packets_1024_1518B': 0,
  'packets_1519_1522B': 0,
  'packets_1523_1548B': 0,
  'packets_1549_2047B': 2730,
  'packets_2048_4095B': 0,
  'packets_4096_8191B': 0,
  'packets_8192_9215B': 0,
  'packets_large': 0,
  'packets_small': 0,
  'packets_undersize': 0,
  'packets_fragmented': 0,
  'packets_oversize': 0,
  'packets_toolong': 0,
  'packets_jabber': 0,
  'bad_fcs': 0,
  'packets_bad_fcs': 0,
  'stomped_fcs': 0,
  'pause': 0,
  'user_pause': 0},
 'cycle_count': 21535156494}

{'tx_path': {'arp': {'packets': 512, 'bytes': 30720, 'cycles': 240149},
  'icmp': {'packets': 0, 'bytes': 0, 'cycles': 0},
  'ethernet_header_inserter': {'packets': 4098,
   'bytes': 6463764,
   'cycles': 520519},
  'ethernet': {'packets': 4610, 'bytes': 6494484, 'cycles': 57800444},
  'udp': {'packets': 4098, 'bytes': 6406392, 'cycles': 520513},
  'app': {'packets': 4098, 'bytes': 6291648, 'cycles': 520224}},
 'rx_path': {'ethernet': {'packets': 3244,
   'bytes': 4339896,
   'cycles': 57683890},
  'packet_handler': {'packets': 3244, 'bytes': 4301648, 'cycles': 57683891},
  'arp': {'packets': 512, 'bytes': 30720, 'cycles': 147925},
  'icmp': {'packets': 0, 'bytes': 0, 'cycles': 0},
  'udp': {'packets': 2732, 'bytes': 4270928, 'cycles': 280258},
  'app': {'packets': 2732, 'bytes': 4194432, 'cycles': 280255}}}

Rank 1:

{'tx': {'packets': 3244,
  'good_packets': 3244,
  'bytes': 4352872,
  'good_bytes': 4352872,
  'packets_64B': 512,
  'packets_65_127B': 0,
  'packets_128_255B': 0,
  'packets_256_511B': 0,
  'packets_512_1023B': 2,
  'packets_1024_1518B': 0,
  'packets_1519_1522B': 0,
  'packets_1523_1548B': 0,
  'packets_1549_2047B': 2730,
  'packets_2048_4095B': 0,
  'packets_4096_8191B': 0,
  'packets_8192_9215B': 0,
  'packets_large': 0,
  'packets_small': 0,
  'bad_fcs': 0,
  'pause': 0,
  'user_pause': 0},
 'rx': {'packets': 4610,
  'good_packets': 4610,
  'bytes': 6512924,
  'good_bytes': 6512924,
  'packets_64B': 512,
  'packets_65_127B': 0,
  'packets_128_255B': 0,
  'packets_256_511B': 0,
  'packets_512_1023B': 3,
  'packets_1024_1518B': 0,
  'packets_1519_1522B': 0,
  'packets_1523_1548B': 0,
  'packets_1549_2047B': 4095,
  'packets_2048_4095B': 0,
  'packets_4096_8191B': 0,
  'packets_8192_9215B': 0,
  'packets_large': 0,
  'packets_small': 0,
  'packets_undersize': 0,
  'packets_fragmented': 0,
  'packets_oversize': 0,
  'packets_toolong': 0,
  'packets_jabber': 0,
  'bad_fcs': 0,
  'packets_bad_fcs': 0,
  'stomped_fcs': 0,
  'pause': 0,
  'user_pause': 0},
 'cycle_count': 48459584151}

{'tx_path': {'arp': {'packets': 512, 'bytes': 30720, 'cycles': 147923},
  'icmp': {'packets': 0, 'bytes': 0, 'cycles': 0},
  'ethernet_header_inserter': {'packets': 2732,
   'bytes': 4309176,
   'cycles': 279613},
  'ethernet': {'packets': 3244, 'bytes': 4339896, 'cycles': 57683218},
  'udp': {'packets': 2732, 'bytes': 4270928, 'cycles': 279606},
  'app': {'packets': 2732, 'bytes': 4194432, 'cycles': 279407}},
 'rx_path': {'ethernet': {'packets': 2138,
   'bytes': 2596932,
   'cycles': 57538733},
  'packet_handler': {'packets': 2138, 'bytes': 2572824, 'cycles': 57538728},
  'arp': {'packets': 512, 'bytes': 30720, 'cycles': 240150},
  'icmp': {'packets': 0, 'bytes': 0, 'cycles': 0},
  'udp': {'packets': 1626, 'bytes': 2542104, 'cycles': 258776},
  'app': {'packets': 1626, 'bytes': 2496576, 'cycles': 258771}}}

All sent packets are also received by the network layer of the other rank, so no data seems to get lost over the link. However, there is a discrepancy between the number of packets sent and received on the two ranks. Shouldn't the packet counts be the same in this scenario?

Each recv should block the subsequent send, so rx and tx should stay in balance.

quetric commented 1 year ago

Hi @Mellich

Could you clarify this: "It occurs using the UDP and TCP stack."

I can see a mechanism whereby backpressure from the RX pipeline causes the UDP stack to drop packets. This shouldn't happen for the TCP stack though.

Please add ILAs to the input and output of the UDP stack to give us visibility into what's going on, and share the waveforms.

Mellich commented 1 year ago

Could you clarify this: "It occurs using the UDP and TCP stack."

I can see a mechanism whereby backpressure from the RX pipeline causes the UDP stack to drop packets. This shouldn't happen for the TCP stack though.

I executed the same experiment with another bitstream containing the TCP stack. There too, the execution gets stuck after several iterations when large message sizes are used. The issue occurred more frequently for message sizes of 512KB and above (-s 131072). I have mainly tested powers of two so far, and there seems to be no strict threshold; the issue just becomes more likely the larger the message. I did not observe many differences between the UDP and TCP bitstreams so far, so I would assume this is an issue independent of the network stack... or two different issues leading to similar behavior.

TristanLaan commented 1 year ago

I have been able to replicate this bug on our infrastructure with the following settings: bin/test -d -u -f -x ${XCLBIN_UDP} -b 1024 -s 131072 --rsfec -n 2000

I have slightly changed @Mellich's code in my branch so that it automatically prints some statistics if the send/recv gets stuck. I've attached the statistics that I got from our infrastructure. Most notably, the CMAC numbers seem to match, but the Network Layer numbers don't. Is this perhaps a problem with VNx rather than ACCL? (This of course doesn't explain the TCP problems.)
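
One way to automate that kind of check (a simplified sketch, not necessarily the exact code in the branch) is to time-box each iteration and dump the counters on a timeout; dump_network_stats() is a hypothetical placeholder for whatever calls actually query the CMAC and network-layer counters:

  #include <chrono>
  #include <future>
  #include <iostream>

  for (int i = 0; i < 2000; i++) {
    // run the blocking send/recv pair on a worker thread
    auto iter = std::async(std::launch::async, [&]() {
      accl.send(*op_buf, count, next_rank, 0, GLOBAL_COMM, true);
      accl.recv(*res_buf, count, prev_rank, 0, GLOBAL_COMM, true);
    });
    // if the pair does not finish within the timeout, assume it is stuck
    if (iter.wait_for(std::chrono::seconds(30)) != std::future_status::ready) {
      std::cerr << "Iteration " << i << " appears stuck, dumping stats" << std::endl;
      dump_network_stats();  // hypothetical helper: print CMAC / network layer counters
      std::abort();          // the worker thread stays blocked, so bail out after dumping
    }
  }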

Rank 0:

CMAC stats:
Cycle count: 831533921
Tx:
bad_fcs: 0
bytes: 305785848
good_bytes: 305785848
good_packets: 195108
packets: 195108
packets_1024_1518B: 0
packets_128_255B: 0
packets_1519_1522B: 0
packets_1523_1548B: 0
packets_1549_2047B: 193006
packets_2048_4095B: 0
packets_256_511B: 0
packets_4096_8191B: 0
packets_512_1023B: 566
packets_64B: 1536
packets_65_127B: 0
packets_8192_9215B: 0
packets_large: 0
packets_small: 0
pause: 0
user_pause: 0
Rx:
bad_fcs: 0
bytes: 306265447
good_bytes: 306265447
good_packets: 194461
packets: 194461
packets_1024_1518B: 0
packets_128_255B: 0
packets_1519_1522B: 0
packets_1523_1548B: 0
packets_1549_2047B: 193347
packets_2048_4095B: 0
packets_256_511B: 13
packets_4096_8191B: 0
packets_512_1023B: 567
packets_64B: 534
packets_65_127B: 0
packets_8192_9215B: 0
packets_bad_fcs: 0
packets_fragmented: 0
packets_jabber: 0
packets_large: 0
packets_oversize: 0
packets_small: 0
packets_toolong: 0
packets_undersize: 0
pause: 0
stomped_fcs: 0
user_pause: 0

Network Layer stats:
udp out bytes = 302203248
ethhi_out_bytes = 304913256
eth_out_bytes = 305005416
udp in bytes = 301939892
udp app out bytes = 296783232
udp app in bytes = 296524608

Rank 1:

CMAC stats:
Cycle count: 3733636461
Tx:
bad_fcs: 0
bytes: 306325932
good_bytes: 306325932
good_packets: 195450
packets: 195450
packets_1024_1518B: 0
packets_128_255B: 0
packets_1519_1522B: 0
packets_1523_1548B: 0
packets_1549_2047B: 193347
packets_2048_4095B: 0
packets_256_511B: 0
packets_4096_8191B: 0
packets_512_1023B: 567
packets_64B: 1536
packets_65_127B: 0
packets_8192_9215B: 0
packets_large: 0
packets_small: 0
pause: 0
user_pause: 0
Rx:
bad_fcs: 0
bytes: 305724888
good_bytes: 305724888
good_packets: 194115
packets: 194115
packets_1024_1518B: 0
packets_128_255B: 0
packets_1519_1522B: 0
packets_1523_1548B: 0
packets_1549_2047B: 193006
packets_2048_4095B: 0
packets_256_511B: 12
packets_4096_8191B: 0
packets_512_1023B: 566
packets_64B: 531
packets_65_127B: 0
packets_8192_9215B: 0
packets_bad_fcs: 0
packets_fragmented: 0
packets_jabber: 0
packets_large: 0
packets_oversize: 0
packets_small: 0
packets_toolong: 0
packets_undersize: 0
pause: 0
stomped_fcs: 0
user_pause: 0

Network Layer stats:
udp out bytes = 302737176
ethhi_out_bytes = 305451972
eth_out_bytes = 305544132
udp in bytes = 302203248
udp app out bytes = 297307584
udp app in bytes = 296783232

quetric commented 1 year ago

I traced this problem back to ACCL applying backpressure into the POE, which causes packet loss with UDP, and with TCP if RX bypass is enabled.

The current work-around is to use TCP and disable RX bypass. I made this configuration the default for now.

Keeping the issue open while I debug the cause of the backpressure.