Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

Unit Tests hang and fail on dev branch #169

Closed: Mellich closed this issue 9 months ago

Mellich commented 10 months ago

Some of the unit tests hang on the dev branch when run with a higher number of ranks (tested with 10):

Moreover, some other tests in ACCLFuncTest.* are failing.

Mellich commented 10 months ago

Fix failing tests in #170

quetric commented 10 months ago

Some tests also hang for smaller numbers of ranks, and different tests hang for different backends (UDP/TCP/RDMA). At least for alltoall and barrier, the cause is that the collective, as currently implemented in firmware, requires RDMA.
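
To illustrate the point about backend-dependent hangs, here is a minimal sketch of how a test harness could skip collectives that need RDMA when running on another backend, instead of letting them hang. This is not ACCL code: `Backend`, `RDMA_ONLY_COLLECTIVES`, `should_skip`, and `runnable_tests` are hypothetical names, and the RDMA-only set contains just the two collectives named above, which may not be the full list.

```python
from enum import Enum


class Backend(Enum):
    """Network backends mentioned in this thread."""
    UDP = "udp"
    TCP = "tcp"
    RDMA = "rdma"


# Assumption: only the collectives named in this thread are listed here.
# Other collectives may have the same restriction.
RDMA_ONLY_COLLECTIVES = frozenset({"alltoall", "barrier"})


def should_skip(collective: str, backend: Backend) -> bool:
    """Return True if the collective needs RDMA but the backend is not RDMA."""
    return collective in RDMA_ONLY_COLLECTIVES and backend is not Backend.RDMA


def runnable_tests(collectives, backend: Backend):
    """Keep only the collectives that are expected to complete on this backend."""
    return [name for name in collectives if not should_skip(name, backend)]


if __name__ == "__main__":
    # "bcast" is only a placeholder for a collective without the RDMA restriction.
    for backend in Backend:
        print(backend.name, runnable_tests(["bcast", "alltoall", "barrier"], backend))
```

Skipping explicitly makes the limitation visible in the test report, whereas a hang only shows up as a timeout with no indication of the cause.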