Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

Address not mapped: 0x78 #178

Closed: mar-ven closed this issue 5 months ago

mar-ven commented 8 months ago

Regarding the Coyote host: I assume that it crashes or hangs because of a bad configuration of the TLBs.

If the CoyoteDevice object is created with mpi_size and configure_cyt_rdma is then invoked (i.e., you're here: https://github.com/Xilinx/ACCL/blob/dev/test/host/Coyote/test.cpp#L1085), then, in initialize_accl, this function: https://github.com/Xilinx/ACCL/blob/dev/driver/xrt/include/accl.hpp#L140 crashes with "Address not mapped: 0x78". The cclo->read call fails, and the only explanation I can find is that the this->cclo pointer is somehow invalid, so read() cannot be called on it.
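A fault at a small address such as 0x78 is often what a member access through a null object pointer looks like: the faulting address is then the member's offset from address zero. Whether that is what happens here is not confirmed. The sketch below only illustrates the mechanism and a defensive check. `FakeDevice`, its layout, and `checked_read` are hypothetical stand-ins, not ACCL types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a device object; NOT the real ACCL/Coyote class.
// The layout is chosen so that `reg` sits at offset 0x78, purely for illustration.
struct FakeDevice {
    std::uint8_t padding[0x78];  // arbitrary members preceding the field
    std::uint32_t reg;           // lands at offset 0x78 by construction

    std::uint32_t read(std::size_t /*offset*/) const { return reg; }
};

// Address a load of `reg` would touch if the object pointer were null:
// base address 0 plus the member offset.
constexpr std::size_t fault_address_for_null_base() {
    return offsetof(FakeDevice, reg);
}

// Guarded read: report a null device pointer instead of dereferencing it.
inline std::uint32_t checked_read(const FakeDevice* dev, std::size_t offset) {
    if (dev == nullptr) {
        throw std::runtime_error("device pointer is null; was the device constructed?");
    }
    return dev->read(offset);
}

// Small self-check of the two helpers above.
inline void demo_null_offset() {
    assert(fault_address_for_null_base() == 0x78);
    FakeDevice dev{};
    dev.reg = 7;
    assert(checked_read(&dev, 0) == 7);
}
```

A check of this kind placed before the failing read would turn the segmentation fault into a readable error and would confirm or rule out the null-pointer hypothesis.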

I found that a possible "fix" is to comment out this line: https://github.com/Xilinx/ACCL/blob/dev/test/host/Coyote/test.cpp#L315 and to add an else branch to this if: https://github.com/Xilinx/ACCL/blob/dev/test/host/Coyote/test.cpp#L1080 that creates a CoyoteDevice with this constructor: https://github.com/Xilinx/ACCL/blob/dev/driver/xrt/src/coyotedevice.cpp#L278.

This way I can run the accl_on_coyote process without the -r flag, and the cclo object is constructed properly. cclo->read() is then invoked correctly, without the "Address not mapped" issue.
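The shape of that workaround can be sketched as follows. This is an illustration under assumptions, not the code in test.cpp: `StubDevice` is a hypothetical stand-in for CoyoteDevice, its two constructors are guesses at the general shape (the real signatures are not shown in this thread), and the `rdma` parameter assumes that the if in question tests the -r flag.

```cpp
#include <memory>

// Hypothetical stand-in for CoyoteDevice; the real constructors may differ.
struct StubDevice {
    unsigned int num_qp;

    StubDevice() : num_qp(0) {}                                // assumed non-RDMA constructor
    explicit StubDevice(unsigned int size) : num_qp(size) {}   // assumed constructor taking mpi_size
};

// Always construct a device: the sized variant when RDMA is requested,
// the plain variant otherwise, so the caller never gets an unset pointer.
inline std::unique_ptr<StubDevice> make_device(bool rdma, unsigned int mpi_size) {
    if (rdma) {
        return std::make_unique<StubDevice>(mpi_size);
    } else {
        return std::make_unique<StubDevice>();
    }
}
```

The point of the else branch is that a device object exists on both paths, so later code that reads through it does not depend on the -r flag having been passed.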

mar-ven commented 8 months ago

As further information, this message is printed: https://github.com/Xilinx/ACCL/blob/dev/driver/xrt/src/coyotedevice.cpp#L301 It looks like there is a cProc initialization error.