Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0
81 stars 26 forks source link

fixes #178 by avoiding memory leaks and implementing signal handlers #179

Closed mar-ven closed 7 months ago

mar-ven commented 7 months ago

I figured out that issue #178 ("Address not mapped") happens because, in case of crash of the ACCL constructor, or CoyoteDevice constructor, the destructors routines are not invoked, and the cProcess objects are not properly destroyed (with consequences at the Coyote driver level). I therefore introduced some smart pointers and some signal handlers to make sure that, in case of exception, cProcesses are correctly destroyed. I extended such modifications to xrt as well, to improve clean-up.

quetric commented 7 months ago

I think you forgot to add initialize() after construction in initialize_accl() of accl_network_utils.cpp which is used by the XRT tests. When I run the tests, i get an exception saying i need to call initialize()

Also i noticed that you did not prune the arguments of the ACCL constructor, so most of the arguments don't do anything now. If we're going down this route with constructor then initializer, then let's make the constructor do the bare minimum (creating a device of the appropriate type) and the initializer do all of the rest.