aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0
147 stars, 56 forks

fix: ep release in endpoint per comm #706

Closed: AmedeoSapio closed this 6 days ago

AmedeoSapio commented 6 days ago

In the endpoint-per-communicator path, the device keeps shared CQs that are used by all endpoints, so releasing an endpoint must not release the CQ. This fixes a bug in which ep_rail_release released the CQ by mistake, which caused a segfault the next time the CQ was used.
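To illustrate the ownership rule described above, here is a minimal sketch in plain C. It is not the plugin's code and makes no libfabric calls: the `sketch_*` types and functions and the `owns_cq` flag are hypothetical names invented for this example, and the actual fix may be structured differently. The idea shown is only that a rail borrowing the device's shared CQ drops its reference on release, while the device remains responsible for closing the CQ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a completion queue handle. */
struct sketch_cq {
    bool open;
};

/* Hypothetical device: owns the CQ shared by all of its endpoints. */
struct sketch_device {
    struct sketch_cq shared_cq;
};

/* Hypothetical endpoint rail: points at a CQ it may or may not own. */
struct sketch_ep_rail {
    struct sketch_cq *cq;
    bool owns_cq;   /* false when the CQ is borrowed from the device */
    bool ep_open;
};

void sketch_device_init(struct sketch_device *dev)
{
    dev->shared_cq.open = true;
}

/* Create a rail that borrows the device's shared CQ. */
void sketch_ep_rail_init_shared(struct sketch_ep_rail *rail,
                                struct sketch_device *dev)
{
    rail->cq = &dev->shared_cq;
    rail->owns_cq = false;
    rail->ep_open = true;
}

/* Release the endpoint; close the CQ only if this rail owns it. */
void sketch_ep_rail_release(struct sketch_ep_rail *rail)
{
    assert(rail != NULL);
    rail->ep_open = false;
    if (rail->owns_cq && rail->cq != NULL) {
        rail->cq->open = false;
    }
    /* Drop the reference either way; a borrowed CQ stays open. */
    rail->cq = NULL;
}

/* The device closes the shared CQ once all endpoints are gone. */
void sketch_device_release(struct sketch_device *dev)
{
    dev->shared_cq.open = false;
}
```

In this sketch, releasing one rail leaves `dev.shared_cq` open for any other rail still pointing at it, and only `sketch_device_release` closes it.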

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.