aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

Add option to abort() on error #683

Closed bwbarrett closed 3 weeks ago

bwbarrett commented 3 weeks ago

Some versions of NCCL (notably 2.21 through 2.23) do not properly handle the case where an asynchronous request returns an error, which leads to hangs for the customer. This is not ideal, especially with frameworks like JAX that do not have collective timeout monitors.

To help customers work around this issue, this patch adds an environment variable, OFI_NCCL_ABORT_ON_ERROR. When it is set to anything other than 0, the plugin calls abort() when it encounters an error.
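For illustration only, the behavior described above could look roughly like the following sketch. The function names `abort_on_error_enabled` and `handle_request_error` are hypothetical and are not taken from the plugin's source. The sketch also assumes that "anything other than 0" means a plain string comparison against "0" and that an unset variable leaves the default behavior in place; the actual patch may parse the value differently.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: interpret the raw value of OFI_NCCL_ABORT_ON_ERROR.
 * Assumption: an unset variable disables the feature, and any value other
 * than the string "0" enables it.
 * Returns 1 when abort-on-error is enabled, 0 otherwise.
 */
int abort_on_error_enabled(const char *value)
{
    if (value == NULL) {
        return 0; /* variable not set: keep the default behavior */
    }
    return strcmp(value, "0") != 0;
}

/*
 * Hypothetical error hook: log the failure, then abort the process if the
 * user asked for it. Otherwise the error code is passed back to the caller
 * unchanged, as it would be without this option.
 */
int handle_request_error(int err)
{
    fprintf(stderr, "request failed with error %d\n", err);
    if (abort_on_error_enabled(getenv("OFI_NCCL_ABORT_ON_ERROR"))) {
        abort(); /* terminate instead of risking a hang in the caller */
    }
    return err;
}
```

Under these assumptions, launching a job with `OFI_NCCL_ABORT_ON_ERROR=1` in the environment would turn a failed request into an immediate process abort, while leaving the variable unset or setting it to `0` would keep the existing error-return path.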

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.