Some versions of NCCL (notably 2.21 through 2.23) do not properly handle the case where an asynchronous request returns an error, which causes the customer's job to hang. This is especially painful with frameworks like JAX that do not have collective timeout monitors, since nothing else will detect the hang.
To help customers work around this issue, this patch adds an environment variable, OFI_NCCL_ABORT_ON_ERROR. When it is set to anything other than 0, the plugin calls abort() on error so the process terminates instead of hanging.
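For reviewers, the intended behavior can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the helper names abort_on_error_enabled() and handle_request_error() are hypothetical, and treating an empty value as unset is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 when OFI_NCCL_ABORT_ON_ERROR is set to
 * anything other than "0", and 0 otherwise. Treating an unset or empty
 * variable as disabled is an assumption of this sketch.
 */
static int abort_on_error_enabled(void)
{
    const char *val = getenv("OFI_NCCL_ABORT_ON_ERROR");

    if (val == NULL || val[0] == '\0')
        return 0;

    return strcmp(val, "0") != 0;
}

/*
 * Hypothetical helper: called with the return code of an asynchronous
 * request. On error, aborts the process if the environment variable asks
 * for it; otherwise hands the code back to the caller unchanged.
 */
static int handle_request_error(int rc)
{
    if (rc != 0 && abort_on_error_enabled()) {
        fprintf(stderr, "request failed with rc=%d, aborting\n", rc);
        abort();
    }

    return rc;
}
```

With this behavior, exporting OFI_NCCL_ABORT_ON_ERROR=1 before launching the job turns a request error into an immediate abort, while leaving the variable unset or set to 0 keeps the existing error path.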
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.