aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

Log prov_errno number in case of completion errors #460

Closed. AmedeoSapio closed this 3 months ago

AmedeoSapio commented 3 months ago

When a completion error occurs, the plugin calls fi_cq_strerror() to get a message string for that specific error. However, some errors are not recognized by libfabric, and in those cases the function just returns "Unknown error". This PR adds the prov_errno value, alongside its string representation, to the error log statement so that such cases are easier to debug.
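For illustration, the idea can be sketched roughly as follows. This is not the plugin's actual code: `sketch_cq_err_entry` is a hypothetical stand-in holding only the `err` and `prov_errno` fields of libfabric's `struct fi_cq_err_entry`, `sketch_cq_strerror()` is a hypothetical stand-in for `fi_cq_strerror()`, and the message wording and the error code 7 are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the two fields of libfabric's
 * struct fi_cq_err_entry that matter for this sketch. */
struct sketch_cq_err_entry {
    int err;        /* generic error code */
    int prov_errno; /* provider-specific error code */
};

/* Hypothetical stand-in for fi_cq_strerror(): it knows a single
 * made-up provider code and falls back to "Unknown error". */
const char *sketch_cq_strerror(int prov_errno)
{
    switch (prov_errno) {
    case 7:
        return "Sketch provider: remote unreachable";
    default:
        return "Unknown error";
    }
}

/* Build the log line with both the string and the raw prov_errno,
 * so an unrecognized code is still visible in the log.
 * Returns whatever snprintf() returns. */
int sketch_format_cq_error(char *buf, size_t len,
                           const struct sketch_cq_err_entry *entry)
{
    assert(buf != NULL && len > 0 && entry != NULL);
    return snprintf(buf, len,
                    "Request completed with error. Error: %s "
                    "(prov_errno: %d, err: %d)",
                    sketch_cq_strerror(entry->prov_errno),
                    entry->prov_errno, entry->err);
}
```

With an unrecognized code, the string part alone would read "Unknown error", while the numeric `prov_errno` in the same line still identifies what the provider reported.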

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

AmedeoSapio commented 3 months ago

bot:aws:retest