aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

rdma: add dev_id to req completion LTTNG trace point #446

Closed: taeilum00 closed this 2 weeks ago

taeilum00 commented 2 weeks ago

This adds dev_id to the NCCL_OFI_TRACE_COMPLETIONS trace point, which makes it possible to collect SEND/RECV completions per dev_id. It also brings the trace point in line with the existing NCCL_OFI_TRACE_SEND and NCCL_OFI_TRACE_RECV trace points, which already include dev_id.
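To illustrate why carrying dev_id in the completion event is useful, here is a minimal sketch in plain C of per-device aggregation of SEND/RECV completions. It does not use LTTng and is not the plugin's tracing code: every name in it (`sketch_trace_completion`, `sketch_completion_count`, `SKETCH_MAX_DEVS`, the `sketch_req_type` enum) is hypothetical and exists only for this example. It assumes that a consumer of the trace wants one counter per device and request type.

```c
#include <stddef.h>

/* Hypothetical names throughout: this is an illustration, not the
 * plugin's real trace point or any LTTng API. */
#define SKETCH_MAX_DEVS 8

enum sketch_req_type {
    SKETCH_SEND = 0,
    SKETCH_RECV = 1,
    SKETCH_REQ_TYPES = 2
};

/* Completion counters indexed by [dev_id][request type]; static storage
 * is zero-initialized. */
static unsigned long sketch_completions[SKETCH_MAX_DEVS][SKETCH_REQ_TYPES];

/* Stand-in for a completion trace event that carries dev_id.
 * Returns 0 on success, -1 if dev_id or type is out of range. */
static int sketch_trace_completion(int dev_id, enum sketch_req_type type)
{
    if (dev_id < 0 || dev_id >= SKETCH_MAX_DEVS)
        return -1;
    if (type != SKETCH_SEND && type != SKETCH_RECV)
        return -1;
    sketch_completions[dev_id][type]++;
    return 0;
}

/* Number of completions recorded for one device and request type;
 * returns 0 for out-of-range arguments. */
static unsigned long sketch_completion_count(int dev_id,
                                             enum sketch_req_type type)
{
    if (dev_id < 0 || dev_id >= SKETCH_MAX_DEVS)
        return 0;
    if (type != SKETCH_SEND && type != SKETCH_RECV)
        return 0;
    return sketch_completions[dev_id][type];
}
```

Without dev_id in the event, a consumer could only keep a single SEND counter and a single RECV counter for the whole process; with it, the counts can be broken down per device as above.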

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

a-szegel commented 2 weeks ago

bot:aws:retest... the p5 al2 stage failed due to a flaky SSH issue

sunkuamzn commented 2 weeks ago

bot:aws:retest