aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0

Reduce repetitive INFO printing #684

Closed bwbarrett closed 2 weeks ago

bwbarrett commented 3 weeks ago

Switch the prints about global registration scope, DMA-BUF support, and ordering from INFO to TRACE. They are emitted for every endpoint created, and with the endpoint-per-communicator code this results in hundreds of prints, even for a 32 GPU job.

Add a startup-time print for the global registration and DMA-BUF state, since those may prove useful.
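
To illustrate the idea (this is not the plugin's actual code), here is a minimal C sketch of the pattern: per-endpoint details are logged at TRACE, while a single summary is logged at INFO during startup. The `log_level` enum and the functions `log_msg`, `report_startup_state`, `create_endpoint`, and `simulate_job` are hypothetical names made up for this example, as are the message strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical log levels for this sketch; not the plugin's real macros. */
enum log_level { LOG_WARN = 0, LOG_INFO = 1, LOG_TRACE = 2 };

static enum log_level g_threshold = LOG_INFO; /* more verbose levels are dropped */
static int g_lines_emitted = 0;               /* number of lines actually printed */

/* Print msg only if its level is enabled; returns true if it was printed. */
static bool log_msg(enum log_level level, const char *msg)
{
    assert(msg != NULL);
    if (level > g_threshold)
        return false;
    printf("%s\n", msg);
    g_lines_emitted++;
    return true;
}

/* Called once at startup: one INFO line summarizing process-wide state. */
static void report_startup_state(bool global_registration, bool dmabuf)
{
    char buf[128];
    snprintf(buf, sizeof(buf), "global registration: %s, DMA-BUF: %s",
             global_registration ? "yes" : "no", dmabuf ? "yes" : "no");
    log_msg(LOG_INFO, buf);
}

/* Called for every endpoint: the details are only visible at TRACE. */
static void create_endpoint(int id)
{
    char buf[128];
    snprintf(buf, sizeof(buf), "endpoint %d: registration and ordering details", id);
    log_msg(LOG_TRACE, buf);
}

/* Run the startup print plus num_endpoints endpoint creations and
 * return how many log lines were emitted at the current threshold. */
static int simulate_job(int num_endpoints)
{
    g_lines_emitted = 0;
    report_startup_state(true, true);
    for (int i = 0; i < num_endpoints; i++)
        create_endpoint(i);
    return g_lines_emitted;
}
```

With the threshold at `LOG_INFO`, `simulate_job(256)` emits a single line regardless of the endpoint count. Raising the threshold to `LOG_TRACE` brings the per-endpoint lines back for debugging.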

For a 4 node allreduce test, this reduces the output from 4062 lines to 2974 lines, a reduction of about 27%. The change was inspired by having to parse through files from our 16 node runs, where the output is basically all endpoint property prints.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.