Switch the log messages about global registration scope, DMA-BUF support, and ordering from INFO to TRACE. They are printed for every endpoint created, and with the endpoint-per-communicator code this results in hundreds of prints, even for a 32-GPU job.
Add a single startup-time print of the global registration and DMA-BUF state, since that information may still prove useful.
For a 4-node allreduce test, this reduces the output from 4062 lines to 2974 lines, a reduction of about 27%. The change was inspired by having to parse through log files from our 16-node runs, which consist almost entirely of endpoint property prints.
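As a rough illustration of the pattern (not the plugin's actual code), the sketch below logs the shared state once at INFO during startup and demotes the per-endpoint message to TRACE. Every name in it (`log_msg`, `log_threshold`, `provider_state`, `create_endpoint`, `run_demo`) is hypothetical and chosen only for this example.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical log levels for this sketch; not the plugin's logging API. */
enum log_level { LOG_WARN = 0, LOG_INFO = 1, LOG_TRACE = 2 };

static enum log_level log_threshold = LOG_INFO; /* assumed default level */
static int lines_emitted = 0;                   /* counts printed lines */

/* Print a message only if its level is within the current threshold. */
static void log_msg(enum log_level level, const char *fmt, ...)
{
    va_list ap;

    if (level > log_threshold)
        return;

    va_start(ap, fmt);
    vprintf(fmt, ap);
    va_end(ap);
    putchar('\n');
    lines_emitted++;
}

/* Hypothetical state that is identical for every endpoint. */
struct provider_state {
    int global_registration;
    int dmabuf_support;
};

/* Printed once at startup, at INFO. */
static void log_startup_state(const struct provider_state *s)
{
    log_msg(LOG_INFO, "global registration: %s, DMA-BUF support: %s",
            s->global_registration ? "yes" : "no",
            s->dmabuf_support ? "yes" : "no");
}

/* Called once per endpoint; the repeated detail goes to TRACE. */
static void create_endpoint(const struct provider_state *s, int id)
{
    log_msg(LOG_TRACE, "endpoint %d: global registration %d, DMA-BUF %d",
            id, s->global_registration, s->dmabuf_support);
}

/* Create num_endpoints endpoints and return how many lines were printed. */
static int run_demo(int num_endpoints)
{
    struct provider_state s = { 1, 1 };
    int i;

    lines_emitted = 0;
    log_startup_state(&s);
    for (i = 0; i < num_endpoints; i++)
        create_endpoint(&s, i);
    return lines_emitted;
}
```

With the threshold at INFO, calling `run_demo(100)` from a `main` function prints only the startup line; raising `log_threshold` to `LOG_TRACE` restores the per-endpoint lines for debugging.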
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.