NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Is there someway to measure gpu i/o usage or allreduce waiting time? #1355

Open MaCasK9 opened 3 months ago

MaCasK9 commented 3 months ago

I am trying to optimize a deep learning training task, and I need to measure GPU I/O usage (for example, how much time the GPU's I/O is busy or idle during an iteration or an allreduce operation). I've searched around and found tools like nvidia-smi, nvprof, Nsight, and nccl-tests. However, to my understanding, nvidia-smi doesn't provide I/O usage info, nccl-tests only reports aggregate bus bandwidth, and nvprof/Nsight only produce results after the whole task ends. I haven't actually used these tools much, so please tell me if I am wrong.

So I am wondering if there is any method to monitor GPU I/O usage over a particular time window from Python, one that can be called every iteration (or at least every n iterations). A method that measures the current GPU-GPU bandwidth would also help, as long as it is not too time-consuming. I think NCCL makes such measurements when initializing its graphs? If so, is it OK to call that function during training? Since my final goal is to optimize allreduce, a method that measures how much time a GPU spent waiting during an allreduce operation would also work.
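To make the question more concrete, this is the kind of per-iteration measurement I have in mind. It is just a sketch assuming PyTorch with `torch.distributed` and the NCCL backend, and the explicit synchronize obviously perturbs compute/communication overlap, which is part of why I'm asking whether there is a better way:

```python
import torch
import torch.distributed as dist

def timed_all_reduce(tensor: torch.Tensor) -> float:
    """Return the GPU time (ms) spent in one all_reduce on the current stream."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    dist.all_reduce(tensor)        # NCCL allreduce enqueued on the current stream
    end.record()

    torch.cuda.synchronize()       # needed so elapsed_time() is valid
    return start.elapsed_time(end) # includes time spent waiting for slower ranks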

MaCasK9 commented 3 months ago

It seems that #349 has something similar to my question, but it isn't an in-training measurement.
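For the GPU I/O side, the closest thing I could sketch myself is polling NVML's PCIe throughput counters from Python each iteration. This is only an assumption of what might work (it requires the pynvml / nvidia-ml-py bindings, covers PCIe traffic but not NVLink, and each query blocks for roughly 20 ms while NVML samples), so it still isn't really what I'm after:

```python
import pynvml

pynvml.nvmlInit()
# GPU 0 here; in practice this would be the local rank's device
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def pcie_throughput_kbs() -> tuple[int, int]:
    """Sample current PCIe throughput (KB/s) for TX and RX on one GPU."""
    tx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
    rx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
    return tx, rx  # NVML averages these over a ~20 ms window
```

Is there something like this, but covering NVLink/NCCL traffic, that is cheap enough to call every iteration?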