NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Is there some way to measure GPU I/O usage or allreduce waiting time? #1355

Open MaCasK9 opened 4 months ago

MaCasK9 commented 4 months ago

I am trying to optimize a deep learning training task, and I need to measure GPU I/O usage (for example, how much time the GPU's I/O is busy or idle during an iteration or an allreduce operation). I have searched around and found tools like nvidia-smi, nvprof, Nsight, and nccl-tests. However, to my understanding, nvidia-smi does not provide I/O usage info, nccl-tests only reports overall system bandwidth, and nvprof/Nsight only produce results after the whole task ends. I haven't actually used these tools much, so please tell me if I am wrong.

So I am wondering whether there is any method to monitor GPU I/O usage over a particular time window from Python, one that can be called every iteration (or at least every n iterations). A method that measures the current GPU-GPU bandwidth would also help, as long as it is not too time-consuming. I think NCCL makes such measurements when initializing graphs? If so, is it OK to call that function during training? Since my final goal is to optimize allreduce, a method that measures how much time a GPU waited during an allreduce operation would also work, for example along the lines of the sketch below.
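To make the per-iteration idea concrete, here is a minimal sketch of what I have in mind, assuming PyTorch with the NCCL backend inside an already-initialized process group; the helper name `timed_all_reduce` is just for illustration. It times one allreduce with CUDA events, so a rank that arrives at the collective early will report a longer elapsed time, which roughly includes its waiting time:

```python
import torch
import torch.distributed as dist

def timed_all_reduce(tensor: torch.Tensor) -> float:
    """Run all_reduce on `tensor` and return its duration in milliseconds.

    The measured span covers everything between the two events on the current
    CUDA stream, so it includes any time this rank spent waiting for peers.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()            # mark the stream just before the collective
    dist.all_reduce(tensor)   # NCCL allreduce on the default process group
    end.record()              # mark the stream right after the collective
    torch.cuda.synchronize()  # needed so elapsed_time() is valid
    return start.elapsed_time(end)
```

The `torch.cuda.synchronize()` call adds overhead of its own, so I imagine one would only call something like this every n iterations rather than every step.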

MaCasK9 commented 4 months ago

It seems that #349 covers something similar to my question, but it is not an in-training measurement.