Closed: ukemamaster closed this issue 3 years ago
It seems like the data pipeline (DatasetLoader.py) is the bottleneck, but I have no idea how to pinpoint the exact location.
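One way to check this is to time the data-loading and compute portions of each iteration separately. A minimal sketch, assuming a standard PyTorch training loop (`loader`, `model`, `criterion`, and `optimizer` are placeholders, not names from the repo):

```python
import time
import torch

# `loader`, `model`, `criterion`, and `optimizer` are placeholders
# for the corresponding objects in your own training script.
data_time = compute_time = 0.0
end = time.time()
for batch, labels in loader:
    data_time += time.time() - end        # time spent waiting on the data pipeline
    start = time.time()
    loss = criterion(model(batch.cuda()), labels.cuda())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()              # wait for GPU work so the timing is honest
    compute_time += time.time() - start
    end = time.time()

print("data: %.1fs, compute: %.1fs" % (data_time, compute_time))
```

If `data_time` dominates, the DataLoader (and ultimately the disk) is the bottleneck rather than the model itself.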
Well, I moved the data from HDD to SSD and it solved my problem. I didn't know an HDD could be such a bottleneck. Anyway, on the SSD the training is now very smooth, as expected, on both single- and multi-GPU setups.
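If moving to an SSD is not an option, more DataLoader workers and prefetching can sometimes hide part of the disk latency. A sketch with purely illustrative values (`train_dataset` and all numbers are placeholders, not settings from the repo):

```python
from torch.utils.data import DataLoader

# All values are illustrative; tune num_workers to your CPU core count
# and disk throughput. `train_dataset` is a placeholder Dataset.
loader = DataLoader(
    train_dataset,
    batch_size=128,
    num_workers=8,            # worker processes prefetch batches in parallel
    pin_memory=True,          # page-locked memory speeds up host-to-GPU copies
    prefetch_factor=2,        # batches queued per worker (PyTorch >= 1.7)
    persistent_workers=True,  # keep workers alive across epochs (PyTorch >= 1.7)
)
```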
@joonson I saw very strange behavior during training. Training slows down after a few steps (generally after about 50% of the steps in the first epoch): the rate drops from 300+ Hz to less than 10 Hz. At that point, the utilization of 6 or 7 of the 8 GPUs rises from ~90% to 100%, while the remaining 1 or 2 GPUs drop to 0% utilization. No matter what training configuration I set, I see the same behavior. As a result, one epoch (on the VoxCeleb2 dev set) takes approximately 4 hours.
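To see exactly when the stragglers stall, one can log per-GPU utilization alongside training. A small standalone sketch (not part of the repo) using nvidia-smi's query flags:

```python
import subprocess
import time

# Polls nvidia-smi once per second; a straggler GPU dropping to 0%
# utilization will show up clearly in the log.
while True:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    print(time.strftime("%H:%M:%S"), "|", out.strip().replace("\n", " | "))
    time.sleep(1)
```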
Environment:
Python Version: 3.6.9 [GCC 8.4.0]
PyTorch Version: 1.8.1+cu102
Number of GPUs: 8
I tried torch.utils.bottleneck, which reports that {method 'cuda' of 'torch._C._TensorBase' objects} is the slowest part. Have you seen such behavior in your training?
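For anyone who wants to reproduce this, torch.utils.bottleneck is run as a module wrapper around the training script; the script name and arguments below are placeholders:

```
python -m torch.utils.bottleneck train_script.py [training args]
```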
Where do you think the bottleneck could be? Any tips or suggestions on this?
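Since the profiler points at the .cuda() calls, the time is likely going into synchronous host-to-device copies. One common mitigation, assuming the DataLoader is created with pin_memory=True, is to make the copies non-blocking so they can overlap with GPU compute. A minimal sketch with placeholder names (whether it helps depends on the actual pipeline):

```python
import torch

# `loader` must be built with pin_memory=True for non_blocking copies
# to actually overlap with GPU work; `model` is a placeholder.
for batch, labels in loader:
    batch = batch.cuda(non_blocking=True)    # asynchronous host-to-device copy
    labels = labels.cuda(non_blocking=True)
    loss = model(batch, labels)              # placeholder forward pass
```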