clovaai / voxceleb_trainer

In defence of metric learning for speaker recognition
MIT License

Training slows down after a few steps #110

Closed ukemamaster closed 3 years ago

ukemamaster commented 3 years ago

@joonson I saw a very strange behavior during training. The training slows down after a few steps (generally after about 50% of the steps in the first epoch). The throughput drops from 300+ Hz to less than 10 Hz. At this point, the utilization of 6 or 7 of the 8 GPUs rises from 90% to 100%, while the remaining 1 or 2 GPUs drop to 0% utilization. No matter what training configuration I set, I see the same behavior. As a result, one epoch (on the VoxCeleb2 dev set) takes approximately 4 hours.

Environment:
Python version: 3.6.9 [GCC 8.4.0]
PyTorch version: 1.8.1+cu102
Number of GPUs: 8

I tried torch.utils.bottleneck, which displays:

--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         15180044 function calls (15164020 primitive calls) in 9.575 seconds

   Ordered by: internal time
   List reduced from 2828 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      289    4.347    0.015    4.347    0.015 {method 'cuda' of 'torch._C._TensorBase' objects}
  1265289    1.193    0.000    1.951    0.000 /usr/lib/python3.6/posixpath.py:75(join)
        1    1.125    1.125    4.361    4.361 DatasetLoader.py:109(__init__)
  2442931    0.391    0.000    0.391    0.000 {method 'split' of 'str' objects}
  1266257    0.228    0.000    0.368    0.000 /usr/lib/python3.6/posixpath.py:41(_get_sep)
  2566149    0.166    0.000    0.166    0.000 {method 'append' of 'list' objects}
1309307/1309020    0.146    0.000    0.150    0.000 {built-in method builtins.isinstance}
   173830    0.140    0.000    0.147    0.000 /usr/lib/python3.6/glob.py:114(_iterdir)
  1269132    0.136    0.000    0.136    0.000 {method 'endswith' of 'str' objects}
  1276670    0.136    0.000    0.136    0.000 {method 'startswith' of 'str' objects}
  1271399    0.122    0.000    0.122    0.000 {built-in method posix.fspath}
        1    0.112    0.112    0.266    0.266 DatasetLoader.py:124(<listcomp>)
  1092543    0.099    0.000    0.099    0.000 {method 'strip' of 'str' objects}
        2    0.092    0.046    0.097    0.048 {method 'readlines' of '_io._IOBase' objects}
        1    0.064    0.064    0.742    0.742 DatasetLoader.py:59(__init__)

It seems like {method 'cuda' of 'torch._C._TensorBase' objects} is the slowest part.

Have you seen such behavior in your training?
Where do you think could be the bottleneck? Any tips or suggestions on this?
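For reference, the profile above was generated by running the training script under torch.utils.bottleneck for a few iterations, roughly like this (the script path and the placeholder for the arguments below stand in for my actual command line):

    python -m torch.utils.bottleneck ./trainSpeakerNet.py <training arguments>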

ukemamaster commented 3 years ago

It seems like the data pipeline (DatasetLoader.py) is the bottleneck, but I have no idea how to find the exact location.
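One thing I can think of is to time how long each iteration waits on the data pipeline, separately from the GPU work. This is only a rough sketch, not code from this repository; `loader` stands in for the DataLoader built around DatasetLoader.py:

    import time

    def time_data_waits(loader, max_steps=500):
        # Measure only the time spent waiting for the DataLoader to hand over
        # the next batch. If this wait grows sharply partway through the epoch,
        # the data pipeline (e.g. disk reads) is the bottleneck rather than the
        # GPU work itself.
        end = time.time()
        for step, batch in enumerate(loader):
            wait = time.time() - end
            print(f"step {step}: waited {wait:.3f} s for data")
            if step >= max_steps:
                break
            end = time.time()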

ukemamaster commented 3 years ago

Well, I moved the data from the HDD to an SSD and it solved my problem. I didn't know an HDD could be such a bottleneck. Anyway, now on the SSD the training is very smooth as expected, on both single- and multi-GPU setups.
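For anyone who cannot move the data off an HDD: as a rough, untested sketch (the dataset object and the worker/batch values below are illustrative, not this repository's defaults), giving the DataLoader more workers and pinned memory keeps more file reads in flight and can partially hide slow storage; pinned memory also speeds up the host-to-GPU copy that shows up as the slow .cuda() call in the profile above.

    from torch.utils.data import DataLoader

    # Illustrative settings only -- not the values used in this repository.
    loader = DataLoader(
        train_dataset,        # hypothetical dataset object (e.g. the one from DatasetLoader.py)
        batch_size=200,
        shuffle=True,
        num_workers=8,        # more workers keep several file reads in flight
        pin_memory=True,      # pinned memory makes the host-to-GPU copy faster
        drop_last=True,
    )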