HarryVolek / PyTorch_Speaker_Verification

PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al.
BSD 3-Clause "New" or "Revised" License

Excessive memory use due to train.num_workers > 0 #26

Closed · mkunes closed this issue 5 years ago

mkunes commented 5 years ago

Hello, thanks for sharing your code.

I've successfully tried it with TIMIT data. However, I have come across a memory issue when training on a much larger dataset (over 7000 speakers): after the first epoch, memory usage starts rapidly increasing and eventually exceeds the available RAM.

From what I've been able to find, this is actually caused by a known problem with PyTorch DataLoader:

1. CPU memory gradually leaks when num_workers > 0 in the DataLoader
2. https://forums.fast.ai/t/runtimeerror-dataloader-worker-is-killed-by-signal/31277

The above links mention some possible workarounds you might be able to use in your code. Barring that, setting train.num_workers to zero (the default is 8) works around the problem, possibly at the cost of speed, although in my case the training speed actually improved.
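For what it's worth, the mitigation most often suggested in those threads is to keep large per-item metadata (e.g. the list of feature file paths) in a numpy array rather than a Python list, so that forked worker processes don't end up copying the whole list via copy-on-write when refcounts are touched. A rough, hypothetical sketch of that idea (not code from this repo; the class and attribute names are made up):

```python
import numpy as np
from torch.utils.data import Dataset

class UtteranceDataset(Dataset):
    """Hypothetical dataset sketch illustrating the numpy-array workaround."""

    def __init__(self, file_paths):
        # Store paths as a fixed-width bytes array: one contiguous buffer,
        # no per-element Python objects whose refcounts the workers can touch.
        self.file_paths = np.asarray(file_paths, dtype=np.bytes_)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        path = self.file_paths[idx].decode("utf-8")
        # Load the precomputed feature file for this utterance.
        return np.load(path)
```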

Even if you're not able to fix this completely, I would suggest at least changing the default train.num_workers setting from 8 to 0.
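On the consumer side, that setting only affects the num_workers argument passed to the DataLoader. A minimal, self-contained sketch with dummy data (not this repo's actual training code), just to show what the workaround amounts to:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy utterance features; shapes are placeholders, not the repo's real config.
dataset = TensorDataset(torch.randn(64, 180, 40))

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,   # 0 = load batches in the main process; avoids the worker memory growth
    drop_last=True,
)

for (batch,) in loader:
    pass  # training step would go here
```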

(Note: I suspect this may also be the cause of issue https://github.com/HarryVolek/PyTorch_Speaker_Verification/issues/20)

HarryVolek commented 5 years ago

Hi, thanks for bringing this to my attention.

HarryVolek commented 5 years ago

07f996a