@carlfm01 I'm using your fork at the moment, but I changed the SimpleTrainer in train2.py back to the SyncMultiGPUTrainerReplicated so that it'll run on more than 1 GPU
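For reference, a minimal sketch of that swap using tensorpack's public API; the model and dataflow arguments are placeholders for whatever train2.py actually builds, not the repo's exact code:

```python
from tensorpack import TrainConfig, QueueInput, launch_train_with_config
from tensorpack.train import SyncMultiGPUTrainerReplicated

def launch_net2(model, dataflow, num_gpus=2):
    config = TrainConfig(
        model=model,                # the Net2 ModelDesc built in train2.py
        data=QueueInput(dataflow),  # same QueueInput feeding as the single-GPU path
    )
    # SimpleTrainer() only uses one GPU; the replicated trainer copies the
    # graph onto every GPU and averages gradients each step.
    launch_train_with_config(config, SyncMultiGPUTrainerReplicated(num_gpus))
```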
Hi, I used a single K80; net1 ran at roughly 0.6 it/s on Windows, and for net2 I can't remember the exact figure. I think I posted screenshots in a few other threads.
Here's the per-epoch time I'm seeing at the moment with SyncMultiGPUTrainerReplicated:

- 1x GPU: 9 minutes (16 cores, 70 GB RAM, 4 prefetch processes)
- 2x GPU: 14 minutes (16 cores, 70 GB RAM, 4 prefetch processes)
- 4x GPU: 36 minutes (32 cores, 120 GB RAM, 8 prefetch processes)
- 8x GPU: 1.5 hours (32 cores, 120 GB RAM, 8 prefetch processes)
I've tried using TCMalloc from Google's Perftools and it seems like that has sped things up slightly.
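The usual way to enable it is setting LD_PRELOAD when launching the script; in case it helps anyone, here is a hedged sketch (assumed library path, Linux only) that re-execs Python with TCMalloc preloaded so train2.py itself doesn't need a wrapper:

```python
import os
import sys

# Assumed install location of libtcmalloc; adjust for your distro.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

# Re-exec the current Python process with TCMalloc preloaded (Linux only).
if os.path.exists(TCMALLOC) and "LD_PRELOAD" not in os.environ:
    env = dict(os.environ, LD_PRELOAD=TCMALLOC)
    os.execvpe(sys.executable, [sys.executable] + sys.argv, env)
```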
Does anyone else have tips on getting Net2 to be more performant? I've got very little disk I/O, but almost all the CPU usage is in the multiprocessing prefetch processes.
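In case it helps, a sketch of spreading the preprocessing over more worker processes before it reaches QueueInput, assuming an older tensorpack where the multi-process runner is called PrefetchDataZMQ (newer releases rename it MultiProcessRunnerZMQ); nr_proc is just a value to tune per machine:

```python
from tensorpack.dataflow import BatchData, PrefetchDataZMQ

def build_net2_input(df, batch_size=32, nr_proc=8):
    # Run the expensive per-sample feature extraction in worker processes.
    df = PrefetchDataZMQ(df, nr_proc=nr_proc)
    # Batch afterwards so each worker only has to produce single samples.
    df = BatchData(df, batch_size, remainder=False)
    return df
```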
Yes, I'm facing the same issue. Any clues on optimizing resource usage while running train2.py? Thanks.
I'm also facing the same issue, but I can't figure out which parts of train2.py cause it. Can anyone point them out for me? Many thanks.
I've been trying to train Net2 with multiple GPUs, but the throughput drops to 0.02-0.05 it/s, whereas on a single K80 it runs at 0.38 it/s.
Most of the system time seems to be spent in the multiprocessing QueueInput/Net2DataFlow, with a very high system load average (117).
The QueueInput/queue_size looks fine at 48.129.
I tried increasing the batch_size to 64 to see if that would help.
Does anyone have advice for getting performance up during training of Net2?