@carlfm01 I'm using your fork at the moment, but I changed the SimpleTrainer in train2.py back to the SyncMultiGPUTrainerReplicated so that it'll run on more than 1 GPU
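For reference, a minimal sketch of that swap using tensorpack's public API; the model and dataflow arguments are placeholders for whatever train2.py actually builds, not the repo's exact code:

```python
from tensorpack import TrainConfig, QueueInput, launch_train_with_config
from tensorpack.train import SyncMultiGPUTrainerReplicated

def launch_net2(model, dataflow, num_gpus=2):
    config = TrainConfig(
        model=model,                # the Net2 ModelDesc built in train2.py
        data=QueueInput(dataflow),  # same QueueInput feeding as the single-GPU path
    )
    # SimpleTrainer() only uses one GPU; the replicated trainer copies the
    # graph onto every GPU and averages gradients each step.
    launch_train_with_config(config, SyncMultiGPUTrainerReplicated(num_gpus))
```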
Hi, I used a single K80; net1 ran at roughly 0.6 it/s on Windows, and for net2 I can't remember the exact figure. I think I posted screenshots in a few other threads.
Here's the per-epoch time I'm seeing at the moment with SyncMultiGPUTrainerReplicated:

- 1x GPU: 9 minutes (16 cores, 70 GB RAM, 4 prefetch processes)
- 2x GPU: 14 minutes (16 cores, 70 GB RAM, 4 prefetch processes)
- 4x GPU: 36 minutes (32 cores, 120 GB RAM, 8 prefetch processes)
- 8x GPU: 1.5 hours (32 cores, 120 GB RAM, 8 prefetch processes)
I've tried using TCMalloc from Google's Perftools and it seems like that has sped things up slightly.
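The usual way to enable it is setting LD_PRELOAD when launching the script; in case it helps anyone, here is a hedged sketch (assumed library path, Linux only) that re-execs Python with TCMalloc preloaded so train2.py itself doesn't need a wrapper:

```python
import os
import sys

# Assumed install location of libtcmalloc; adjust for your distro.
TCMALLOC = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

# Re-exec the current Python process with TCMalloc preloaded (Linux only).
if os.path.exists(TCMALLOC) and "LD_PRELOAD" not in os.environ:
    env = dict(os.environ, LD_PRELOAD=TCMALLOC)
    os.execvpe(sys.executable, [sys.executable] + sys.argv, env)
```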
Does anyone else have tips on getting Net2 to be more performant? I've got very little disk I/O, but almost all the CPU usage is in the multiprocessing prefetch processes.
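In case it helps, a sketch of spreading the preprocessing over more worker processes before it reaches QueueInput, assuming an older tensorpack where the multi-process runner is called PrefetchDataZMQ (newer releases rename it MultiProcessRunnerZMQ); nr_proc is just a value to tune per machine:

```python
from tensorpack.dataflow import BatchData, PrefetchDataZMQ

def build_net2_input(df, batch_size=32, nr_proc=8):
    # Run the expensive per-sample feature extraction in worker processes.
    df = PrefetchDataZMQ(df, nr_proc=nr_proc)
    # Batch afterwards so each worker only has to produce single samples.
    df = BatchData(df, batch_size, remainder=False)
    return df
```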
Yes, I'm facing the same issue. Any clues on optimizing resource usage while running train2.py? Thanks.
I'm also facing the same issue, but I can't figure out which parts of train2.py cause it. Can anyone point them out for me? Many thanks.
I've been trying to train Net2 with multiple GPUs, but the throughput drops to 0.02-0.05 it/s, whereas on a single K80 it runs at 0.38 it/s.
Most of the system time seems to be spent in the multiprocessing QueueInput/Net2DataFlow, with a very high system load average (117).
The QueueInput/queue_size looks fine at 48.129.
I tried increasing the batch_size to 64 to see if that would help.
Does anyone have advice for getting performance up during training of Net2?