laghai opened this issue 7 years ago
Depending on which build of Caffe you're using, the TEST phase may happen entirely on one GPU (/cc @drnikolaev). You could try setting a small batch size for your TEST data loader and a larger one for your TRAIN loader. It should be easy to find in the network description - in the first few layers.
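For reference, a minimal sketch of what those first few layers typically look like in a Caffe network description, with a separate batch size per phase (the layer names, LMDB paths, and exact batch sizes here are placeholders, not taken from this model):

```
layer {
  name: "train-data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "train_db"   # placeholder path
    backend: LMDB
    batch_size: 24       # larger batch for training
  }
}
layer {
  name: "val-data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TEST }
  data_param {
    source: "val_db"     # placeholder path
    backend: LMDB
    batch_size: 4        # small batch for validation to save memory
  }
}
```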
I tried increasing the training batch size in the network description, and the model started correctly but quickly failed with out-of-memory errors until I came all the way back down to the original batch size of 24. I also never see the other GPUs' memory filling up, so it doesn't look like training behaves differently from validation. I'm on Caffe 0.5.14.
I'm using DIGITS 6.0.0 with Caffe 0.5.14 on an 8-GPU EC2 instance (p2.8xlarge, 11.2 GB memory per GPU) with 640x640 images. Previously I was able to train the same model on a 1-GPU instance (p2.xlarge, 11.2 GB memory) with a batch size of 10 (the network default). The documentation tells me that on multi-GPU the batch size should be multiplied by the number of GPUs, but a batch size of 80 on the 8-GPU machine fails with an out-of-memory error. The largest batch size that seems to work is 24.

Moreover, it appears that the memory is not utilized evenly across the GPUs (~90% on the first, only ~20% on the other seven). This matches the lower batch size (1×10 + 7×10×0.2 = 24). The run time is also not 8x better (~5.5 hours on 1 GPU vs. ~1.5 hours on 8 GPUs). What could be the problem?
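Spelling out that arithmetic (assuming memory use scales roughly with the per-GPU batch, so ~90% on GPU 0 corresponds to a full per-GPU share of 10 images and ~20% on each of the other seven to about 2 images):

$$
\underbrace{1 \times 10}_{\text{GPU 0 at 90\%}} + \underbrace{7 \times (10 \times 0.2)}_{\text{GPUs 1-7 at 20\%}} = 10 + 14 = 24
$$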
Typical hardware status on the 8-GPU machine (GPUs 2-7 show the same as #1):

Tesla K80 (#0)
Memory: 9.98 GB / 11.2 GB (89.4%)
GPU Utilization: 96%
Temperature: 73 °C

Tesla K80 (#1)
Memory: 2.44 GB / 11.2 GB (21.8%)
GPU Utilization: 54%
Temperature: 54 °C