I have never used distributed training, but here are some differences from TensorFlow's sample:
I see. I did some more research on ResNets and on using TensorFlow, and I think I'll try to come up with something on my own. Thanks!
Hello,
first of all, thank you for this awesome project. I'm interested in training my own model. Unfortunately, with the current GPU shortage I can't buy an appropriate GPU for a reasonable price. However, I have two GTX 1060 6GB cards and could possibly get hold of more, which got me wondering whether it would be possible to train on multiple GPUs.

I read the documentation of tf.distribute.MirroredStrategy and modified the code of the train_project function along the lines of the example in the TensorFlow docs (a rough sketch of what I tried is below), but that resulted in about 2/3 of the performance of a single GPU. I should say I'm not at all familiar with neural networks; I get the basic concept, but I have never worked with a library like TensorFlow before. While researching the problem, I found that for very densely connected networks a mirrored strategy can actually hurt performance because of bandwidth limitations, and the NVIDIA X Server Settings show a peak PCIe bandwidth utilization of around 50%. The guide on the TensorFlow website claims the tf.distribute.Strategy API can be used "with minimal code changes", but I guess that depends on the model and might not be so easy in some cases.

When I start the training, the model is loaded into the memory of both GPUs and both GPUs are utilized, but I get a lot of these warnings:
So my question is: do you have an idea why training on multiple GPUs is so slow? Could it be a problem with how the neural network is architected, or did I do something wrong?
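For reference, this is roughly the pattern from the TensorFlow MirroredStrategy guide that I followed when modifying train_project. It's a simplified, self-contained sketch with a placeholder model and dummy data, not the project's actual training code:

```python
import tensorflow as tf

# Replicate the model on all visible GPUs; gradients are all-reduced every step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch size so each replica still processes a full batch.
per_replica_batch = 64
global_batch = per_replica_batch * strategy.num_replicas_in_sync

# Dummy data just to make the sketch runnable; the real project uses its own dataset.
x = tf.random.normal((1024, 32))
y = tf.random.normal((1024, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch)

with strategy.scope():
    # Model construction and compile() have to happen inside the scope
    # so the variables are mirrored across the GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

model.fit(dataset, epochs=2)
```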
Thanks for your time and effort!