deezer / spleeter

Deezer source separation library including pretrained models.
https://research.deezer.com/projects/spleeter.html
MIT License

[Discussion] How to train using more than 1 GPU? #275

Open aidv opened 4 years ago

aidv commented 4 years ago

Is this possible?

Can I train using multiple GPUs?

stickyninja3 commented 4 years ago

I don't think so. TensorFlow's documentation states that it does not place operations onto multiple GPUs automatically, and TensorFlow does not easily share graphs or sessions among multiple processes. There are some blog posts discussing this on towardsdatascience.com.
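
To illustrate, in TF 1.x an op only lands on a second GPU if you place it there yourself; a rough sketch (untested, assuming two visible GPUs):

```python
import tensorflow as tf  # TF 1.x graph mode

# Without explicit placement, everything lands on one default device.
with tf.device("/GPU:0"):
    a = tf.matmul(tf.random.uniform([512, 512]), tf.random.uniform([512, 512]))
with tf.device("/GPU:1"):  # only runs here because we said so
    b = tf.matmul(tf.random.uniform([512, 512]), tf.random.uniform([512, 512]))

# allow_soft_placement falls back gracefully if a device is missing.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run([a, b])
```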

I assume you have been able to get the training working. What is your set-up?

aidv commented 4 years ago

@stickyninja3 There's something called distributed training, which implies that it is possible.

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.
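
If I understand correctly, it might be wired into Spleeter's Estimator-based training through tf.estimator.RunConfig. A minimal toy sketch (untested; the model_fn and input_fn below are stand-ins, not Spleeter's real ones):

```python
import tensorflow as tf  # assuming TF 1.14/1.15, as used by Spleeter

def input_fn():
    # Toy dataset standing in for Spleeter's audio input_fn.
    x = tf.random.uniform([64, 10])
    y = tf.random.uniform([64, 1])
    return tf.data.Dataset.from_tensor_slices((x, y)).repeat().batch(8)

def model_fn(features, labels, mode):
    # Toy linear model standing in for Spleeter's U-Net model_fn.
    predictions = tf.layers.dense(features, 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# Replicate the model on every visible GPU; gradients are averaged in sync.
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, max_steps=100)
```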

Looking at the Spleeter source code, it also seems that multiple machines can be used to train a model.

What I wonder now is why one would support multiple machines before taking full advantage of multiple GPUs, unless the setup has multiple GPUs spread across multiple machines.

Either way, I'd love it if the Spleeter devs would address this, as it would greatly benefit the community.

So what would be nice to address is:

  1. How to train using multiple GPUs
  2. How to train using multiple machines
  3. Both of the above

mmoussallam commented 4 years ago

Hi @aidv

We have no plans to work on this feature for the moment. We don't have much experience with the distributed training strategies, and as @stickyninja3 said, it would probably require quite a lot of tuning to make it efficient.

If you feel that it can be achieved with minor changes, feel free to send us a gist of code and we'll look into it.

aidv commented 4 years ago

@mmoussallam Thank you for addressing that. I have looked into it a little bit; I don't have much knowledge of anything TensorFlow-related, but I'm learning little by little.

So what about the multiple machines?

In Spleeter's train.py, at line 95, I can see tf.estimator.train_and_evaluate(..., and tracing the train_and_evaluate function takes me to the file training.py, located in C:\Users\username\Anaconda3\Lib\site-packages\tensorflow_estimator\python\estimator.

Reading some of the comments in that file, I can see a whole bunch of info regarding distributed training. It seems very doable.
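
From those comments, the multi-machine setup seems to hinge on a TF_CONFIG environment variable set on each machine before training starts. A sketch with made-up hostnames and ports:

```python
import json
import os

# One process per machine; each gets the same cluster spec but its own task.
# Hostnames and ports below are made-up placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief":  ["machine-a.example:2222"],  # runs training, saves checkpoints
        "worker": ["machine-b.example:2222"],  # extra training worker
        "ps":     ["machine-a.example:2223"],  # parameter server holding variables
    },
    "task": {"type": "worker", "index": 0},    # identifies *this* process's role
})

# With TF_CONFIG set before the Estimator is built, the existing call at
# train.py line 95 picks it up and runs in distributed mode:
#   tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```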

stickyninja3 commented 4 years ago

Hi all,

I think distributed training is easier with version 2 of TensorFlow. The blog posts I read all stated that TensorFlow 1.13/1.14 don't share models across GPUs. It would be interesting to see what improvements could be made, but I can't even get training working on a single GPU; nothing I have tried seems to work. It would be interesting to know the exact environments you use. I have been given my dad's old work laptop, which has a GTX 1660. Going to reformat and try Ubuntu 18.04 now.
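
For comparison, the TF 2 route those blog posts describe looks roughly like this (a toy Keras model, not Spleeter's):

```python
import tensorflow as tf  # TF 2.x

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored across GPUs
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")

# fit() is unchanged; each batch is split across the replicas.
x = tf.random.uniform([256, 10])
y = tf.random.uniform([256, 1])
model.fit(x, y, batch_size=32, epochs=1)
```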

aidv commented 4 years ago

@stickyninja3 I wonder how hard it would be to convert the Spleeter code to use v2 of TensorFlow 🤔

Are you on Windows or macOS? I'm on Windows and it's actually pretty easy to get it up and running.

Give me your email and I'll send you a message.

stickyninja3 commented 4 years ago

Hi aidv,

alecjclarke@live.co.uk is my email. I have tried using Windows but couldn't get training working. I have a laptop to use: 64 GB of memory, a 9th-gen Core i7, and a GeForce GTX 1660 Ti.

It would be great to get this working.

Thanks, Alec