Multi GPU shouldn't be slower, but often the GPUs themselves are not the limiting factor. In fact, most often, I find that bus bandwidth is a serious bottleneck. Some things may be necessary to get the best speeds.
These are the steps that I took to get the best speeds. You may not need all of them, but each improved my own results.
Disabling SLI? I thought that would slow it down...why does that speed it up?
SLI is for games, where it renders each frame alternating between the cards. We're not using any of that; we're sending a different set of pictures to each card separately and telling it to process them. SLI causes a slowdown of about 15-30% for memory operations, which is most of what we're doing, so it actually leads to worse results.
Quite simply, we're using CUDA, not the graphics part of the card. Gaming benefits from SLI; we benefit from discrete cards.
Ahhh...Sokath my eyes opened...
Sokath my eyes opened ... 👍 Truly, this reference is worthy of being found on GitHub. Ha
@bobarker Not sure if there is an actual recommended batch size, but you can try increasing it to see if it helps. Note that at some point the CPU can become the bottleneck, because the CPU is used to generate the input images.
@bobarker did you test without SLI?
Also, I found many people reporting that multi_gpu_model is slower than a single GPU.
https://github.com/keras-team/keras/issues/9204
https://github.com/kuza55/keras-extras/issues/21
https://github.com/avolkov1/keras_experiments/issues/13
https://stackoverflow.com/questions/47090096/why-my-training-speed-in-keras-with-multi-gpu-model-is-worse-than-single-gpu
Each of the links you posted makes the same mistake. Multi GPU will always be slower on a per-cycle basis; that's true of all parallel operations, since they require additional management to coordinate the multiple GPUs. The advantage is that you can increase the batch size to match the number of GPUs. That is where the speedup comes from: not in the cycle speed, but in increased parallelization. In other words, it takes longer to go through a batch, but it can do a much larger batch in that time.
If they are following my multiple-GPU recommendations, they've already increased the batch size for their additional GPUs, so they'll be bypassing all the problems those links found. Any optimizations would of course be welcome.
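To make that concrete, here's a minimal sketch of the pattern, assuming Keras 2.x's keras.utils.multi_gpu_model (toy model and data, not our actual trainer, and it needs at least two visible GPUs to run):

```python
# Toy sketch, not faceswap code: scale the global batch size with the
# number of GPUs so each card keeps the same per-GPU batch it had alone.
# Assumes Keras 2.x with keras.utils.multi_gpu_model and >= 2 visible GPUs.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

NUM_GPUS = 2
PER_GPU_BATCH = 64                        # batch size one card handles well
GLOBAL_BATCH = PER_GPU_BATCH * NUM_GPUS   # 128 total, still 64 per card

# Placeholder model standing in for the real one.
model = Sequential([Dense(64, activation='relu', input_shape=(100,)),
                    Dense(1)])
parallel_model = multi_gpu_model(model, gpus=NUM_GPUS)
parallel_model.compile(optimizer='adam', loss='mse')

x = np.random.rand(1024, 100)
y = np.random.rand(1024, 1)

# Each iteration is slightly slower than a single-GPU step, but it pushes
# NUM_GPUS times as many images through per step -- that's the speedup.
parallel_model.fit(x, y, batch_size=GLOBAL_BATCH, epochs=1)
```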
Also, this https://stackoverflow.com/questions/48938728/understanding-keras-multi-gpu-model-training reports a bad loss curve with multi_gpu_model.
Anyway, I implemented better GPU management in my total refactoring.
I welcome any improvements you can make. Multi GPU is hard to get right, what with the overhead and all; after all, transferring 250 MB of weights to each GPU many times a second is a HUGE part of the overhead (but not optional, since you need them to match on both GPUs). My best ideas on this front involve using NVLink, which won't work on consumer-level cards, or using split models. I thought of trying to use the SLI bridge, but apparently you can't use that in CUDA. I don't think we can shrink the model to fit into a shader, which is what we'd need to do to trick the GPU into sharing the results over SLI.
The link you just posted is seeing that result because they're reading the weights from the wrong place. That is exactly what happens if you try to read the weights from the multi_gpu_model without keeping a reference to the original (source single-GPU) model. It comes about because the weights are not modified atomically and are being modified by the second GPU while they are being calculated. This is why you can see that the overall shape is the same (and the end points land in the same place), but the spikes happen in between the points on the first one.
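Here's a minimal sketch of the correct pattern, again assuming Keras 2.x's multi_gpu_model (toy model and data, not this repo's code): keep a reference to the template model and save/read its weights rather than the wrapper's.

```python
# Toy sketch of the pattern described above: keep the single-GPU template
# model around and read/save weights through it, not through the wrapper.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

base_model = Sequential([Dense(32, activation='relu', input_shape=(10,)),
                         Dense(1)])                       # template model
parallel_model = multi_gpu_model(base_model, gpus=2)      # training wrapper
parallel_model.compile(optimizer='adam', loss='mse')

x = np.random.rand(512, 10)
y = np.random.rand(512, 1)
parallel_model.fit(x, y, batch_size=256, epochs=1)

# The template shares its variables with the wrapper, so this is the
# stable place to save or inspect weights after training.
base_model.save_weights('weights.h5')
```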
I hope that you do more research into how multi GPU actually works before you try to implement it in your refactor, because a bad implementation is sure to cause more issues than it solves.
Your PRs already cause bigger issues and messier code than they help; that's why I will never help this repo.
Look at your commits https://github.com/deepfakes/faceswap/commits?author=bryanlyon&since=2018-02-28T20:00:00Z&until=2018-03-31T20:00:00Z : you are producing more useless words than help.
@iperov I have tried with and without SLI; same results both ways. I suspect bus speed is the limiting factor, since my CPU utilization stays under 80% on my 4790K, but I really don't know. Both 1080 Tis are running at PCIe 3.0 x8, which I would think would be enough... Thanks for the links. I read through them, and with my limited knowledge of programming it seems the problem hasn't been solved yet.
Also, I'm pretty sure I understand this correctly, but when a batch size of, say, 128 is chosen, is it split between the cards, putting 64 on each? Or does it put a unique batch of 128 on each card, as suggested (I believe that's what he meant) by @bryanlyon? If that were the case, it would explain the slight drop in iterations per second combined with double the batch size.
Multi GPU splits it, so with a batch size of 128 you are putting 64 on each card. This is why you need to multiply your single-card batch size by the number of cards you are using. For example, if you are using a batch size of 128 on one card, then with two cards you'll want to use 256 (with three cards, 128*3 = 384, and so on). This is where the speed boost comes from: each iteration will take the same amount of time (slightly more, actually), but it'll do more pictures simultaneously.
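Put another way, the rule of thumb is just this arithmetic (the helper below is purely illustrative, not part of the codebase):

```python
# Illustrative helper, not faceswap code: keep the per-card batch constant
# and scale the total batch by the number of cards.
def total_batch_size(per_card_batch, num_cards):
    return per_card_batch * num_cards

print(total_batch_size(128, 1))  # 128 -> what one card was doing
print(total_batch_size(128, 2))  # 256 -> still 128 per card on two cards
print(total_batch_size(128, 3))  # 384 -> and so on
```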
@bryanlyon, I appreciate what you've done, and you are correct: I can use much bigger batch sizes than before, which is what I typically do, and that does speed up conversion. So thanks for your help. I haven't noticed a change from disabling SLI, but it's also kind of hard to test the speed of a conversion unless I test a model from nothing to the same point.
@iperov- dude, I have seriously held off commenting because I really can't contribute much to this, as I am not a programmer and am just learning to code Python. I've also edited this post like 3 times because I don't want to be disrespectful. Typically, I feel it's not my place to comment because you collaborators are the head dogs. But you have seriously got to either shut up or get off the pot. You seriously insult every guy on here who works hard for free to bring about a program for people to use. When someone points out a very real flaw in what you do (usually backed by others), you insult them and their intelligence. From what I can see, Clorr and others are building the essence of what it is to be part of a team that respects each other and their contributions. I submitted a PR with my puny skills, the guys wanted something different, and BryanLyon here made something better and amazing! I couldn't do it and he did. End of story. I was happy to even have been a tiny part of something as cool as this. That's what teamwork is! It's not gloating over "this is MINE". Who cares! It's not like you all get to be millionaires over this stuff! You've gotten to be a huge part! Either leave and go do your own thing, or stay and learn to work with these dudes who are writing great things with these programs! And no, I am just a humble beginner (if even that), so maybe I have no voice here, but I do read EVERY single post on GitHub regarding these sites trying to learn and possibly help, and I know how to be part of a team, even if I am just the water boy.
@kellurian It's the other way around: BryanLyon did almost nothing, but spreads disinformation to everyone. He can't understand simple things, yet thinks he knows it all. BryanLyon is the type of person who talks more than he works. I don't like that.
That’s what teamwork is!
There is no teamwork at all. People with various programming approaches contribute useless things, which results in ruined code, and the repo is growing like a cancer.
I have two 1080 Tis. When I use "-g 2", both cards' utilization bounces between 30-60%. Using just one, I get almost 100% utilization. It also seems that training is about 2 to 3 times faster when using just a single card. Could there be something I have set up wrong, or am I missing something? Also, what is the recommended batch size for an 11GB 1080 Ti?