bartvm / nmt

Neural machine translation
MIT License

Multi-GPU #6

Open bartvm opened 8 years ago

bartvm commented 8 years ago

After talking to @abergeron again, it seems that multi-threading is off the table. The current approach will be multi-processing with shared memory. We will need to implement this ourselves, so we will eventually need a training script that takes as an argument the name of a shared memory region, to which it reads and writes its parameters every N batches using a particular method (I like EASGD).
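
As a very rough sketch of the idea (assuming Python's multiprocessing.shared_memory for the named region; train_one_batch and the parameter layout are placeholders, and a real implementation would also need a lock around the central update):

import numpy as np
from multiprocessing import shared_memory

def easgd_worker(region_name, local_params, batches, train_one_batch,
                 alpha=0.5, sync_every=10):
    # Attach to the named shared-memory region holding the central parameters.
    shm = shared_memory.SharedMemory(name=region_name)
    central = np.ndarray(local_params.shape, dtype=local_params.dtype,
                         buffer=shm.buf)
    for t, batch in enumerate(batches, start=1):
        train_one_batch(local_params, batch)   # ordinary SGD step on this GPU
        if t % sync_every == 0:                # communication period (N batches)
            diff = alpha * (local_params - central)
            local_params -= diff               # pull the worker towards the centre
            central += diff                    # pull the centre towards the worker
    shm.close()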

We will then need to try to get the maximum speedup out of this. We'll need to do some benchmarking to see how this scales to 2, 4, or maybe even 8 GPUs. We also need a way to still measure validation error (I guess we could have a separate GPU that copies the parameters, saves them to disk, and then performs a validation run; that way we can do early stopping).

anirudh9119 commented 8 years ago

So, what is currently the issue with multi-threading?

bartvm commented 8 years ago

I'm not sure how familiar you are with multi-threading in Python and the horror of the GIL, but I think that's basically the problem: Theano doesn't release the GIL often or long enough for Python to actually benefit from multiple threads. There might also be problems with Theano's host-device synchronisation points, which force the other threads to wait as well, but I'm not sure about that.

Either way, using two threads apparently resulted in a slowdown instead of a speedup, so we're going the multi-processing route (which is generally the recommended approach in Python). This should be fine; we just need to make sure that it is easy for the worker processes to read and write parameters in a block of shared memory, because other IPC methods would cause an unnecessary slowdown.
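
For illustration only (not necessarily how we will implement it), here is how processes can get a zero-copy numpy view of the same parameter block, so parameter reads and writes do not go through pickling or sockets:

import numpy as np
from multiprocessing import Process
from multiprocessing.sharedctypes import RawArray

n_params = 1000
shared = RawArray('f', n_params)      # float32 parameter block in shared memory

def as_view(raw):
    # Map a numpy array onto the shared block; no copy is made.
    return np.frombuffer(raw, dtype=np.float32)

def worker(raw):
    params = as_view(raw)
    params += 1.0                     # immediately visible to the other processes

if __name__ == '__main__':
    as_view(shared)[:] = 0.0
    procs = [Process(target=worker, args=(shared,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(as_view(shared)[:5])        # written to by the workers, without any copies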

anirudh9119 commented 8 years ago

Keeping the parameters the same (number_of_samples, batch_size, number_of_epochs, etc.), I ran the multi-GPU code on 1, 2, and 4 GPUs. Here are the timings for each:

Training time (max_mb): 117000.37121 s (1 GPU)
Training time (max_mb): 60079.830221 s (2 GPUs)
Training time (max_mb): 56594.253144 s (4 GPUs)

bartvm commented 8 years ago

Awesome! It's too bad to see diminishing returns so quickly (the small difference between 2 and 4 GPUs), but the 2x speedup is already a huge gain. Perhaps for more GPUs the number of updates between parameter synchronisations needs to be changed? Did you play with that?


anirudh9119 commented 8 years ago

Definitely, it improved, but the validation error was higher in that case, so I didn't pay much attention to it. I should probably play with that parameter more thoroughly.

Right now I am synchronising after every iteration (that is what the results I gave you were obtained with).

Do you think something else could be useful too?

anirudh9119 commented 8 years ago

I performed another experiment with 5 GPUs:

Training time (max_mb): 117000.37121 s (1 GPU)
Training time (max_mb): 60079.830221 s (2 GPUs)
Training time (max_mb): 56594.253144 s (4 GPUs)
Training time (max_mb): 48158.034835 s (5 GPUs)

I don't know exactly why it performed somewhat better, though.

bartvm commented 8 years ago

Do you mean the validation error increased as you increased the number of iterations between parameter synchronisations? That's interesting. If you look at the EASGD paper you'll see that they have experiments with the communication period as high as 64, and it actually decreases their validation error. I guess that is because they are overfitting on CIFAR, and a higher communication period acts as a regulariser, whereas we might still be underfitting? Do you have training curves? In that case we could consider training with a high communication period at first to get to a reasonable error very quickly, and then fine-tune with a lower communication period (or fewer GPUs) later on.
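
For illustration, such a strategy could be expressed as a simple schedule (the breakpoints and values below are made-up numbers, not tuned settings):

def communication_period(updates_done):
    # Start with a large communication period for cheap, exploratory training,
    # then shrink it to fine-tune. Purely illustrative values.
    schedule = [(0, 64), (50000, 16), (200000, 4)]
    tau = schedule[0][1]
    for start, value in schedule:
        if updates_done >= start:
            tau = value
    return tau

# A worker would then sync whenever its update counter t satisfies
# t % communication_period(t) == 0.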

Is that behaviour with 5 GPUs reproducible, by the way? Going from 2 to 4 GPUs decreases the training time by 5%, and then 4 to 5 reduces it by 15%?
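
Quick arithmetic on the timings above, just to make those percentages explicit:

# Reductions implied by the reported timings (seconds).
t2, t4, t5 = 60079.83, 56594.25, 48158.03
print((t2 - t4) / t2)   # ~0.06, i.e. roughly a 6% reduction going from 2 to 4 GPUs
print((t4 - t5) / t4)   # ~0.15, i.e. roughly a 15% reduction going from 4 to 5 GPUs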

anirudh9119 commented 8 years ago

Yes, the validation error increased as I increased the number of iterations between synchronisations. I don't have any training curves right now. I will probably run the experiments again with the communication period as a variable and see what happens.

Yes, the experiment with 5 GPUs is reproducible (I ran it twice, and the timings were 48,765 s and 48,158 s respectively).

Yesterday, I ran with 6 GPUs:

Training time (max_mb): 117000.37121 s (1 GPU)
Training time (max_mb): 60079.830221 s (2 GPUs)
Training time (max_mb): 56594.253144 s (4 GPUs)
Training time (max_mb): 48158.034835 s (5 GPUs)
Training time (max_mb): 41411.077010 s (6 GPUs)

anirudh9119 commented 8 years ago

I have a possible explanation for the weird behaviour. I was using the Kepler GPUs, and when I was using 4 GPUs, 3 of them were in one box and the other was in a different box. When I looked at the communication times between the 2 boxes, they were 10 times slower than within 1 box, hence not enough speedup when using 4 GPUs. I asked Fred, and he confirmed that this may be the reason for this behaviour.

nouiz commented 8 years ago

Just a clarification: this isn't between compute nodes. It currently only supports 1 compute node.

This is on a node that has 2 CPUs; it was between GPUs attached to different CPUs. Communication between those is much slower.

Comparing this to the case where the 4 GPUs are on the same CPU is needed to know whether that is the difference.


anirudh9119 commented 8 years ago

Thank you Fred for the update.

nouiz commented 8 years ago

See: https://github.com/mila-udem/platoon/pull/14

Can you confirm that your timing was for a fixed number of iterations? The timing done by @abergeron was done on the Kepler computer and used gpu3 to gpu7, so on the same group of GPUs with efficient communication.

nouiz commented 8 years ago

The link above says that when you raise the number of workers, you must at least update the learning rate and the alpha parameter of EASGD to keep learning efficient.

anirudh9119 commented 8 years ago

Yes, for a fixed number of iterations. I kept the learning rate constant, but I did change the alpha parameter of EASGD to 1/number_of_workers.

Also, from my experiments, increasing the batch size along with the number of workers did not help at all; in fact it made things worse. For me a batch size of 32 worked well; changing it to 64 or 128 did not help.

nouiz commented 8 years ago

Raising the number of workers can, in some sense, be seen as similar to raising the batch size. Can you add as a comparison point 2 GPUs with batch size 32 (already done) and 64 (todo), to compare against the 4 GPU training efficiency with batch sizes 32 and 64?

Otherwise you need to change the learning rate or try another alpha parameter.

anirudh9119 commented 8 years ago

Okay, I'll do the experiments and update here.

anirudh9119 commented 8 years ago

Using 4 GPUs, batch size 64, alpha = 0.5: 74041.303532 s (all 4 GPUs were on the same CPU)
Using 4 GPUs, batch size 32, alpha = 0.5: 54123.873643 s (all 4 GPUs were on the same CPU)

nouiz commented 8 years ago

The comment with the timings for 1, 2, 4, 5, and 6 GPUs was with which batch size? 32?

Is this timing for a fixed number of mini-batches, or the time to get to a certain error?

When you used a batch size of 64, did you see the same total number of examples? I suppose not, given the results you gave.

Can you run the timing with batch size 64, but with a fixed total number of examples seen, not a fixed number of batches seen? That means half the number of batches.

Can you confirm that the timing is wall-clock time between the start and end of training?

A question: do you use the "valid" command from the Controller, or do you just train? See https://github.com/mila-udem/platoon/issues/15

anirudh9119 commented 8 years ago

Yes, the rest of my experiments were with batch size = 32. The time is for a maximum number of minibatches to train on.

I saw the same number of minibatches while running with a batch size of 64.

Yes, the timing is wall-clock time between the start and end of training.

Yes, I do use the valid command from the Controller.

bartvm commented 8 years ago

So with batch size 64 it's actually slower than using 2 GPUs? That's interesting, because in the EASGD paper they use batches of size 128...

I'm just looking at the paper now, and instead of specifying α, they seem to specify β = 0.9, where β = α * p (p is the number of workers). This means that they scale α with the number of workers as you did.

Note that choosing β = pα leads to an elastic symmetry in the update rule, i.e. there exists a symmetric force equal to α(x^i_t − x̃_t) between the update of each x^i and x̃. It has a crucial influence on the algorithm's stability, as will be explained in Section 4.

However, with 4 workers, β = 0.9 implies α = 0.225, quite a bit lower, right?

There is also a footnote that reads:

Intuitively the 'effective β' is β/τ = pα = pηρ (thus ρ = β/(τpη)) in the asynchronous setting.

This means that as we increase τ, the communication period, we would have to decrease the learning rate and/or α similarly in order to keep ρ (the amount of exploration done) similar.
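
To make these relations concrete, a small illustration using the paper's parameterisation (the learning rate η below is a made-up value):

def easgd_alpha(beta=0.9, p=4):
    # beta = p * alpha, so alpha shrinks as the number of workers p grows.
    return beta / p

def easgd_rho(beta=0.9, tau=1, p=4, eta=0.01):
    # From the footnote: effective beta is beta / tau = p * eta * rho,
    # hence rho = beta / (tau * p * eta).
    return beta / (tau * p * eta)

print(easgd_alpha(beta=0.9, p=4))                  # 0.225, as computed above
print(easgd_rho(beta=0.9, tau=1, p=4, eta=0.01))   # rho with communication period 1
print(easgd_rho(beta=0.9, tau=64, p=4, eta=0.01))  # rho drops by a factor of 64 at tau = 64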

Is there a chance you could put these results in a spreadsheet somewhere, including the values for α, β, τ, p and η?

anirudh9119 commented 8 years ago

Yes, I will put these results in a spreadsheet, including the values for α, β, τ, p and η. I also tried an experiment with 4 GPUs and a batch size of 128 (with alpha = 0.5), but it took > 80,000 s, so I killed it.

nouiz commented 8 years ago

Alpha was kept constant at 0.5 in all those experiments. This does not check the efficiency of learning, just the efficiency of computation.

With a batch size of 64, since it sees the same number of batches, it is normal that it takes longer: it sees more examples.

I think we should understand why there is no more computational efficiency than we currently get with 4 GPUs before using 4 GPUs in real experiments. Can you add timing in each worker to print the time spent training vs the time spent syncing vs the time spent waiting for the lock before syncing?

Also, since we are checking the efficiency of computation, you can lower the number of minibatches (probably by a factor of 10) to make testing this faster.


bartvm commented 8 years ago

Right, I was thinking of efficiency of learning, not computation. I forgot that the number of epochs was fixed.

Why does it see more examples when the batch size is 64? I thought that one epoch would be defined as each example being seen exactly once by a GPU, in which case the batch size doesn't actually change the number of total examples seen.

anirudh9119 commented 8 years ago

It was my mistake: I was training for a maximum number of mini-batches seen, not according to the number of epochs. Nevertheless, I ran only two such experiments, one with batch size 64 and the other with batch size 128; the rest were reported with batch_size = 32 only.

bartvm commented 8 years ago

Okay, I just noticed you mentioned that above, skimmed too quickly, sorry!

So in that case I guess @nouiz is right, and we really need to figure out why the 4 GPU case is so slow.

You can print timings with these classes from Blocks by the way: https://github.com/mila-udem/blocks/blob/master/blocks/utils/profile.py

# Profile and Timer live in blocks/utils/profile.py (see the link above).
from blocks.utils.profile import Profile, Timer

profile = Profile()

with Timer('training', profile):
    pass  # training code goes here

with Timer('sync', profile):
    pass  # synchronisation code goes here

# etc.

profile.report()

anirudh9119 commented 8 years ago

So, just to be on the same page, I am now running with batch_size = 32 and alpha = 0.5 on 4 GPUs.

anirudh9119 commented 8 years ago

Summary of timings (for an equal number of examples) @bartvm @nouiz

1 GPU, batch_size = 32: 9394.34 s
1 GPU, batch_size = 64: 5952.05 s
2 GPUs, batch_size = 32, alpha = 0.50: 8120.67 s and 7999.23 s (I ran 2 experiments)
2 GPUs, batch_size = 64, alpha = 0.50: 6354.56 s
4 GPUs, batch_size = 32, alpha = 0.50: 5370.42 s
4 GPUs, batch_size = 64, alpha = 0.50: 3081.24 s
4 GPUs, batch_size = 64, alpha = 0.25: 3232.01 s
4 GPUs, batch_size = 64, alpha = 0.70: 3160.19 s

1 GPU, batch_size = 32: 9394.345238 s

1 GPU, batch_size = 64: 5952.051104 s

2 GPUs, batch_size = 32, alpha = 0.5 [same number of examples] -- I ran 2 experiments with this, and the results were consistent.

Run 1: 8120.67 s

Worker   Time_training   Time_syncing   Time_waiting (seconds)
W_1      6884.59         1172.77        28.16
W_2      6873.47         1171.27        28.27

Run 2: 7999.229232 s

Worker   Time_training   Time_syncing   Time_waiting (seconds)
W_1      6782.44         1150.57        29.77
W_2      6857.51         1053.37        29.06

2 GPUs, batch_size = 64, alpha = 0.5 [same number of examples]: 6354.56 s

Worker   Time_training   Time_syncing   Time_waiting (seconds)
W_1      5507.50         716.70         14.56
W_2      5469.34         731.22         13.46

4 GPUs, batch_size = 32, alpha = 0.5: 5370.42 s

Worker   Time_training   Time_syncing   Time_waiting (seconds)
W_1      4440.29         897.11         15.66
W_2      4517.52         806.63         13.92
W_3      4551.05         769.94         12.54
W_4      4425.10         887.74         15.40

4 GPUs, batch_size = 64, alpha = 0.25: 3232.01 s

Worker   Time_training   Time_syncing   Time_waiting (seconds)
W_1      2711.05         268.27         7.12
W_2      2910.26         303.93         8.32
W_3      2713.53         266.70         7.02
W_4      2730.54         244.57         7.97

4 GPUs, batch_size = 64, alpha = 0.5: 3081.240133 s

Worker   Time_training   Time_syncing   Time_waiting (seconds)
W_1      2767.52         271.09         7.07
W_2      2780.05         285.09         7.77
W_3      2779.13         257.14         7.60
W_4      2757.79         277.79         7.40

4 GPUs, batch_size = 64, alpha = 0.75: 3160.19 s

Worker   Time_training   Time_syncing   Time_waiting (seconds)
W_1      2782.49         278.12         7.53
W_2      2841.64         300.94         8.22
W_3      2781.12         276.35         7.27
W_4      2765.44         286.09         8.01

nouiz commented 8 years ago

Those numbers give a good speedup of 4 GPUs vs 2. What changed?


bartvm commented 8 years ago

So to summarize, with batch size 64:

1 GPU: 5952.05 s
2 GPUs: 6354.56 s
4 GPUs: 3081.24 s

And for batch size 32:

1 GPU: 9394.34 s
2 GPUs: 8120.67 s / 7999.23 s
4 GPUs: 5370.42 s

So in this case we see a significant speedup from 2 to 4, but from 1 to 2 we see very little speedup, and even a slowdown with batches of size 64? In the original runs we saw a significant speedup:

Training time (max_mb): 117000.37 s (1 GPU), 60079.83 s (2 GPUs), 56594.25 s (4 GPUs)

The repeated tests seem to suggest that the variance isn't actually that high, so it's not just measurement error, I guess. So what was different that made the original experiments give a speedup while these new ones don't show any?

anirudh9119 commented 8 years ago

I didn't change anything; I just ran the same old code with a smaller number of mini-batches.

anirudh9119 commented 8 years ago

I ran the experiments again with batch_size = 64:

1 GPU: 9718.92 s, 9512.32 s, 9634.92 s
2 GPUs: 6523.41 s, 6453.21 s, 6612.12 s

bartvm commented 8 years ago

Great, those numbers make more sense! The speedup from 1 to 2 is lower because of synchronization and locks, but from 2 to 4 it seems almost linear. I guess there was just a bug in the earlier 1 GPU runs?

anirudh9119 commented 8 years ago

I skimmed through the logs for the previous single GPU runs; everything there seems fine according to the numbers, but now it's consistent!

nouiz commented 8 years ago

I'm not convinced there are no remaining issues, but the issue could be outside platoon!

It could be that we need to select which CPU is used when a given GPU is used on a dual-CPU computer, like the Keplers.


anirudh9119 commented 8 years ago

4 GPUs: 3322.83 s, 3423.21 s (batch_size = 64)

Makes sense to me. Now I will do a hyperparameter search (α, β, τ, p and η) with respect to training and validation error, but the hyperparameters will vary with the number of GPUs, so I am planning to go with 4 GPUs. Okay?

anirudh9119 commented 8 years ago

ASGD is not doing better than EASGD for NMT, according to the validation error after training on a certain number of minibatches. I tested it for 1, 2, and 4 GPUs and let them train for 2 days. If I ignore the validation error and just compare computation time, both are approximately the same.

One of the reasons, as @bartvm pointed out, may be that the dataset I am using (Europarl) is small, and I should just move to a bigger dataset.

anirudh9119 commented 8 years ago

This is a summary of the best hyperparameters so far:

https://docs.google.com/document/d/1vhc8JZlsm5RDHAX-0u2g7KCto2X5g2M1pvPKtfd5vOQ/edit?usp=sharing