BrainJS / brain.js

🤖 GPU accelerated Neural networks in JavaScript for Browsers and Node.js
https://brain.js.org
MIT License

New Feature: Multithreaded Training #417

Open massaroni opened 5 years ago

massaroni commented 5 years ago


(I'm opening a new issue for this to start a conversation before submitting a pull request, so please let me know what you think.)

This adds new functionality to trainAsync(), so that Node.js users can utilize multiple GPUs and CPUs to train a single NeuralNetwork. This should significantly speed up training if you have a large neural net and/or a large training data set.

Is this a feature that we would want to merge into develop? [y/n]

Code

This branch, based on master, has a working example: massaroni/feature-parallel-training-m. This other branch, massaroni/feature-parallel-training, is mergeable into develop, but develop is too unstable at this point to demo the multithreaded training.

See the example in parallel-trainer-example.js. It basically just shows that the algorithm does converge. See the main functionality in parallel-trainer.js.

Documentation

trainAsync(), in parallel mode, can train a single net on multiple threads. This should speed up training for large nets, large training sets, or both.

Train a NeuralNetwork on 3 CPU threads.

  const net = new brain.NeuralNetwork();
  net
    .trainAsync(data, {
      parallel: {
        threads: 3,
        partitionSize: 1500, // optional. send a partition of 1500 items from the training set to each thread.  Raise this number to get some overlap in the training data partitions.
        epochs: 20000, // optional. limit each thread to 20,000 training runs
      },
      // ... and the usual training options
    })
    .then(res => {
      // do something with my trained network
    })
    .catch(handleError);

Train a NeuralNetwork on 6 CPU threads and 2 GPU threads.

  const net = new brain.NeuralNetwork();
  net
    .trainAsync(data, {
      parallel: {
        threads: {
          NeuralNetwork: 6,
          NeuralNetworkGPU: 2
        }
      },
      // ... and the usual training options
    })
    .then(res => {
      // do something with my trained network
    })
    .catch(handleError);

Train a single NeuralNetwork on 6 CPU threads and 2 GPU threads, and send 10x more training data to the GPUs because they can run through it faster.

  const net = new brain.NeuralNetwork();
  net
    .trainAsync(data, {
      parallel: {
        threads: {
          NeuralNetwork: {
            threads: 6,
            trainingDataSize: 2200
          },
          NeuralNetworkGPU: {
            threads: 2,
            trainingDataSize: 22000
          }
        }
      },
      // ... and the usual training options
    })
    .then(res => {
      // do something with my trained network
    })
    .catch(handleError);

Roadmap

mubaidr commented 5 years ago

Well, this sounds great! Just to update you: GPU support is already on the way, which will make brain.js super fast in both browser and Node.js environments, without requiring anything from the user side.

Coming back to this implementation, I would love to hear how you are implementing this feature, theoretically, and whether it actually works (it should reduce the iterations or the training time of the network).

In my quick tests this does not seem to help; in both cases the training iterations are more or less the same: https://repl.it/repls/WindyBossySymbol

output:

iterations: 100, training error: 0.25818311882582456
iterations: 200, training error: 0.25800706443927357
iterations: 300, training error: 0.2578366663269325
iterations: 400, training error: 0.2576723716144711
iterations: 500, training error: 0.2575143128648252
iterations: 600, training error: 0.2573622537363729
iterations: 700, training error: 0.2572154941867293
iterations: 800, training error: 0.25707281095860957
iterations: 900, training error: 0.25693213970593565
iterations: 1000, training error: 0.25679010187229867
iterations: 1100, training error: 0.2566410622835251
iterations: 1200, training error: 0.2564750828459159
iterations: 1300, training error: 0.2562730074033759
iterations: 1400, training error: 0.2559948550585196
iterations: 1500, training error: 0.2555482658486705
iterations: 1600, training error: 0.2547033046119731
iterations: 1700, training error: 0.25288047943778835
iterations: 1800, training error: 0.24880152404029698
iterations: 1900, training error: 0.24041909038012865
iterations: 2000, training error: 0.22613661489458114
iterations: 2100, training error: 0.2085759876262676
iterations: 2200, training error: 0.19321029472878642
iterations: 2300, training error: 0.1805968646341638
iterations: 2400, training error: 0.1681289670890776
iterations: 2500, training error: 0.15277506975308214
iterations: 2600, training error: 0.13133523941899217
iterations: 2700, training error: 0.10298459101445756
iterations: 2800, training error: 0.07308997065739647
iterations: 2900, training error: 0.04928476756839584
iterations: 3000, training error: 0.03366073295542782
iterations: 3100, training error: 0.02403211479609829
iterations: 3200, training error: 0.018004526605472707
iterations: 3300, training error: 0.014065349103011215
iterations: 3400, training error: 0.011368400954631375
iterations: 3500, training error: 0.009442459782012247
iterations: 3600, training error: 0.008016518676355791
iterations: 3700, training error: 0.006928103610264663
iterations: 3800, training error: 0.006075717862402963
iterations: 3900, training error: 0.0053935231070073725
{ error: 0.004998328924858023, iterations: 3969 }
normal: 5792.452ms
iterations: 100, training error: 0.2583618614108776
iterations: 200, training error: 0.2581863411353922
iterations: 300, training error: 0.258014347597987
iterations: 400, training error: 0.25784683062415414
iterations: 500, training error: 0.2576844676886525
iterations: 600, training error: 0.257527535550679
iterations: 700, training error: 0.2573757396917138
iterations: 800, training error: 0.2572280633955959
iterations: 900, training error: 0.2570824604326649
iterations: 1000, training error: 0.2569354179554155
iterations: 1100, training error: 0.2567806937857542
iterations: 1200, training error: 0.2566068921395175
iterations: 1300, training error: 0.25639189843085575
iterations: 1400, training error: 0.2560892586667549
iterations: 1500, training error: 0.25559406375235505
iterations: 1600, training error: 0.2546562652210833
iterations: 1700, training error: 0.2526846248860013
iterations: 1800, training error: 0.24843476345960558
iterations: 1900, training error: 0.2398532284194126
iterations: 2000, training error: 0.225266258589784
iterations: 2100, training error: 0.2075494480327098
iterations: 2200, training error: 0.19200110866068304
iterations: 2300, training error: 0.17873487036086755
iterations: 2400, training error: 0.16502797542351472
iterations: 2500, training error: 0.1476880848635918
iterations: 2600, training error: 0.12383270000032548
iterations: 2700, training error: 0.0943299946069216
iterations: 2800, training error: 0.06581099531372298
iterations: 2900, training error: 0.04443505464640163
iterations: 3000, training error: 0.03069075213928414
iterations: 3100, training error: 0.022194948138988882
iterations: 3200, training error: 0.01681723219724285
iterations: 3300, training error: 0.01325963734705865
iterations: 3400, training error: 0.010796844203295422
iterations: 3500, training error: 0.009021366138514925
iterations: 3600, training error: 0.0076962426081793045
iterations: 3700, training error: 0.006677933082800236
iterations: 3800, training error: 0.005875889767725823
iterations: 3900, training error: 0.005230826511272005
{ error: 0.0049966083428647536, iterations: 3942 }
parallel: 5756.476ms

Am I missing something?

massaroni commented 5 years ago

Thanks @mubaidr, this is based on parameter averaging and data parallelization. It's probably the most naive implementation possible, but that's a good start because it's easy to test, and it's running on a single machine anyway. The more sophisticated algorithms are mostly trying to deal with architectural challenges like I/O overhead and mismatching machines, so maybe we can still benefit from the naive implementation, on a single machine.

Basically, it splits the training data into partitions, one per thread, and each thread has a clone of the neural net. Each thread trains on its own partition, and then the trained nets are averaged together (mean average of corresponding weights in the nets). Then each thread is re-seeded with clones of the averaged net, rinse and repeat.
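
To make that concrete, here is a minimal sketch of the averaging step, assuming each thread's trained weights have already been flattened into a plain numeric array (this is an illustration, not the actual parallel-trainer.js code):

  // Hedged sketch: element-wise mean of corresponding weights from each
  // thread's trained clone. Assumes every weight set is a flat array of
  // numbers with the same length.
  function averageWeights(weightSets) {
    const count = weightSets.length;
    const averaged = new Array(weightSets[0].length).fill(0);
    for (const weights of weightSets) {
      for (let i = 0; i < weights.length; i += 1) {
        averaged[i] += weights[i] / count;
      }
    }
    return averaged;
  }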

I think overall we can expect that compared to single threaded training, this algorithm is always going to run through more total iterations. Ideally it should finish with fewer iterations per thread, so that training is faster. Along the way, each thread is converging in a slightly different direction, toward a local minimum in its assigned partition. If your data set has dramatic local minima, then you can configure the partitions to have some overlap, and I think that should help.
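
To illustrate the overlap idea (an assumption about the approach, not the exact partitioning code in parallel-trainer.js): when partitionSize * threads exceeds the size of the training set, neighboring partitions end up sharing items.

  // Hedged sketch: one partition per thread, wrapping around the data set
  // so that partitions overlap when partitionSize * threads > data.length.
  function partitionWithOverlap(data, threads, partitionSize) {
    const partitions = [];
    const stride = Math.floor(data.length / threads);
    for (let t = 0; t < threads; t += 1) {
      const partition = [];
      for (let i = 0; i < partitionSize; i += 1) {
        partition.push(data[(t * stride + i) % data.length]); // wrap around
      }
      partitions.push(partition);
    }
    return partitions;
  }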

That said, the xor data is a poor use case for multithreaded training because it's so small and the local minima are pretty deep. There are only 4 training data points, so the Repl.it example with 8 cpu threads doesn't even have enough data for 1 training point per thread. I think that the only value in this example is just to show that it does converge at all.

I'd like to run some benchmarks on a large data set to quantify the performance gains. Do you have a favorite large example data set that I can test it on? My personal use case is too messy to publish here.

mubaidr commented 5 years ago

> Thanks @mubaidr, this is based on parameter averaging and data parallelization. It's probably the most naive implementation possible, but that's a good start because it's easy to test, and it's running on a single machine anyway. The more sophisticated algorithms are mostly trying to deal with architectural challenges like I/O overhead and mismatching machines, so maybe we can still benefit from the naive implementation, on a single machine.
>
> Basically, it splits the training data into partitions, one per thread, and each thread has a clone of the neural net. Each thread trains on its own partition, and then the trained nets are averaged together (mean average of corresponding weights in the nets). Then each thread is re-seeded with clones of the averaged net, rinse and repeat.

Interesting. Thanks for the explanation. 👍

> That said, the xor data is a poor use case for multithreaded training because it's so small and the local minima are pretty deep. There are only 4 training data points, so the Repl.it example with 8 cpu threads doesn't even have enough data for 1 training point per thread. I think that the only value in this example is just to show that it does converge at all.

> I'd like to run some benchmarks on a large data set to quantify the performance gains. Do you have a favorite large example data set that I can test it on? My personal use case is too messy to publish here.

Well, in that case I believe some image-data-based training, or something like this, would be a helpful way to exercise this behavior: https://jsfiddle.net/8Lvynxz5/38/. I will do some testing in my free time.

Keep up the great work!

massaroni commented 5 years ago

Thanks @mubaidr, I found some image data sets on these sites, below. I'll pick a good one and run some benchmarks. Based on my schedule this week, I'll probably have an update about my findings in a few days.

deeplearning.net - Datasets
skymind - Open Datasets

Rocketblaster247 commented 5 years ago

Anything to make training faster!

massaroni commented 5 years ago

My findings:

It looks like the multithreaded trainer can get better performance than a single thread. The results are encouraging so far: with a low learning rate and good tuning, I'm getting ~10,000x better performance, which is very surprising, and with normal learning rates, I'm getting a roughly sub-linear performance gain with respect to thread count, as expected. This is by no means an exhaustive analysis, so I'm presenting it as an alpha, looking forward to getting some feedback. I could go nuts charting results in all different scenarios, but I think this is a good proof of concept so far, and I want to get some feedback about it first.

Methods: In this benchmark, performance is measured both in wall-clock time and in item-iterations per thread (data size * iterations). Compared to single-threaded training, I expected multithreaded training to get a lower wall-clock time, more total iterations, and fewer item-iterations per thread. I expected the performance boost to be roughly linear, proportional to the number of threads.
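
As a worked example of that metric (my own illustration, not code from the benchmark script):

  // item-iterations per thread = items in the thread's partition * iterations run
  const itemIterationsPerThread = (partitionSize, iterations) => partitionSize * iterations;
  itemIterationsPerThread(1500, 10); // => 15000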

This feature is designed for a large data set or a large net, or both. So at first, I started testing on the MNIST handwritten digit database, which is rather large and requires a large net to process it. However, my cycle time was too slow, so I had to find a smaller data set to work with. I ended up doing a simple math function approximation, because it was easy to generate the data on the fly and quick to test that it converges on a single thread. Then I dialed down the learning rate, simply to necessitate more training iterations so that we can show a more robust comparison. Note that I did commit the code that reads in the MNIST database, so that's still included in this branch.
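
For reference, generating that kind of training data on the fly looks roughly like this (a hedged stand-in; the benchmark's actual target function isn't reproduced here, so sin is used as an example):

  // Hedged sketch: on-the-fly training data for a simple function
  // approximation, using y = sin(pi * x) as a stand-in target function.
  const generateTrainingData = count =>
    Array.from({ length: count }, () => {
      const x = Math.random();
      return { input: [x], output: [Math.sin(Math.PI * x)] };
    });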

Conclusions:

2) You want to limit the iterations for trainer threads to a very low number, so that they can merge their results frequently enough. I had good results with a per-thread iteration limit of 10 or less.
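
As a hedged configuration sketch of that conclusion, using the option names from this branch's proposed API (they may change; see the naming discussion below):

  // Keep the per-thread iteration limit low so thread results get
  // averaged (merged) frequently. Option names follow the proposal in
  // this thread and are not final.
  net.trainAsync(data, {
    parallel: {
      threads: 4,
      epochs: 10, // limit each thread to ~10 training runs between merges
    },
  });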

You can run the benchmarks yourself with this script: benchmark/index.js (based on master) or benchmark/index.js (based on develop).

node benchmark/

and you should get some results like this:

////// Benchmark Results //////
Single-thread 0.0001 LR
     runtime =  754 seconds
     item iterations per thread =  154380000
     error =  0.004999092472072885
     test error =  0.005060350515187244
2 Threads 0.0001 LR Overlapping Partitions
     runtime =  11 seconds
     item iterations per thread =  17600
     error =  0.004849751772040714
     test error =  0.006833252998457533
4 Threads 0.0001 LR Overlapping Partitions
     runtime =  6 seconds
     item iterations per thread =  10000
     error =  0.004585786514230911
     test error =  0.007337232050197451
Single-thread 0.001 LR
     runtime =  6 seconds
     item iterations per thread =  15000
     error =  0.004750009322000309
     test error =  0.00805988373712574
2 Threads 0.001 LR Overlapping Partitions
     runtime =  5 seconds
     item iterations per thread =  8000
     error =  0.0030617683728257285
     test error =  0.00814376965453241
Single-thread 0.01 LR
     runtime =  3 seconds
     item iterations per thread =  5000
     error =  0.000601457041516575
     test error =  0.014998172604486229
2 Threads 0.01 LR Overlapping Partitions
     runtime =  1 seconds
     item iterations per thread =  3200
     error =  0.0011902072636906383
     test error =  0.00786050391168138

Thoughts?

robertleeplummerjr commented 5 years ago

I think this is simply fantastic! I'm trying to focus my efforts on getting GPU support and the new API for network composition finished and tested. I say continue, and when you get your end more polished, and mine as well, we'll converge?

Curious, in brain we use "iterations", would you be opposed to changing "epochs" to match? That is of course if they are synonymous.

robertleeplummerjr commented 5 years ago

The amount of typos that I had to correct from my own typing had me believing my phone to be possessed...

Joinfield commented 5 years ago

Just amazing! Especially multi threaded GPU training!

massaroni commented 5 years ago

Thanks guys!

> Curious, in brain we use "iterations", would you be opposed to changing "epochs" to match? That is of course if they are synonymous.

Thanks @robertleeplummerjr, they are roughly synonymous, but now there are two levels of "iterations" to consider: on a low level, each thread is running a synchronous net.train() function, which runs through multiple iterations as usual. On a higher level, there's a loop that aggregates the results of each thread, and re-seeds a new batch of threads, and I was calling that higher level the "epochs".

I'll rename it so that it's more consistent, maybe "iterations" for the high level and "iterationsPerThread" for the lower level? IMHO, BrainJS's user friendly framework is initially one of its biggest attractions, so I want to make sure that multithreaded training is equally easy to use.
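
As a hedged sketch of how that renaming might look in the options object (hypothetical, since the rename hasn't landed yet):

  // Hypothetical option names, per the renaming proposed above:
  //   iterations          -> high-level merge/re-seed cycles
  //   iterationsPerThread -> synchronous train() iterations per cycle
  net.trainAsync(data, {
    parallel: {
      threads: 4,
      iterations: 2000,
      iterationsPerThread: 10,
    },
  });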

robertleeplummerjr commented 5 years ago

> IMHO, BrainJS's user friendly framework is initially one of its biggest attractions, so I want to make sure that multithreaded training is equally easy to use.

I love it!

robertleeplummerjr commented 5 years ago

> I'll rename it so that it's more consistent, maybe "iterations" for the high level and "iterationsPerThread" for the lower level?

I like the clarity there!

massaroni commented 5 years ago

Update: I got multithreaded training working for LSTMTimeStep now too. In this benchmark, it looks like it trains about 3x faster than the single threaded trainer, and in this case it doesn't get any faster when you throw more than 2 threads at it. I think that's just a function of the complexity and size of the dataset and the size of the hidden layers in the net. I would expect that larger more complex data sets and larger nets would benefit from more threads.

Here are some typical results, running on a 2019 MacBook Pro with a 2.6 GHz Intel Core i7.

LSTMTimeStep 4 Threads 0.00005 LR
     runtime =  22 seconds
     item iterations per thread =  9328
     error =  0.0001979265328762787
     test error =  0.00830609840737476
LSTMTimeStep 3 Threads 0.00005 LR
     runtime =  20 seconds
     item iterations per thread =  9900
     error =  0.0001997905798877278
     test error =  0.013913493331929247
LSTMTimeStep 2 Threads 0.00005 LR
     runtime =  20 seconds
     item iterations per thread =  8460
     error =  0.00019971834864312185
     test error =  0.006952784224905751
LSTMTimeStep Single Thread 0.00005 LR
     runtime =  59 seconds
     item iterations per thread =  416000
     error =  0.00019653613155242052
     test error =  0.009126474910225139

See benchmark/index.js.
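
For context, here is a hedged usage sketch, assuming LSTMTimeStep picks up the same parallel trainAsync options as NeuralNetwork in this branch:

  // Hedged sketch: multithreaded training of an LSTMTimeStep, assuming it
  // accepts the same `parallel` options as NeuralNetwork in this branch.
  const net = new brain.recurrent.LSTMTimeStep();
  net
    .trainAsync(timeSeriesData, {
      parallel: { threads: 2 },
      // ... and the usual training options
    })
    .then(res => {
      // forecast with the trained net
    })
    .catch(handleError);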

goferito commented 4 years ago

What's the situation with this? Is it merged?

mubaidr commented 4 years ago

Not really. But the newer version has GPU integration, which is much faster than this implementation.

goferito commented 4 years ago

Seems to be broken at the moment: https://github.com/BrainJS/brain.js/issues/364#issuecomment-603975334

mubaidr commented 4 years ago

It's in a beta state, expected to be released very soon.

We do have plans to implement this too. It might help users with powerful CPUs but no GPU.

goferito commented 4 years ago

If I understood it correctly, it even allows using both CPU and GPU, right? That would be really cool. Thanks a lot, guys, for doing this.

mubaidr commented 4 years ago

Yes, exactly. But how much it will actually affect performance when using the GPU (the GPU is already many times faster than the CPU) is yet to be seen.

bor8 commented 4 years ago

I would like to donate something if I could use four cores at the same time instead of one, in my case! Is this also intended for LSTM (not LSTMTimeStep)?

blackforest-t commented 4 years ago

> I would like to donate something if I could use four cores at the same time instead of one, in my case! Is this also intended for LSTM (not LSTMTimeStep)?

I'd like to know about it too.

unicorn-style commented 2 years ago

Problem: training on 1 thread is faster than on more than 1 thread.

VM with 4 vCPUs on ESXi, Node 16; I changed the module type to CommonJS to get it working. Trying my dataset and my script, a single thread is faster than when you set the configuration with parallel.

Any thoughts?

////// Benchmark Results //////
LSTMTimeStep 4 Threads 0.00005 LR
     runtime =  2002 seconds
     item iterations per thread =  80675
     error =  0.00019999808332483684
     test error =  0.01930622798077797
LSTMTimeStep 3 Threads 0.00005 LR
     runtime =  2357 seconds
     item iterations per thread =  108120
     error =  0.0001999364159483876
     test error =  0.01775701457418095
LSTMTimeStep 2 Threads 0.00005 LR
     runtime =  716 seconds
     item iterations per thread =  53340
     error =  0.0001997283394060407
     test error =  0.01856114064901209
LSTMTimeStep Single Thread 0.00005 LR
     runtime =  3973 seconds
     item iterations per thread =  95746200
     error =  0.00019995401834603388
     test error =  0.014280172646737196
2 Threads 0.0001 LR Overlapping Partitions
     runtime =  0 seconds
     item iterations per thread =  19200
     error =  0.004623700066624818
     test error =  0.006383759797960875
4 Threads 0.0001 LR Overlapping Partitions
     runtime =  0 seconds
     item iterations per thread =  15500
     error =  0.003654747346571941
     test error =  0.007451617616469838
Single-thread 0.001 LR
     runtime =  0 seconds
     item iterations per thread =  0
     error =  0.003913228313435196
     test error =  0.007906918111712248
2 Threads 0.001 LR Overlapping Partitions
     runtime =  0 seconds
     item iterations per thread =  8000
     error =  0.004012907401541029
     test error =  0.006790354935259165
Single-thread 0.01 LR
     runtime =  0 seconds
     item iterations per thread =  0
     error =  0.0005127324836436953
     test error =  0.014539566666571174
2 Threads 0.01 LR Overlapping Partitions
     runtime =  0 seconds
     item iterations per thread =  3200
     error =  0.00024371662900630114
     test error =  0.008307541957539067

richiedevs commented 2 years ago

I'd love Multithreaded Training

imkane commented 1 year ago

Can't wait for this to be available for LSTMTimeStep :grin:

shestakov-vladyslav commented 1 year ago

Desired

gokaybiz commented 1 year ago

> I would like to donate something if I could use four cores at the same time instead of one, in my case! Is this also intended for LSTM (not LSTMTimeStep)?

Same here in 2023