AccelerateHS / accelerate

Embedded language for high-performance array computations
https://www.acceleratehs.org

[BUG] Unexpectedly long phases when training a neural network #475

Open avarsh opened 4 years ago

avarsh commented 4 years ago

Description: I created a small neural network that can use either accelerate or hmatrix for the matrix calculations, and trained it for 100 epochs (iterations). Training took several seconds on the accelerate backend, as opposed to a few milliseconds with hmatrix.

With 2 epochs, the debug output from the code was: https://gist.github.com/avarsh/8bdb89f80d3987c9f4aea52c4d7a7149 while with 100 epochs of training, the output becomes: https://gist.github.com/avarsh/89de99cd4869649f523db681105a90b7. In the latter output, some phases, such as array-fusion, take an unexpectedly long time. In another test, CPU.run was called on the weight and bias arrays at each step. This gave some improvement, but some phases still took longer than expected, particularly towards the end of training; see the following truncated output: https://gist.github.com/avarsh/cc976140767252f6e3ba81d7efe50323

Steps to reproduce: Run the code provided here: https://gist.github.com/avarsh/286f06133787e64e74574f86f3cf8bf4 (does not call CPU.run on the arrays at each step of training), and https://gist.github.com/avarsh/58373df585b6ef64a36e2c4b90c85206 (calls CPU.run).

Expected behaviour: This program is not expected to take longer than a second to run on this machine; training for ~1000 epochs is expected to complete within a second.

Your environment: Run on an Intel i5-6500 CPU (running single-threaded).

tmcdonell commented 4 years ago

Because train is a recursive Haskell function, this amounts to loop unrolling in the Accelerate program; you are just building up a larger and larger embedded program, which the rest of the compiler then has to chew on (for no good reason). What you should do is either:

  1. rewrite this into a loop in the embedded program (using awhile); or
  2. split the loop body into a function you feed to runN (to compile once) and then repeatedly apply it (via Haskell recursion)

Either of those should work around your immediate problem. Of course I'd still like to improve the performance of the compiler internals, but that will take much more time than for you to just give it a simpler program to begin with.
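As a rough illustration of the two options above, here is a minimal sketch. It does not use the network from the linked gists: `step` is a hypothetical stand-in for one training epoch, and `trainEmbedded`, `trainHost` and `stepCompiled` are names made up for this example. It assumes the accelerate and accelerate-llvm-native packages are available.

```haskell
import qualified Data.Array.Accelerate             as A
import qualified Data.Array.Accelerate.LLVM.Native as CPU
import           Data.Array.Accelerate             (Acc, Scalar, Vector, Z(..), (:.)(..))

-- Hypothetical stand-in for one training epoch (NOT the gist's network):
-- a toy update that pulls every weight towards 3.
step :: Acc (Vector Float) -> Acc (Vector Float)
step = A.map (\w -> w - 0.1 * (w - 3))

-- Option 1: keep the loop inside the embedded program using awhile.
-- The loop state pairs the weights with an epoch counter.
trainEmbedded :: Int -> Acc (Vector Float) -> Acc (Vector Float)
trainEmbedded epochs ws0 = A.afst (A.awhile continue body start)
  where
    start :: Acc (Vector Float, Scalar Int)
    start = A.lift (ws0, A.unit (A.constant (0 :: Int)))

    -- Keep looping while the counter is below the requested epoch count.
    continue :: Acc (Vector Float, Scalar Int) -> Acc (Scalar Bool)
    continue st = A.unit (A.the (A.asnd st) A.< A.constant epochs)

    -- One iteration: update the weights and bump the counter.
    body :: Acc (Vector Float, Scalar Int) -> Acc (Vector Float, Scalar Int)
    body st = A.lift (step (A.afst st), A.map (+ 1) (A.asnd st))

-- Option 2: compile the loop body once with runN. Defined at the top level
-- so the compiled function is shared across all applications.
stepCompiled :: Vector Float -> Vector Float
stepCompiled = CPU.runN step

-- The recursion happens in Haskell over already-evaluated arrays, so the
-- embedded program never grows with the number of epochs.
trainHost :: Int -> Vector Float -> Vector Float
trainHost epochs = go epochs
  where
    go :: Int -> Vector Float -> Vector Float
    go 0 ws = ws
    go n ws = go (n - 1) (stepCompiled ws)

main :: IO ()
main = do
  let ws0 = A.fromList (Z :. 4) [0, 1, 2, 10] :: Vector Float
  print (A.toList (CPU.run (trainEmbedded 100 (A.use ws0))))
  print (A.toList (trainHost 100 ws0))
```

In both variants the program handed to the compiler has a fixed size that does not depend on the epoch count; only the number of times it is executed changes.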

For reference, here's the -ddump-simpl-stats output for one step of the 100-epoch program (you are asking it to do a lot!):

```
Total ticks:     627376

  8744 Inline
    8744 Var
 25510 RuleFired
    5576 zipWithD
    3984 backpermuteD
    2800 aletD/float
    2792 generateD
    2788 replicateD
    2384 aletD/bind
    1993 mapD
    1992 x*1
     800 aletD/eliminate
     397 commutes (*)
       4 reshapeD
 34544 BetaReduce
   34544 inline exp
485729 Substitution
  199868 rebuild
  175976 weakenE
   60512 shrinkE
   32269 weaken
    8744 inline
    5172 strengthenE
    2392 replaceE/shape
     796 replaceE/!
 72849 SimplifierDone
   72849
```