Open avarsh opened 4 years ago
Because `train` is a recursive Haskell function, this amounts to loop unrolling in the Accelerate program; you are just building up a larger and larger embedded program, which the rest of the compiler then has to chew on (for no good reason). What you should do is either:

- express the iteration in the embedded program (via `awhile`); or
- use `runN` (to compile once) and then repeatedly apply it (via Haskell recursion)

Either of those should work around your immediate problem. Of course I'd still like to improve the performance of the compiler internals, but that will take much more time than for you to just give it a simpler program to begin with.
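A minimal sketch of the `runN` route, assuming a toy step function (`A.map (* 0.5)`) standing in for the real weight update; the reference Interpreter backend is used here so it runs without LLVM, but `Data.Array.Accelerate.LLVM.Native` works the same way:

```haskell
import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter as I  -- swap for LLVM.Native in real use

-- Hypothetical stand-in for one training step; the real one
-- updates the weight and bias arrays.
step :: Acc (Vector Float) -> Acc (Vector Float)
step = A.map (* 0.5)

-- Compile the step function ONCE with runN ...
stepC :: Vector Float -> Vector Float
stepC = I.runN step

-- ... then drive the epoch loop from plain Haskell, applying the
-- already-compiled function each time. The embedded program stays
-- one step deep no matter how many epochs we run.
trainN :: Int -> Vector Float -> Vector Float
trainN 0 w = w
trainN n w = trainN (n - 1) (stepC w)

main :: IO ()
main = print (trainN 100 (A.fromList (Z :. 4) [1, 2, 3, 4]))
```

The `awhile` route instead moves the loop into the embedded program itself, so the iteration count never appears in the size of the term handed to the compiler.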
For reference, here's the `-ddump-simpl-stats` output for one step of the 100-epoch program (you are asking it to do a lot!):
**Description**

I created a small neural network using both accelerate and hmatrix to perform the matrix calculations, and trained it for 100 epochs (iterations). Training took several seconds with the accelerate backend, as opposed to a few milliseconds with hmatrix.
With 2 epochs, the debug output from the code was: https://gist.github.com/avarsh/8bdb89f80d3987c9f4aea52c4d7a7149, while with 100 epochs of training, the output becomes: https://gist.github.com/avarsh/89de99cd4869649f523db681105a90b7. In the latter output, some phases, such as array-fusion, take an unexpectedly long time. In another test, CPU.run was called on the weights and biases arrays at each step; this improved matters, but some phases still took longer than expected, particularly towards the end of training - see the following truncated output: https://gist.github.com/avarsh/cc976140767252f6e3ba81d7efe50323
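For illustration, a hypothetical reconstruction of the shape of the slow version (not the actual gist code, and with a toy `A.map (* 0.5)` in place of the real update): `train` recurses in Haskell, so each epoch wraps the previous embedded term in another layer, and the single `run` at the end receives a 100-step unrolled program.

```haskell
import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter as I  -- reference backend, for the sketch

-- Hypothetical toy step; the real code updates weights and biases.
-- Each recursive call adds another A.map layer to the embedded term,
-- so the program handed to `run` grows linearly with the epoch count.
train :: Int -> Acc (Vector Float) -> Acc (Vector Float)
train 0 w = w
train n w = train (n - 1) (A.map (* 0.5) w)

main :: IO ()
main = print (I.run (train 100 (A.use (A.fromList (Z :. 4) [1, 2, 3, 4]))))
```

Calling CPU.run at every step forces each epoch through the full compilation pipeline separately, which is why it only partially helps.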
**Steps to reproduce**

Run the code provided here: https://gist.github.com/avarsh/286f06133787e64e74574f86f3cf8bf4 (does not call CPU.run on the arrays at each step of training), and https://gist.github.com/avarsh/58373df585b6ef64a36e2c4b90c85206 (calls CPU.run).
**Expected behaviour**

This program is not expected to take longer than a second to run on this machine - training for ~1000 epochs should complete within a second.
**Your environment**

Run on an Intel i5-6500 CPU (single threaded).