diku-dk / futhark-benchmarks

Various Futhark benchmark programs

Matrix multiply benchmark request #4

Closed. mrakgr closed this issue 8 years ago.

mrakgr commented 8 years ago

Besides #3, I have a question on matrix multiplication. How does its performance compare to the hardcoded library functions? If it is something like a third of theirs, then it might be possible to speed it up by more than double.

I had wanted to suggest a few days ago that the most ideal solution for making Futhark great for machine learning would be to have an assembly-optimized matrix multiply as part of its core library. Really, as I mulled it over, Futhark does not need an FFI, but only this single piece and maybe a fast convolution to meet this goal. But when I did some research, I realized that unlike standard C, OpenCL does not allow complete access to the hardware. More specifically, CUDA devices compile to an intermediate language, PTX, and there is no way to go lower than that. Going lower, to SASS, would be necessary to draw out the full 100%... though reading that link, I doubt that Futhark doing optimizations at the assembly level would ever be feasible, even with access to SASS. What I wanted to suggest was just a fancy idea.

I think doing global optimizations for RNN architectures could go a long way on its own.

For example, two of the big symbolic-differentiation machine learning libraries, Theano and TensorFlow, in fact do code generation internally, and large parts of them could be superseded by Futhark's functionality. Theano in particular is infamous for having ridiculous compile times for recurrent nets.

For the sake of potential future machine learning endeavors, it might be of interest to track matrix multiply performance over time. Furthermore, for neural nets it would also be of interest to know not just how fast Futhark does Ax, but also Ax+b and o(Ax+b), where o() is a non-linear activation function and +b is a broadcast addition.

Furthermore, recurrent nets also have layers of the form o(Ax+Bh+b).

LSTM cells in particular have gates that all receive similar inputs and previous states like so:

block_input = o(A_1 x + B_1 h + b_1)
input_gate  = o(A_2 x + B_2 h + b_2)
forget_gate = o(A_3 x + B_3 h + b_3)
output_gate = o(A_4 x + B_4 h + b_4)

A common optimization is to concatenate the four matrices above into one and do the four matrix multiplies as a single operation. It would be interesting to know whether Futhark does that optimization automatically.
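
To illustrate the idea, here is a rough NumPy sketch (not Futhark; the sizes and names are arbitrary) of computing the four gate pre-activations with one concatenated multiply instead of four separate ones:

import numpy as np

m, o, n = 32, 16, 8                                   # hidden size, input size, batch size
rng = np.random.default_rng(0)
A = [rng.standard_normal((m, o)) for _ in range(4)]   # A_1..A_4
x = rng.standard_normal((o, n))

separate = [A_i @ x for A_i in A]                     # four separate (m, n) multiplies

A_cat = np.concatenate(A, axis=0)                     # stack into one (4*m, o) matrix
fused = np.split(A_cat @ x, 4, axis=0)                # one multiply, then split

assert all(np.allclose(s, f) for s, f in zip(separate, fused))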

As an aside, the above actually made me throw in the towel on my own machine learning library, because I realized that writing things like concatenation kernels and splitting kernels, and fusing them manually with broadcast addition and non-linear activation, would be far too much of a bother. It is not that I could not do it, it is just that it would take too much time. Let the paid Google engineers do those things in TensorFlow, or alternatively, Futhark could do it. Futhark could do even more and actually fuse all the matrix multiplies, bias additions and activations into a single function.

If you are interested, I could do an LSTM cell forward pass as a Futhark benchmark, though I am not familiar enough with OpenCL to get a matrix multiply working right now.

Would that be fine?

...Also, the first question was whether a simple matrix multiply benchmark had been done. :)

athas commented 8 years ago

We have a simple naively written matrix multiply program: https://github.com/HIPERFIT/futhark/blob/master/data/tests/MatMultFun.fut

It does not perform particularly well compared to hand-optimised implementations, although I haven't measured exactly how bad it is. I know why it cannot perform well, though: it's because Futhark is not yet able to do block tiling and caching in local memory, which is a must-have trick for matrix multiplication. It's somewhere on the list of things that we think we can do fairly easily, but it's not at the top of the list yet.

Futhark has pretty good fusion, but it will not fuse across concats and splits. We do have some idea of how that could be done, but I'd personally like some convincing benchmark programs to get an idea of how and when that would be useful.

With regard to 'o(Ax+b)', isn't that just multiplying every element with a scalar, adding a scalar to every element, then calling an arbitrary function with the resulting matrix? Sounds pretty simple. I'm not really sure what a "non-linear activation function" means, though.

In general, Futhark is not built to exploit "very high-level" properties, like mathematical properties of matrix operations. A more restricted language focused on just matrix operations might be able to do better. However, Futhark is supposed (and able) to perform a lot of aggressive transformations based on more immediate dataflow properties, which will probably also help much matrix-heavy code.

\ Troels /\ Henriksen

mrakgr commented 8 years ago

That it does not do block tiling is pretty bad news for me. Having an efficient (though not necessarily 100%) matrix multiply would be essential for all neural net uses.

In general, one way of doing an efficient convolutional operation is via banded matrix multiplication. The way to do a convolutional operation using a matrix multiply is to transform the input matrix and the weight matrix, arranging and duplicating the columns of the input matrix so that the matrix multiply corresponds to the convolutional operation.
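
Roughly, the lowering looks like this (a minimal NumPy sketch of the idea, commonly called im2col, for a single-channel 2D input with stride 1 and no padding; the function name and sizes are just for illustration):

import numpy as np

def im2col(x, kh, kw):
    # Gather every kh-by-kw patch of x into a column, so the convolution
    # (really a cross-correlation, as is usual in ML) becomes a matrix multiply.
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6)).astype(np.float32)    # input
k = rng.standard_normal((3, 3)).astype(np.float32)    # filter

out = (k.ravel() @ im2col(x, 3, 3)).reshape(4, 4)     # (1,9) x (9,16) -> (4,4) output

# Cross-check against a direct sliding-window computation.
ref = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(4)] for i in range(4)])
assert np.allclose(out, ref, atol=1e-4)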

Furthermore, for large filters there are FFT convolutions (that have a matrix multiply in the middle) and for small filters there exist fast Winograd algorithms, but all of these would require using shared and local memory for efficiency.

The Nervana guys went all the way and actually optimized the Winograd algorithms in SASS assembly.

As for efficient splits and concats, the first use I can think of, besides optimizing the LSTM cell, would be optimized convolutional operations, though I am not sure whether splits and concats would be specifically necessary for convolutional operations... or even for the LSTM, for that matter. I think the activation could be done using imperative tricks, as it does not require any shared memory.

In the 'o(Ax+b)' equation, A is a weight matrix, x is an input matrix (though it can be a vector), and b is a vector that gets added to every column in a broadcasting fashion. An example of a non-linear activation would be the sigmoid function f(x) = 1/(1+exp(-x)), or tanh, f(x) = tanh(x). The above is so commonly used that I did not see the need to go into much detail.
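
In NumPy terms (a small sketch, with arbitrary sizes), the layer is just:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, o, n = 4, 3, 5                       # output dim, input dim, batch size
A = rng.standard_normal((m, o))         # weight matrix
x = rng.standard_normal((o, n))         # input: one column per example
b = rng.standard_normal((m, 1))         # bias, broadcast across the n columns

y = sigmoid(A @ x + b)                  # o(Ax + b), shape (m, n)
assert y.shape == (m, n) and np.all((y > 0) & (y < 1))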

I am very much interested in those other dataflow operations, as they could be used to build a really great ML library with much more ease than by using plain CUDA. Moreover, once it becomes possible, being able to fuse o(Ax+b) into one operation would be something unseen in any other library. I think it would be worth working towards it, bit by bit.

In the short term, a person's programming skill depends on his aptitude and experience.

In the long term, it depends on the tools he uses. When humans hit their limit, they wander around and build machines and abstractions. A person's rate of growth also mostly depends on other people, because no matter how brilliant one is, it is impossible to match the effort of thousands of others. In the end, inspiration comes down to exploration.

I think that currently in machine learning there is a large misallocation of resources. I asked you in another post about the number of LOC in Futhark; you answered around 32k.

For comparison, just the Python wrapper for Tensorflow has ~60k LOC.

Ultimately, all those big DL libraries should be built on top of an optimizing language much like Futhark for efficiency, but so far I do not think this thought has occurred to the people in the ML community. As for Google and Facebook, they can just pay hundreds of engineers to work on their DL libraries full time rather than on PL research. Nvidia especially, but also Google, Facebook and Microsoft, dropped the ball here. Google is currently content to ape Nvidia and is working on an open-source CUDA compiler. Theano and TensorFlow already do some optimization at the codegen level, but they do it in a halfhearted manner.

What I will do here, is make a simple benchmark for a single LSTM cell forward pass in Futhark and using standard OpenCL functions.

It will serve as a measure of Futhark's suitability towards NN tasks and it will be worth revisiting every six months or so, just to see how far the language has proceeded.

mrakgr commented 8 years ago

Ok, here is the forward pass. It took me an unusually long time to make, and I even got dumbfounded for a while due to having to constantly shuffle array dimensions around in my head. I took a look at the Rodinia benchmark a few days ago, but I could not understand the code, as there weren't any clear delineations between the operations.

Here it is, the LSTM forward pass: (Edit: With the correct shape annotations this time hopefully.)

-- Standard matrix matrix multiplication
-- R - regular matrix dimensions
-- F - Futhark matrix dimensions
-- F(o,m) = R(m,o)
-- F(n,o) = R(o,n)
-- R(m,o) * R(o,n) = R(m,n) = F(n,m)
fun [[f32,n],m] matmult([[f32,o],m] a, [[f32,n],o] b) =
    let res = replicate(m, replicate(n,0f32)) in
    loop (res) = for i < m do
        loop (res) = for j < n do
            let partsum =
                let res = 0f32 in
                loop (res) = for k < o do
                    let res = res + a[i,k] * b[k,j]
                    in  res
                in res
            in let res[i,j] = partsum in res
        in res
    in res

-- Adds two matrices
fun *[[f32]] add([[f32]] x, [[f32]] y) =
    zipWith(fn [f32] ([f32] x, [f32] y) =>
                zipWith(fn f32 (f32 x, f32 y) => x + y
                ,x,y)
            ,x,y)

-- Elementwise multiplies two matrices
fun *[[f32]] hadmult([[f32]] x, [[f32]] y) =
    zipWith(fn [f32] ([f32] x, [f32] y) =>
                zipWith(fn f32 (f32 x, f32 y) => x * y
                ,x,y)
            ,x,y)

-- Broadcast addition
fun [[f32,m],n] broadcast_add([[f32,m],n] a, [f32,n] b) =
    let b = transpose(replicate(m,b)) in
    add(a,b)

-- Ax + b
-- The standard feedforward neural net linear layer
fun [[f32,n],m] linear_layer([[f32,o],m] A, [[f32,n],o] x, [f32,m] b) =
    broadcast_add(matmult(A, x), b)

-- Ax + Bh + b
-- The recurrent neural net linear layer
fun [[f32,n],m] linear_layer_2([[f32,o],m] A, [[f32,n],o] x, [[f32,m],m] B, [[f32,n],m] h, [f32,m] b) =
    broadcast_add(add(matmult(A, x), matmult(B,h)), b)  

fun f32 exp(f32 a) =
    let e = 2.71828f32 in
    e**a -- Futhark needs more math functions. e ~= 2.71828.

fun f32 sigmoid(f32 a) = 1.0f32 / (1.0f32 + exp(-a))

-- The sigmoid activation function
-- All output elements should be in the [0,1] range
fun [[f32]] sigmoid_activation([[f32]] a) =
    map(fn [f32] ([f32] a) =>
            map (sigmoid,a)
        ,a)

-- Feedforward Neural net layer with a sigmoid activation
fun [[f32,n],m] sigmoid_layer([[f32,o],m] A, [[f32,n],o] x, [f32,m] b) =
    sigmoid_activation(linear_layer(A,x,b))

-- Recurrent neural net layer with a sigmoid activation
fun [[f32,n],m] sigmoid_recurrent_layer([[f32,o],m] A, [[f32,n],o] x, [[f32,m],m] B, [[f32,n],m] h, [f32,m] b) =
    sigmoid_activation(linear_layer_2(A,x,B,h,b))

-- The LSTM cell/layer forward pass. Outputs a (output, cell) tuple.
-- This is the standard LSTM formula, without peepholes.
-- At the time of writing (5/9/2016), LSTM is the most widely used
-- recurrent network architecture and has been for the past decade or so.
fun ([[f32,n],m], [[f32,n],m]) 
    lstm_cell_forward([[f32,o],m] W_bi, [[f32,m],m] U_bi, [f32,m] b_bi, 
                      [[f32,o],m] W_ig, [[f32,m],m] U_ig, [f32,m] b_ig,
                      [[f32,o],m] W_fg, [[f32,m],m] U_fg, [f32,m] b_fg,
                      [[f32,o],m] W_og, [[f32,m],m] U_og, [f32,m] b_og,
                      [[f32,n],o] input, 
                      [[f32,n],m] prev_output,
                      [[f32,n],m] prev_cell ) = 
    let block_input = sigmoid_recurrent_layer(W_bi,input,U_bi,prev_output,b_bi) -- In a real LSTM, this activation function is generally tanh, but can be sigmoid.
    let input_gate = sigmoid_recurrent_layer(W_ig,input,U_ig,prev_output,b_ig) -- The activations for the gates are always sigmoid
    let forget_gate = sigmoid_recurrent_layer(W_fg,input,U_fg,prev_output,b_fg)
    let output_gate = sigmoid_recurrent_layer(W_og,input,U_og,prev_output,b_og)
    let cell_state = add(hadmult(block_input,input_gate),hadmult(prev_cell,forget_gate)) -- As the backpropagation step here is linear, gradients flowing through the cell can travel far further back than in vanilla RNNs. As a consequence, LSTMs can learn sequences with far longer time lags.
    let output = hadmult(output_gate, sigmoid_activation(cell_state)) -- Here the activation can also be other than sigmoid or even left out entirely.
    in (output, cell_state)

fun ([[f32,n],m], [[f32,n],m]) 
                 main([[f32,o],m] W_bi, [[f32,m],m] U_bi, [f32,m] b_bi, 
                      [[f32,o],m] W_ig, [[f32,m],m] U_ig, [f32,m] b_ig,
                      [[f32,o],m] W_fg, [[f32,m],m] U_fg, [f32,m] b_fg,
                      [[f32,o],m] W_og, [[f32,m],m] U_og, [f32,m] b_og,
                      [[f32,n],o] input, 
                      [[f32,n],m] prev_output,
                      [[f32,n],m] prev_cell ) = 
    lstm_cell_forward
        (W_bi,U_bi,b_bi,
         W_ig,U_ig,b_ig,
         W_fg,U_fg,b_fg,
         W_og,U_og,b_og,
         input,
         prev_output,
         prev_cell)

Futhark could use more math functions; I had to write in the e constant manually.

Starting from the top, I adapted the imperative version of the matrix multiply to use variable dimensions, made the matrix add, made the broadcast add, built the linear layers on top of them, and then the non-linear layers. From those non-linear layers, I built the LSTM cell forward pass. The backward pass would be similar, though last year when I was starting out I actually failed to write it by hand, so this time I'd rather have a program reverse the forward pass and generate the code for it automatically.

The new matrix multiply might be good enough to replace the imperative example in the documentation, as it is pretty much the same, but has extra functionality and showcases how array dimension annotations can be used.

Note that, as an optimization, I could pass in the weights and biases in concatenated form and split them after the multiply, but I will be mean and let Futhark itself optimize the loads.

It might not seem like it, but the above is actually all one needs to make a neural net, and the most important and complex part is the matrix multiply. Currently, Futhark lacks the pieces to make that part good, but unlike the static libraries, it actually has the pieces to make everything else work well.

The things I have left:

Options:

So I have a lot of options here. There are some basic Python Cuda options as well, like Cudamat, which does not do any delayed evaluation and emulates a subset of Numpy's functionality on the GPU, but I'd rather skip that one, as it is old.

I'll admit that until I did the research today, I was not aware of any of the options above apart from ArrayFire. In terms of great GPU libraries, C++ is impossible to beat in this day and age. To do a sideways comparison with Futhark, I'll have to pick one of the four above.

Do you have any preferences or suggestions? If not, I'll just pick whatever catches my eye tomorrow.

concieggs commented 8 years ago

Wow, that's a lot of effort you're putting into this. I promise I'll try to put in an equal amount from my side!

It may be delayed a few days, however, as both my advisor and I are travelling from Tuesday to Thursday to give talks and talk to interesting people. I'm not sure how much time I will have for compiler hacking.

\ Troels /\ Henriksen

athas commented 8 years ago

Are you sure that linear_layer_2 has the correct shape annotations? From what I can see, the two matrices you pass to add will only have compatible sizes if n=m.

mrakgr commented 8 years ago

You are right; [[f32,n],n] B, [[f32,m],n] h is pretty much the same thing that dumbfounded me yesterday, and I see I got it wrong. Damn.

I have no idea what is going on in the above. Basically matmult(B,h) has to have the same dimensions as matmult(A,x). B has to be a square matrix and h itself also has to have the same dimensions as matmult(A,x) as it is the output from the previous timestep.

So all the matrix multiplications have to be of [[f32,m],n].

If I make B of type [[f32,m],m], which would align the first dimension correctly, then h can only be of type [[f32,n],m], which is wrong. It is also impossible that prev_output needs to be transposed or anything like that.

It would be nice if I could insert some print statements into the program. This is such a strange logic error on my side. I have no idea how to debug this.

Edit: Try 2.

-- R - regular matrix dimensions
-- F - Futhark matrix dimensions
-- F(m,n) = R(n,m)
-- F(o,m) = R(m,o)
-- F(n,o) = R(o,n)
-- R(m,o) * R(o,n) = R(m,n) = F(n,m)
fun [[f32,m],n] matmult([[f32,o],m] a, [[f32,n],o] b) =

So based on the above, the logic error is in the matrix multiply. It should be this instead.

fun [[f32,n],m] matmult([[f32,o],m] a, [[f32,n],o] b) =
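
As a sanity check, here is a small NumPy sketch (with arbitrary sizes) mirroring the regular, row-by-column dimensions from the comment block above, showing that the shapes in linear_layer_2 line up with the corrected annotations:

import numpy as np

m, n, o = 3, 4, 5
A = np.ones((m, o))                     # [[f32,o],m]
x = np.ones((o, n))                     # [[f32,n],o]
B = np.ones((m, m))                     # [[f32,m],m]
h = np.ones((m, n))                     # [[f32,n],m]
b = np.ones((m,))                       # [f32,m]

out = A @ x + B @ h + b[:, None]        # broadcast b across the n columns
assert out.shape == (m, n)              # i.e. [[f32,n],m], matching matmult's result type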

Edit2: Actually, this raises the question of how good Futhark's shape annotation checker is, if it did not catch the above. Is this potentially a bug in the shape checker?

A different bug from the numpy out-of-bounds issue, I mean.

mrakgr commented 8 years ago

By the way, I've finished the ArrayFire example a while ago.

#include <iostream>
#include <chrono>
#include <arrayfire.h>
#include <af/util.h>
typedef std::chrono::high_resolution_clock Clock;

using namespace af;

inline auto sigmoid_recurrent_layer(array W, array x, array U, array h, array b) {
    return sigmoid(matmul(W, x) + matmul(U, h) + tile(b,1,x.dims()[1]));
}

std::pair<array, array> lstm_cell_forward
(array W_bi, array U_bi, array b_bi,
    array W_ig, array U_ig, array b_ig,
    array W_fg, array U_fg, array b_fg,
    array W_og, array U_og, array b_og,
    array input,
    array prev_output,
    array prev_cell
    ) 
{
    auto block_input = sigmoid_recurrent_layer(W_bi, input, U_bi, prev_output, b_bi);
    auto input_gate = sigmoid_recurrent_layer(W_ig, input, U_ig, prev_output, b_ig);
    auto forget_gate = sigmoid_recurrent_layer(W_fg, input, U_fg, prev_output, b_fg);
    auto output_gate = sigmoid_recurrent_layer(W_og, input, U_og, prev_output, b_og);
    auto cell_state = block_input * input_gate + prev_cell * forget_gate;
    auto output = output_gate * sigmoid(cell_state);
    return std::make_pair(output, cell_state);
}

int main()
{
    try {
        // Select a device and display arrayfire info
        af::setDevice(0);
        af::info();

        // Initializes the weights and the inputs randomly.
        int input_size = 784;
        int hidden_size = 256;
        int batch_size = 64;
        array W_bi = randu(hidden_size,input_size);
        array U_bi = randu(hidden_size, hidden_size);
        array b_bi = randu(hidden_size);
        array W_ig = randu(hidden_size, input_size);
        array U_ig = randu(hidden_size, hidden_size);
        array b_ig = randu(hidden_size);
        array W_fg = randu(hidden_size, input_size);
        array U_fg = randu(hidden_size, hidden_size);
        array b_fg = randu(hidden_size);
        array W_og = randu(hidden_size, input_size);
        array U_og = randu(hidden_size, hidden_size);
        array b_og = randu(hidden_size);
        array input = randu(input_size,batch_size);
        array prev_output = randu(hidden_size,batch_size);
        array prev_cell = randu(hidden_size, batch_size);

        auto r =
            lstm_cell_forward(W_bi, U_bi, b_bi,
                W_ig, U_ig, b_ig,
                W_fg, U_fg, b_fg,
                W_og, U_og, b_og,
                input,
                prev_output,
                prev_cell);

        auto t1 = Clock::now();
        for (int i = 0; i < 1000; i++) {
            auto r =
                lstm_cell_forward(W_bi, U_bi, b_bi,
                    W_ig, U_ig, b_ig,
                    W_fg, U_fg, b_fg,
                    W_og, U_og, b_og,
                    input,
                    prev_output,
                    prev_cell);
        }

        auto t2 = Clock::now();
        std::cout << "Delta t2-t1: "
            << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
            << " miliseconds" << std::endl;

    }
    catch (af::exception& e) {

        fprintf(stderr, "%s\n", e.what());
        throw;
    }

    return 0;
}

Running the above example takes around 0.6s on my machine. I am also done with the Python module.

import numpy as np
import lstm
import time

c = lstm.lstm()

# initializes the weights and the inputs randomly
input_size = 5
hidden_size = 32
batch_size = 20

W_bi = np.random.random((hidden_size, input_size)).astype(np.float32)
U_bi = np.random.random((hidden_size, hidden_size)).astype(np.float32)
b_bi = np.random.random((hidden_size,1)).astype(np.float32) # Remove the 1 from the 4 biases to trigger the code generation bug.
W_ig = np.random.random((hidden_size, input_size)).astype(np.float32)
U_ig = np.random.random((hidden_size, hidden_size)).astype(np.float32)
b_ig = np.random.random((hidden_size,1)).astype(np.float32)
W_fg = np.random.random((hidden_size, input_size)).astype(np.float32)
U_fg = np.random.random((hidden_size, hidden_size)).astype(np.float32)
b_fg = np.random.random((hidden_size,1)).astype(np.float32)
W_og = np.random.random((hidden_size, input_size)).astype(np.float32)
U_og = np.random.random((hidden_size, hidden_size)).astype(np.float32)
b_og = np.random.random((hidden_size,1)).astype(np.float32)
input = np.random.random((input_size,batch_size)).astype(np.float32)
prev_output = np.random.random((hidden_size,batch_size)).astype(np.float32)
prev_cell = np.random.random((hidden_size, batch_size)).astype(np.float32)

assert (np.dot(W_bi,input).shape == prev_output.shape)
assert (np.dot(U_bi,prev_output).shape == prev_output.shape)

now = time.time()
output = c.main(
        W_bi,U_bi,b_bi,
        W_ig,U_ig,b_ig,
        W_fg,U_fg,b_fg,
        W_og,U_og,b_og,
        input,
        prev_output,
        prev_cell)
end = time.time()

print output
print "Time elapsed: %f" % (end-now)

With the above parameters it takes around 15s to run a single iteration on my machine.

This is only a rough estimate, but it seems that currently Futhark is over 100,000 times slower than the ArrayFire example. There is a lot of work to do before it is ready for prime time, it seems.

I did say that humans level up using tools and abstraction, so I guess at this point in my development I have no choice but to start learning Haskell. I am at a dead end with my ML library until something like the complete Futhark comes along and gives me a hand. Until then I will just be using TensorFlow. Hopefully, it will come out on Windows soon.

Besides ArrayFire, I did look at Eigen, Boost::Compute and Mshadow as candidates. Here is a brief review of the three.

Really, out of all the libraries, only ArrayFire stands out after all.

That is about it for now. Have fun on your trip.

mrakgr commented 8 years ago

Good news. I managed to speed up the Futhark example by a factor of 10,000 by replacing the imperative matrix multiply with a functional one. Here is the new one:

fun [[f32,n],m] matmult([[f32,o],m] a, [[f32,n],o] b) =
    let b = transpose(b) in
    map(fn [f32,n] ([f32,o] a) =>
            map(fn f32 ([f32,o] b) =>
                reduce(+,0f32,zipWith(*,a,b))
            ,b)
        ,a)

The above version is much nicer than the functional version you linked to me. It should be worth putting into the documentation. To be honest, I was not sure what all those replications were doing, so I just tried removing them, and to my surprise it worked.

Right now, Futhark does 100 iterations in the time it takes ArrayFire to do 1000, so it is only 10x slower. This could be attributed to not using shared-memory caching, as you pointed out.

Also, in the documentation you should put a warning not to use the imperative version. Rather, put it side by side with the functional one and write that the imperative version is an example of what not to do; otherwise, like me, the reader might assume that Futhark is capable of optimizing the imperative version.

athas commented 8 years ago

Good catch! I was actually going to write you about this. When compiling with futhark-opencl, all arrays are always kept on the GPU, even when executing sequential loops. A GPU round-trip for every scalar load and store leads of course to a ridiculous overhead. A factor of 10000x does not at all sound far-fetched.

The parallel matrix multiply was translated by my advisor from a Repa implementation. You are right that it is silly and not idiomatic Futhark. It mostly serves as a fusion/simplification test case. I will replace it with a more reasonable implementation.

Where in the documentation do you find references to matrix multiplication? I can fix them too.

Also, could you give me the exact data sets you use to run the Futhark implementation? I'll try to see if there is any low-hanging fruit.

mrakgr commented 8 years ago

The dataset is just numpy.random.random. I have not bothered to test for correctness by comparing the ArrayFire version with Futhark, only speed. I should probably do that, but I forgot about it once I saw how slow the imperative version is. I also found a bug in code generation.

The imperative matrix multiply is in the documentation near the bottom.

mrakgr commented 8 years ago

I am thinking about whether to make the inputs predefined, and am leaning towards skipping it. At this point I am not checking the outputs anyway, just the speed. If Futhark needs a correctness test for the LSTM, it would make more sense to make a Haskell version and compare against it inside the test suite. So I will skip it.

With that, I think I have everything in this issue page that I wanted. Until Futhark learns to use shared memory for caching there is no point in using it for machine learning. The most current example with a loop is in my fork.

Yesterday I was thinking about how I could contribute more substantially to the Futhark project, and from the perspective of internalizing the thing, there would be no better way than to do an F# port, just like I did for the GVGAI and DiffSharp libraries, through which I learned how to make Atari-style games and automatic differentiation, respectively. Originally, I meant to learn Haskell and play with the source code, but reading the Haskell-to-Scheme book, I am finding that I do not have much intrinsic motivation to learn Haskell apart from needing it to understand Futhark's source. This is a problem. I do like F# most of all, it seems.

That having been said, the difference between GVGAI and DiffSharp is that Futhark is in a league of its own when it comes to size, so I can't really imagine myself going all the way, as it would take me a quarter of a year of full-time work. Probably twice that. But I will give it a shot for a while, as I feel it is my duty to make GPU programming easier. Maybe I'll find some bugs along the way.

At any rate, I'll keep myself abreast of this project. I'll reopen this issue and try running the tests next year.

P.S. One thing I want to say is that the imperative matrix multiply in the documentation really needs a note that it is very slow. And a link to the preferred functional version should be close by as well.

athas commented 8 years ago

Your contributions are very welcome. Don't hesitate to ask if you need help understanding the code base. Writing an F# backend will not require you to understand much about how the compiler works, and will only require basic Haskell.

I agree that efficient matrix multiplication is necessary for Futhark to be useful for machine learning applications. While I am focused on other improvements for the next few months, my advisor and co-developer is working on techniques that may prove useful. One problem is to come up with an optimisation that is robust enough to apply in multiple cases. I don't particularly fancy an optimisation that applies only to one specific function written in a specific style. If that's all we can come up with, perhaps the FFI approach is better.

\ Troels /\ Henriksen

mrakgr commented 8 years ago

It is good to know I am not bothering you. To be honest, my lofty goal here is to figure out the Futhark compiler completely, rather than just to do a backend, and that will take a while. Not only that, but I will have to start working on RL more seriously, as for the past year I've been indulging myself in writing things like ML libraries and learning things that might be useful in general. A few days ago, I did a brief skim of the Futhark source just to figure out whether I could do anything with it, and I concluded that my Haskell skills were below the minimum necessary to undertake this project.

So, for the past few days I've been going through the NLP section on HackerRank using Haskell.

It has been slow but steady going. You might imagine that it would go faster given that I am quite good at F#, but at best I am maybe a quarter as efficient as I am in F#. A lot of the things that I would do effortlessly in F# can take me half an hour or more in Haskell. I'll get better with time, I am sure.

I'll get back to you when I am ready to do something.

mrakgr commented 8 years ago

No, actually there is one thing that I want to ask.

What sort of editor are you using to edit Futhark source? The Atom editor just crashes on the Futhark source, maybe due to the size of the thing, but it works fine for the simple HackerRank problems I've been doing.

athas commented 8 years ago

I use Emacs with haskell-mode, as do most of the other hackers. It works well enough. My advisor uses gedit. For figuring out how the code fits together, I can also recommend the Haddock docs: http://futhark-lang.org/haddock/. Of course, not everything is as well documented as it should be, but at least you can easily navigate the code.

Now, there are many bad ways to set up Emacs and haskell-mode, and Emacs is itself a pretty baroque UI by modern standards. This is my own haskell-mode setup:

https://github.com/Athas/Configuration-Files/blob/master/emacs/elisp/my-modes.el#L120

(The 'with-features' form is a convenience macro of my own, which you wouldn't want to use.)

I think the installation guide on the haskell-mode Github repository is sensible: https://github.com/haskell/haskell-mode#quick-installation

\ Troels /\ Henriksen

athas commented 8 years ago

There is also this new thing: https://commercialhaskell.github.io/intero/

I have not tried it yet.

mrakgr commented 8 years ago

Yeah, it might be worth trying when I get back to work on Futhark if the ghc-mod bug still has not been fixed by then. Thanks.

Somehow or other, I fell knee-deep into Linux networking, and I haven't even been able to complete my Haskell study, though I was very close to the end a week ago. After that I need to go into reinforcement learning, which is really important to me. I've delayed it for over a year. It is a pity that after all the effort I put into learning F# and Haskell, I am going to have to do my programming in Python, because all the big deep learning libraries are in it. :(

I've been starting to warm up to Haskell too, after a rocky start - that lambda-stacking fold technique is genuinely impressive. At first I thought it was spew, but after thinking about it, I now see it as the right way to do tail recursion. Had I had it, and had it not been slow in F#, I would have used it instead of all those tail-recursive loops I had written.

I've tried the lambda-stacking fold in F#, and unlike in Haskell it is unfortunately not optimized, which is exactly what I expected. For that reason I never even thought of the technique in the first place, but it is in learning such techniques that the biggest improvements in one's programming skill lie.

I have no idea when I am to find time to resume work on Futhark - even July seems unlikely now - though I do think I have the programming chops to start the project now. I'll find time eventually, I promise.

I do not want to keep programming in Python forever, at any rate. Python lucked out in this deep learning wave. For sure, it is dog slow and unsafe, but as for the slowness, even for reinforcement learning, which needs to simulate games, that is not a problem, because all the time will be taken up by running the neural nets. So that just leaves it being unsafe. It is a pity that all the features my ideal language would have are sharded across different languages.

athas commented 8 years ago

I have done some more work on loop-tiling-related optimisations (mostly related to unrolling), which gave another 2x speedup on matrix multiplication (roughly). The next step will be register tiling of the output, but that is somewhat more tricky, as it involves sacrificing some parallelism, and hence is only a win on sufficiently large matrices.

mrakgr commented 8 years ago

Ok. When you feel the matrix multiply optimizations are ready, give me a holler and I'll take a look.

athas commented 8 years ago

Well, you can try out your old benchmark and see if it performs closer to your expectations. You'll need to do some rewriting as the language syntax has changed. We still have no chance of beating hand-written primitives written by experts, so don't get your hopes up too much - it's much better than it used to be, but there is still room for improvement.