diku-dk / futhark

:boom::computer::boom: A data-parallel functional programming language
http://futhark-lang.org
ISC License

Documentation review and a few questions #162

Closed mrakgr closed 8 years ago

mrakgr commented 8 years ago

Before I dive into the language proper, I decided to read the documentation and a few papers. Here are the issues I caught from top to bottom in the documentation, and some questions I am interested in:

1: A while loop with a return type other than unit is quite a new language construct for me. I am not sure about Haskell, but Scala and F# give while loops the return type unit. I found this confusing until it was explained two chapters later that the loop in fact binds the result variable anew on each iteration. This probably deserves a sentence or two, but what made it confusing was the following issue.

2: It is not explained what happens in loop (pat = initial) when the = initial part is left out. I am guessing the for loop consumes the preceding variables. What makes this confusing is that in some examples there is neither consumption (the variable is declared for the first time in the loop pattern) nor initialization (the = initial is left out).

I was left wondering what is going on in the matmult example thanks to that.

Given that the while loop was explained in the language reference, I expected to see further mention of this, but could not find it. I am guessing that they are initialized to the default value. But this should be in the documentation so I do not have to guess.

3: "has been given “ownership” of the array a, meaning that not caller of modify will reference array a after the call." "meaning that not caller" "not caller"

This sentence in the uniqueness types section is a bit grammatically iffy.

"meaning that the caller of modify will not reference array a after the call."

This rewrite seems a bit better.

4: "Sharing Analysis will cover sharing and sharing analysis in greater detail."

I am viewing the documentation on Chrome and the formatting is messed up at this point.

5: "This is useful if the compiler is otherwise unable to avoids bounds checks" "avoids bounds"

The avoids should be avoid.

6: "These defaults can be changed using the Hakell-inspired default keyword." "Hakell-inspired"

Haskell is misspelled in the above sentence in the language reference.

7: In the language reference the two loop entries have the formatting messed up. Connected to #2, the two entries do not explain what happens in case the = initial is left out.

8: Looking at the built-in functions in the language reference, I think Futhark has a pretty good sense of what the user might need. It even has functions that my favorite language does not have, and that actually seem like a good idea to have. If I could transform data on the GPU as easily in F# as I can in Futhark, I would be able to complete the AD library that I am developing and do reinforcement learning properly.

The only thing that stands out in a negative way is the lack of sorting functions of any kind.

I personally do not need them, but they always become necessary eventually, and they are pretty hard to procure if one is just passing Cuda code as text strings to NVRTC like I am doing currently. I do not have access to Cuda Unbound or Thrust like I would if I were using C++. They are among the most important computer algorithms.

Adding a sorting function should be a high priority issue.

9: Does iota return only i32 arrays? Would changing the default to i64 change the type of the iota function?

10: Er, I just noticed that it is iota and not itoa. I thought the function name stood for 'integer to array', so what is up with that?

11: In F#, the main library functions like map and filter are poorly optimized, and rewriting the critical code segments in an imperative fashion can provide significant speedups. In that sense, Futhark is quite refreshing in that combining basic functional programming constructs is actually the preferred style. I am wondering whether some tasks are currently very difficult for the Futhark compiler to handle. The sort functions perhaps? Atomics?
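By the way, the kind of composition I mean by "combining basic constructs" is something like the following rough sketch (my first attempt at Futhark syntax, so the details may well be off):

-- Two separate maps which I would expect the compiler to fuse
-- into a single pass over the array.
def shift_then_square (xs: []f32): []f32 =
  map (\x -> x * x) (map (\x -> x + 1f32) xs)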

A GPU language like Futhark that is optimization-focused strikes me as a different beast from what I have been programming up to now, so I would guess that as a user I would need to be more familiar than usual with the sort of optimizations it is doing.

12: In connection with the last paragraph, I sized up Futhark and found that it is 1.5Mb. I wonder how many lines of code it is now, as a proxy for language size?

For comparison, I've read that F# has 70k-100k lines, and I see that its src directory is 16Mb. Though unlike Futhark, it has a lot of other stuff in the source directory, so I am not sure. Given its size in Mb, I think 100k LOC seems too low an estimate actually; 400k would be more likely.

13: I am wondering why you decided to do the first version in OpenCL instead of Cuda? In terms of market share, I read somewhere that Cuda has 75% of the market. In the machine learning field, I would guess that this figure is even more skewed in favor of Cuda.

14: Going by the latest paper it seems that Futhark is already capable of generating Cuda code. What is the status of the Cuda backend?

15: On Reddit, I remember @Athas mentioning that Futhark does not do dynamic parallelism, though I think in OpenCL that might be called nested parallelism. If so, I wonder if Futhark could be made to generate code compatible with NVRTC?

For now, the dynamic parallelism question is moot as the current version of NVRTC does not support it.

16: The cuDNN library is unimpressive, with the exception of the convolutional forward and backward functions. They are similar to the Gemm routine in the cuBLAS library: what makes those functions special is that they were hand-optimized at the SASS level, making them capable of reaching speeds impossible for the C/C++ compiler. No matter how good Futhark is, it is not going to beat those tuned library functions. How is the FFI coming along?

17: In connection to the kind of optimizations Futhark is doing, there is plenty about fusion rules in the publications, but I've yet to see any mention of its overall memory handling strategy.

In Cuda programming, doing it dynamically (allocating memory when you need it and disposing of it afterwards) is actually a poor strategy with significant performance penalties; in addition, memory allocation functions block the device. In the AD library I mentioned, I am using an object pool, which allows me to reuse memory. The technique is quite simple, but very effective. Interfacing with other languages will require some sort of memory handling scheme. I wonder what the plans are for that?

This question is quite important as I know the Cuda backend and the FFI will get done eventually, but memory management is something that will need to be specified before it can be dealt with.

Making the memory handling specifications should be one of the highest priority issues.

18: The last one.

Regarding the REPL, I wonder how that would even work for a language like Futhark? Would it require that code be executed on the CPU and the global optimizations be turned off? Admittedly, I do not know much about REPLs. Though the tag does say newbie-friendly, it actually seems quite challenging, but maybe I am missing something?


That is all that came to mind as I was brushing up on background material on Futhark.

I'll play a bit with the examples and make something like an LSTM cell in Futhark for my first project. It seems fitting, as the post on optimizing RNNs on Nvidia's blog was the spark that made me want to study Futhark in the first place.

Take them one at a time.

(Edit: Slight grammar fix.)

athas commented 8 years ago

Wow, that's a lot of questions! I will try to answer all of them (and fix the issues you report), but it is likely that I will end up skimping on some. Please don't hesitate to ask for further clarification.

You're probably also the first person to really dig into our documentation, so thank you for that!

As to your points:

1: Futhark is a pure functional language, so a construct returning unit is meaningless. You are probably correct that we need to spend more time describing the for/while loops, as they are quite unlike anything most people have seen. I remember it also took me a while to get them when my advisor explained how he wanted them to work.

2: If you write loop (pat), then that means the same as loop (pat = pat) - that is, the loop will take initial values from similarly-named variables in the environment (resulting in name shadowing). It's just a little syntactic shortcut that makes it easier to write imperative-looking code.
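To make that concrete, here is a small sketch of both forms (the exact surface syntax varies between compiler versions, so treat this as illustrative rather than copy-paste ready):

-- The two loops below are equivalent: the second omits "= initial" and
-- therefore takes its starting value from the 'acc' already in scope.
def sum_explicit (n: i32): i32 =
  let acc: i32 = 0
  in loop acc = acc for i < n do acc + i

def sum_shorthand (n: i32): i32 =
  let acc: i32 = 0
  in loop acc for i < n do acc + i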

I agree that we should improve the documentation here. I will do that.

3: This is a typo - the "not caller" is supposed to be "no caller". I like your rewrite better, though.

4: Markup bug; fixed.

5: Thanks, fixed.

6: Thanks, fixed.

7: Also fixed, and I added a mention of implicit merge pattern initialisation.

8: Which functions are those? Futhark doesn't really have a standard library at the moment, which is also why things like sorting are not present. We also need to add some form of parametric polymorphism (probably inspired by SML functors) to Futhark in order to permit the writing of things like generic sorting routines. It is possible to write some sorting algorithms. For example, here is an implementation of radix sort: https://github.com/HIPERFIT/futhark/blob/master/data/tests/write/radix_sort/radix_sort.fut

However, I must admit that Futhark is still early in its design phase, and many things that "feel" simple are not really easy to implement. For example, you cannot easily implement an (efficient) parallel quicksort.

9: There should probably be a way to make iota return arrays of any type. At the moment, it always returns i32, but there's no deep reason for this. You can just map the result to whichever type you need, though.
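For example, the conversion is a one-liner. (A sketch; which integer width iota uses, and hence which conversion function you need, depends on the compiler version.)

-- Produce n indices as f32 by mapping a conversion function over iota's result.
def indices_f32 (n: i64): []f32 =
  map f32.i64 (iota n)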

10: From what I know, "iota" is commonly used for creating integer ranges in array languages. It's been that way since at least APL, but I don't know if that's where it originated, or why that name was chosen. I like that the itoa misspelling has a working acronym too, though!

11: Atomics don't really make sense from a Futhark POV. The language is parallel, not concurrent, so everything is already "atomic", as far as you are able to observe, anyway. Futhark is a small language and it is not very strong compared to big functional languages. Irregular parallelism (such as graph algorithms) can be tricky, and will require manual rephrasing by the programmer. There are also some low-level imperative tricks that are not possible to perform in Futhark, because it hides the nature of the hardware.

It is also not terribly hard to fool the Futhark compiler into making bad choices. In my experience, efficient Futhark programming mostly relies on not trying to be too clever. The compiler will generally assume that your code is "common sequential code" (i.e. no strided array accesses and the like) and optimise the code based on those assumptions. In time, the Futhark compiler will get better at handling more complicated patterns and not getting confused by partially hand-optimised code.

12: The Futhark compiler, including all frontends, backends, tools, and some of the testing harness, comprises 32124 SLoC (not counting comments and blank lines). We try to keep it as simple as possible by restricting features and configurability, but it's still a heavily optimising compiler, and that implies a sizable code base.

13: CUDA is a proprietary platform. I like open standards. I've also come to the conclusion that OpenCL is a nicer code generation target. Furthermore, OpenCL can run on CPUs as well as GPUs, and Futhark generates OpenCL code that runs quite well on CPUs if you just specify a CPU device when executing it, which can be handy.

But you're right that CUDA is sadly dominant in the field. That's mostly because the tools are better and it is (much) nicer to code by hand. Those factors don't matter for Futhark, though.

14: In that paper we are comparing Futhark performance with CUDA code, but Futhark still generates just OpenCL. We do have a student working on a CUDA backend at the moment, but that is mostly for platforms that support only CUDA (like NVIDIA Tegra).

15: Futhark does support nested parallelism - it's one of our claims to fame! We just don't implement it with runtime support (like dynamic parallelism), but through clever code transformation and optimisation. The GPU kernels we generate are all perfectly flat.
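For example, a nested map like the following sketch is compiled into a single flat kernel (illustrative syntax; matadd is just a made-up name):

-- Nested data parallelism as a map of maps; the compiler flattens it,
-- so no dynamic parallelism is needed at runtime.
def matadd [n][m] (xss: [n][m]f32) (yss: [n][m]f32): [n][m]f32 =
  map2 (\xs ys -> map2 (+) xs ys) xss yss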

There may be some (irregular) nested data-parallel programs that would benefit from dynamic parallelism, but given how much added complexity it would require, I'll need to be fairly sure it's worth the cost.

16: At the moment, our "FFI" ambitions revolve around making Futhark callable from other languages, not the other way around. However, it should be fairly simple to add an FFI that permits Futhark to call native C or CUDA functions. You are right that there is probably no way compiler-generated code can beat that kind of hand-tuned primitive - but Futhark can optimise the composition of functions, and we hope that will be useful enough.

17: Our memory handling strategy is pretty simple. We distinguish between memory blocks and arrays, and keep the blocks alive for as long as possible (through hoisting), and reuse them for other arrays, so we don't have the frequent alloc/dealloc-pattern you mention. We don't yet reuse memory as well as I'd want, but we have students working on that. I can give more details if you're interested.

18: The REPL would only be for low-speed interpretation for program development. Maybe someone can come up with a cool way to compile the code on the fly, but I'd be happy enough just having a simpler way to debug Futhark programs.

The tag "newbie-friendly" means that creating such a REPL just involves understanding the (source) Futhark AST, which is fairly simple, writing an interpreter, and writing a REPL. It requires Haskell knowledge, sure, but it requires no knowledge of the overall compiler structure, optimisation, code generation, or anything like that.

Thank you for your interest! If you find yourself in need of language extensions, don't hesitate to ask! The current feature set of the language is entirely grown from analysing the needs of (mostly financial) benchmark applications.

\ Troels /\ Henriksen

mrakgr commented 8 years ago

I have to inquire a bit more about memory management.

Would it be possible to pass in a workspace buffer to the Futhark program and have it do zero global GPU memory allocation and deallocation? What I would like to have is the ability to do all allocation from the F# side and simply pass in a workspace buffer (like I do for certain cuDNN functions now.)

Reusing already allocated memory is purely a speed optimization, but a very effective one. When I first added memory reuse to my library, I got a 100-200% boost in performance right off the bat. This particular optimization is important to me and potentially to other ML library developers.

I think that even without the calls to Cuda/OpenCL libraries (and even without the above optimization), there is significant potential to make Futhark work well with recurrent neural nets, which are in fact exactly what I need right now. With the global optimizations that it does, I might be able to speed up my own library, which is already on par with the mainstream libraries, by 3-4x. I would have to completely scrap what I have now and make something like a neural net model compiler, which I am willing to do.

The ability to weave foreign library calls into Futhark code could increase that lead to 5-6x (the non-hand-tuned gemm can only get up to 75% of optimal performance), and the memory passing optimization described above would give me the flexibility to call Futhark code more often, because there would be no overhead from allocation.

I need the latter one specifically in RL, because I often have to call the net to give me its prediction for a particular move, and unlike with a static dataset, the next input depends on the current output, so there would be frequent back and forth between Futhark and F# code, and presumably frequent memory allocations on the Futhark side without the workspace passing feature.

The next two paragraphs are just me talking and are not related to the question.

What interests me in Futhark is not just the potential boost to recurrent nets, but the more general gain to my programming ability. For example, for experience replay, I would need a function that grabs random inputs from a database and merges them together into one matrix. And some of those inputs might be of different lengths, so I would need to take care of that as well.

This is a common bit of data manipulation that would not be difficult to do in F# on the CPU side, but to do it with Cuda on the GPU I would have to write custom Cuda kernels just for that, which would be a nightmare. And I do not want to go back and forth between the GPU and CPU just so I can do it the easy way. It is not just that; there would be other seemingly trivial programming jobs in the future that would needlessly eat up my time with Cuda, and I want to avoid those at all cost.
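To make that concrete, the gather step I have in mind would be something like this rough Futhark sketch (hypothetical names, and assuming equal-length rows for simplicity):

-- Gather the database rows at the sampled indices into one matrix.
def gather_rows [n][m][k] (db: [n][m]f32) (idxs: [k]i64): [k][m]f32 =
  map (\i -> db[i]) idxs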

athas commented 8 years ago

It is not yet feasible to write zero-allocation Futhark programs. For various reasons it may not ever be feasible (but it might come close). Some important points:

\ Troels /\ Henriksen

mrakgr commented 8 years ago

That last point sounds great.

But knowing the exact amount of memory necessary is not required; it would be fine to provide extra if the compiler cannot determine the exact amount, and to throw an exception at runtime if the workspace provided is under the limit. If I eschew concurrent kernel execution, which might be good with Futhark anyway, then most of the GPU memory could be dedicated to a single Futhark kernel. Does Futhark do any concurrent kernel execution on its own? I neglected to ask that, but based on the papers I've read, I would guess that it focuses solely on parallelization.

Some algorithms, like FFT convolutions, require such ridiculous workspaces just by themselves - around 300Mb per call - that a memory pool is absolutely necessary.

athas commented 8 years ago

No concurrency yet, although I hope to get around to doing it at some point. There may be some applications that could benefit from it, but I'd like to see them first.

Have you measured allocation overhead as being proportional to allocation size? That surprises me. Given that the memory is not initialised, I'd imagine there being no real difference in the time it takes to allocate 10 vs 1000MiB.

mrakgr commented 8 years ago

Regarding that last question, I can safely say that allocation overhead depends massively on the allocation size. Multiplying two 1024x1024 matrices takes about as much time as allocating memory for their output, though it has been a while since I did those measurements so I do not remember exactly the sizes I measured.

Edit: I think that allocating and freeing 400Mb (for a 10k x 10k matrix) 100 times takes over a second.

Edit2: Here is a little script that I wrote:

#if INTERACTIVE
#load "spiral_q_v2.fsx"
#endif
open SpiralV3
open System

// Allocate and immediately dispose a device array of the given size 100 times,
// returning the elapsed time in milliseconds.
let time_it (size : int*int*int*int) =
    let stopwatch = Diagnostics.Stopwatch.StartNew()
    for i=1 to 100 do
        use a = d4M.createConstant size
        ()
    stopwatch.ElapsedMilliseconds

let sizes = [|(1024,1024,1,1);(10240,1024,1,1);(10240,10240,1,1)|]
sizes
|> Array.iter (fun size -> printfn "Timing 100 iterations of allocating and deallocating an array of size %A. Time elapsed is %i" size (time_it size))

Output:

Timing 100 iterations of allocating and deallocating an array of size (1024, 1024, 1, 1). Time elapsed is 114
Timing 100 iterations of allocating and deallocating an array of size (10240, 1024, 1, 1). Time elapsed is 1091
Timing 100 iterations of allocating and deallocating an array of size (10240, 10240, 1, 1). Time elapsed is 11621

As you can see, the time it takes to allocate a chunk of memory is pretty much linear in its size.

athas commented 8 years ago

I will perform some experiments on our interactive benchmarks. Those perform dozens of calls per second to fairly cheap functions, so any memory allocation overhead should show up.

\ Troels /\ Henriksen

athas commented 8 years ago

I manually modified generated code for the fluid benchmark to reuse all allocations across entry point invocations. The result was at most a 5% improvement in runtime performance (about 24ms per frame reduced to about 23ms per frame), and this is for a fairly allocation-heavy program (allocating 25 buffers per frame).

I'm not sure the added code complexity of adding memory reuse pools is worth it, at least not at this stage: there is other, lower-hanging fruit that does not bring additional runtime complexity. Concretely, I am thinking of a current student project that is working on reusing memory buffers within a Futhark program. This will result in at least a halving of the allocations per frame.

If we at some point find a program that really is bottlenecked by allocation speed, then we can revisit the issue. Gotta go fast, after all.

mrakgr commented 8 years ago

Neural nets are definitely allocation-heavy, as all the nodes need to be saved for the backwards pass. Wasn't there an NN in the Rodinia benchmarks? I'll take a look at it eventually.