AccelerateHS / accelerate

Embedded language for high-performance array computations
https://www.acceleratehs.org

Array-valued stencil functions [was: sorting algorithm] #142

Open kdmadej opened 10 years ago

kdmadej commented 10 years ago

Is there any sorting algorithm implemented in the accelerate package (or are there any plans to include such an implementation in the package) or should I implement one on my own?

tmcdonell commented 10 years ago

There is a very basic implementation of radix sort included in the examples: https://github.com/AccelerateHS/accelerate-examples/blob/master/examples/nofib/Test/Spectral/RadixSort.hs#L115

However its absolute performance is not very good, compared to state-of-the-art GPU sorting techniques. If you need high performance, then I'd recommend using the Accelerate FFI to call to a different implementation such as Thrust. I intended to write a package to do this, but haven't found the time yet.

wdanilo commented 10 years ago

@tmcdonell: I'm working with @zgredzik. He was talking about sorting a floating-point array, so radix sort is not usable here as far as I know. A simple quicksort would probably be fine for this use case (computing the median of an image patch). Is there a better way to do it?

tmcdonell commented 10 years ago

Ah! Yes radix sort won't work in this case because there is no coercion function in Accelerate. The Thrust version of radix sort does support floating point, however.

Otherwise, you could try this version of quick sort that I stumbled upon a while ago. I don't know how good/complete/fast it is however, as I haven't yet tried it. https://github.com/tkvogt/accelerate-quicksort

mchakravarty commented 10 years ago

@wdanilo I believe you can also use radix sort for floating-point numbers. @gckeller, didn't you look into this?

Doing quicksort even halfway efficiently on a GPU is probably tricky. I'd rather use merge sort (which is easier to parallelise) if you cannot use radix sort.

How big is the array that you need to sort? (As sorting is a memory-bound, not a compute-bound, problem, it may very well be faster to just sort on the CPU.)

If your arrays are big enough to sort on a GPU, but not very big, NVIDIA seems to recommend using bitonic sort or odd-even mergesort: http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-sorting-networks

wdanilo commented 10 years ago

@tmcdonell, @mchakravarty: Thank you for the replies :) I'll have to check the radix sort tomorrow.

I think we are talking about slightly different use cases. This sorting is needed by a "median filter" - just a stencil - one of 3x3, 5x5, 7x7, up to about 30x30. Since it is a stencil, the sorting is done for each "window", which gives approximately as many sorts as there are pixels - so it is naturally parallel. I don't think that in such a case we should consider making a super-efficient parallel sort - it should rather be a simple, non-parallel sort, because it will be run on each window in parallel. Am I right?

@mchakravarty: thank you for the link! It seems very interesting! :)

kdmadej commented 10 years ago

Slightly off topic: I decided to implement some kind of sort ASAP, so I went with the least tedious implementation I could think of:

quicksort array = (lesser xs) A.++ (A.unit p) A.++ (greater xs)
    where p = array A.!! 0
          xs = A.tail array
          lesser  = A.filter (A.<* p)
          greater = A.filter (A.>=* p)

Imagine my surprise when I realised I can't concatenate Scalars with Vectors (well, admittedly that wouldn't have happened if I had checked the types beforehand, but still). Is this really intended behaviour, and if so, can someone explain to me why? It seemed pretty logical at first.

tmcdonell commented 10 years ago

Yes, that is intended. Concatenating a scalar and a vector makes sense because there is only one way to do it, but that's not the case if you want to concatenate, say, a vector with a matrix. (++) here works uniformly for any dimensionality, intersecting the lower-dimensional components, so both arguments must have the same dimensionality.

You could reshape p, or just define it as take 1, which also avoids a round-trip to the CPU.
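
For example, a rough sketch of the take 1 variant (mirroring the snippet above and its operators; note that, like the original, it performs only a single partition step around the head element, not a full recursive sort):

import qualified Data.Array.Accelerate as A

-- Single partition step around the first element; 'A.take 1 array' is a
-- length-1 Vector, so (A.++) typechecks, and the pivot value never leaves
-- the device.
partitionAroundHead :: A.Acc (A.Vector Double) -> A.Acc (A.Vector Double)
partitionAroundHead array = lesser A.++ pivot A.++ greater
  where
    p       = array A.!! 0
    pivot   = A.take 1 array
    xs      = A.tail array
    lesser  = A.filter (A.<* p) xs
    greater = A.filter (A.>=* p) xs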

kdmadej commented 10 years ago

Thank you for the info. Unfortunately I stumbled upon another problem. I'm trying to calculate the median filter using the stencil method. My question would be: is there any way to sort the values that are matched by the stencil in each iteration?

I tried converting them to both plain Haskell arrays and Accelerate arrays, then sorting them and extracting the middle value, but in both cases I ran into different problems. As far as I can tell, I have to work with Exp Bool when comparing Exp values, and that limits me to working with Accelerate arrays. The problem is that I have no idea how to create an Accelerate array out of the matched stencil.

When doing something like:

A.stencil middleValue A.Mirror array
    where middleValue ((a,b,c),(d,e,f),(g,h,i)) = (A.use $ A.fromList (A.Z A.:. 9) [a,b,c,d,e,f,g,h,i]) A.!! 4

I end up with an Accelerate array full of Exp (Exp Double) values. I tried playing around with the unlift function, but that didn't work out.

Does anyone have any suggestions on how to implement such a filter using Accelerate? I'd be grateful for any thoughts and/or suggestions. Please take into consideration that I'm focusing on getting this working any way possible right now; performance will become an issue at a later stage of development.

tmcdonell commented 10 years ago

Aside: longer or more broadly applicable questions might be better directed towards the mailing list, as they get more exposure there.

Anyway, yes, you can't do that, because you are attempting to express nested parallelism. Your problem isn't quite that general, but we still don't have a way to encode performing some computation on a local, temporary array.

I think you might be able to encode something like bitonic sort on a list of scalar (Exp) values. Try using Haskell (the host language) to generate the appropriate scalar code programmatically. On output you'll just get a collection of scalar values, rather than an array, but it might work...

mchakravarty commented 10 years ago

Carrying on from @tmcdonell's comment, this is exactly the way to do it. Why? There are two reasons: (1) You do not have a general sorting problem, because the array you want to sort is always of the same length (i.e., the number of elements in the stencil). (2) The fastest way to sort on a GPU is using a sorting network (which you can encode with Accelerate for fixed size sorting problems).

As @tmcdonell explained, you cannot turn the scalar values of a stencil into an array. Why? Because that would be nested parallelism (using parallel array computations inside parallel array computations) and that's hard to implement efficiently on GPUs in the general case. (@robeverest is working on this, but it is a PhD level research problem, so don't hold your breath just yet.)

So what you want to do instead is to write a Haskell function that generates a scalar Accelerate computation that implements sorting for a fixed number of scalar values. To illustrate the idea, the following function sorts two values:

sort2 (a, b) = a < b ? (a, b) : (b, a)

And now for three values:

sort3 (a, b, c) = a < b ? (a < c ? (b < c ? (a, b, c) : (a, c, b)) : (c, a, b))
                        : (b < c ? (a < c ? (b, a, c) : (b, c, a)) : (c, b, a))

and so on.

This is, of course, too tedious to write by hand for larger sets of values. Hence, write vanilla Haskell code that generates Accelerate code for sortN for a fixed N. This is, btw, exactly how sorting is implemented in hardware (in network switches etc.), so there is a lot of work on efficient fixed-size sorting networks. Common ones are Batcher's odd-even mergesort http://en.wikipedia.org/wiki/Batcher_odd–even_mergesort (popular on GPUs) and bitonic sort http://en.wikipedia.org/wiki/Bitonic_sort. (These sorting networks tend to have a recursive structure, so you can construct a sorter for a network of size 2 * n from sorters for networks of size n. You can exploit that in your generator.)
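
To make the generator idea concrete, here is a rough sketch (not from this thread) of an odd-even transposition network over a plain Haskell list of Exp values. Using A.min/A.max for the compare-and-swap keeps everything scalar and sidesteps the lift/unlift that the tuple-returning sort2/sort3 above would need in real Accelerate code; the list machinery runs entirely at program-generation time:

import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate ( Exp )

-- One compare-and-swap: purely scalar Accelerate code, no arrays involved.
cas :: Exp Double -> Exp Double -> (Exp Double, Exp Double)
cas a b = (A.min a b, A.max a b)

-- Odd-even transposition network over a Haskell list of scalar values.
-- The list exists only at program-generation time; the result is a fixed
-- tree of min/max expressions, i.e. a sorting network for this exact length.
sortN :: [Exp Double] -> [Exp Double]
sortN xs = iterate (oddPass . evenPass) xs !! length xs
  where
    evenPass (a:b:rest) = let (lo, hi) = cas a b in lo : hi : evenPass rest
    evenPass rest       = rest
    oddPass  (a:rest)   = a : evenPass rest
    oddPass  rest       = rest

-- For a median filter only the middle element of the sorted window is needed,
-- e.g. the 5th of 9 values for a 3x3 stencil.
medianOf :: [Exp Double] -> Exp Double
medianOf xs = sortN xs !! (length xs `div` 2)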

You'll need to write such a sorter generator for nested tuples as used in Accelerate stencils. Does that make sense?

mchakravarty commented 10 years ago

> Aside: longer or more broadly applicable questions might be better directed towards the mailing list, as they get more exposure there.

On the other hand, I really like that these discussions are much better organised in GitHub issues. Maybe we should just forward the GitHub Issue emails to the list?

kdmadej commented 10 years ago

@tmcdonell @mchakravarty guys thanks a lot for your help. That was exactly the kind of explanation I needed to justify sorting the values with functions for fixed-size collections. Feels great to learn something new early in the morning :+1:

wdanilo commented 10 years ago

@mchakravarty: I've got a related question for you, if I might ask. (I'm writing here because you said you like how the discussions are organised on GitHub. If you and @tmcdonell decide such discussions are better suited to the mailing list, I will ask @zgredzik and all the guys to write there instead of here :) )

I think the question is straightforward:

Right now, we define stencils using tuples of tuples. Why can't we just use Accelerate arrays for this? We know their dimensions and layout at compile time, so theoretically we should be able to optimise them the same way as tuples. Am I right?

tmcdonell commented 10 years ago

> On the other hand, I really like that these discussions are much better organised in GitHub issues. Maybe we should just forward the GitHub Issue emails to the list?

@mchakravarty Actually this is true. Having the mailing list subscribe to github notifications might be a good idea, although some people will receive the message twice. Oh well.

tmcdonell commented 10 years ago

@wdanilo With tuples it's easy to statically determine, in the type, the size and dimensionality of the stencil and make sure everything matches up. For arrays we only know the dimensionality, not the size/extent.

Repa has some nice Template Haskell to make defining stencils easier, in a more visual layout. Adding some syntactic sugar in this way would be much easier.

mchakravarty commented 10 years ago

> @mchakravarty Actually this is true. Having the mailing list subscribe to github notifications might be a good idea, although some people will receive the message twice. Oh well.

I trust that anybody who is using GitHub has the skill to filter out some duplicate messages if they are bothered by them.

mchakravarty commented 10 years ago

> Right now, we define stencils using tuples of tuples. Why can't we just use Accelerate arrays for this? We know their dimensions and layout at compile time, so theoretically we should be able to optimise them the same way as tuples. Am I right?

As @tmcdonell wrote, the crucial difference between arrays and tuples is that the tuple representation fixes the exact size of the stencil. If we used arrays, we would need an extra argument to specify the size of the stencil array. This might seem easier, but it is also more error prone, because you can get out-of-bounds errors for stencil arrays (whereas there is no such thing as an out-of-bounds error in the tuple representation). Another advantage of tuples is that the code generator can statically see whether you use a sparse stencil (i.e., many of the stencil values aren't used in the stencil expression) and avoid loading the unused values, which can significantly improve performance. With arrays, that would be much more difficult, as arithmetic on array indices makes it hard to statically predict the access patterns.
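
To make the sparse-stencil point concrete, a small illustrative example (not from this thread): in the function below the four corner values are matched but never used, so the code generator can see this statically and avoid loading them. It would be applied with something like A.stencil laplace A.Mirror img, as in the earlier snippet.

import qualified Data.Array.Accelerate as A

-- 3x3 Laplace-style stencil: only the centre and the four direct neighbours
-- are used; the corners are ignored and need not be loaded.
laplace :: A.Stencil3x3 Double -> A.Exp Double
laplace ( (_, t, _)
        , (l, c, r)
        , (_, b, _) ) = t + b + l + r - 4 * c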

Nevertheless, I fully understand that tuples are sometimes inconvenient to work with. More precisely, I think the tuple representation is usually more convenient for small stencils, whereas an array representation would be more convenient for large stencils. My understanding is that in graphics (and computer vision) algorithms, you often have larger stencils, which —I assume— is why you'd rather have an array representation. (In contrast, in applications from scientific computing, stencils are usually small and the tuple representation works very well.)

In summary, we really would like to have two flavours of stencil operations: (1) one kind using tuples and (2) another kind using arrays to hold the stencil values. Both representations have their advantages and it depends on the algorithm you implement which one is more convenient. So far, most Accelerate users were happy with tuples, but as you seem to use algorithms where arrays might be convenient, maybe you can help us with the design of the API.

So, here are some of the open questions:

@robeverest Given your extensions to handle array computations in array computations, could we get a stencil operation that gets an array-valued stencil function as an argument through the frontend? (This is modulo the question of how to restrict the admissible operations inside such an array stencil operation.)

mchakravarty commented 10 years ago

BTW, this is another advantage of having questions here on GitHub: we can turn them into feature discussions/bug reports in place.

robeverest commented 10 years ago

> @robeverest Given your extensions to handle array computations in array computations, could we get a stencil operation that gets an array-valued stencil function as an argument through the frontend? (This is modulo the question of how to restrict the admissible operations inside such an array stencil operation.)

Stencil operations are tricky when it comes to vectorisation generally. In order to have stencil operations in a nested context (i.e. embedded in another array computation) we would need a segmented version of the stencil operation. Something like that may be constructible from existing combinators, but it wouldn't be at all efficient, thus defeating the purpose of having the specialised stencil construct.

Because of this, my solution has been, at least for now, to restrict stencil operations to work only at the top level of nesting, with no nested parallelism in either the function argument or the array argument (an error being generated by the vectoriser otherwise). In reality we can do a bit better than this: the array argument can have nested parallelism as long as it's entirely contained within it (i.e. no dependence on free variables that had to be lifted), and I'm working on this now. Problems like this aren't limited to stencil operations.

Back to the original question, given that stencil operations are essentially left mostly as-is anyway, I see no reason we couldn't have one that took a function over arrays. As you say, we'd have to restrict the operations inside the function. I would suggest perhaps allowing for scalar functions that can take arrays as arguments. For example given our current definition of scalar functions:

data PreOpenFun (acc :: * -> * -> * -> *) env aenv t where
  Body :: Elt t => PreOpenExp acc env      aenv t -> PreOpenFun acc env aenv t
  Lam  :: Elt a => PreOpenFun acc (env, a) aenv t -> PreOpenFun acc env aenv (a -> t)

Add an extra type of lambda

  Slam :: Arrays a => PreOpenFun acc env (aenv, a) t -> PreOpenFun acc env aenv (a -> t)

This would allow us to express the functions we want. Dealing with this from within the code generator might be tricky, however.

mchakravarty commented 10 years ago

@robeverest I don't understand why we cannot have nested parallelism in the array argument. Is it a problem in the case where we fuse a nested producer into a stencil computation?

> Add an extra type of lambda
>
> Slam :: Arrays a => PreOpenFun acc env (aenv, a) t -> PreOpenFun acc env aenv (a -> t)

I don't think that would work. A stencil function needs to return an array (which must be of the same extent as the argument array).

robeverest commented 10 years ago

> @robeverest I don't understand why we cannot have nested parallelism in the array argument. Is it a problem in the case where we fuse a nested producer into a stencil computation?

I perhaps worded that badly. You're right, we can have nested parallelism in the array argument; it's just harder when the array argument depends on variables that were previously lifted. For example, if we assume that the array argument contains nested parallelism, then it has to be lifted. So we take it from something like

Array sh e

to

(Vector sh, Vector e)

(This is just the flattened array representation I'm using for now. Obviously, this may change, but the point I'm making is that we have no way of performing a stencil operation over any flattened array representation.)

However, if it does not depend on any already lifted variables, we can lift it into something like

Int -> (Vector sh, Vector e)

which allows us to "choose" the length of the segment descriptors. If we choose the length to be just 1, then we can reshape the value array to the required dimension and perform a normal stencil operation over it. With the changes I'm making at the moment, this should always be possible provided the stencil operation is at the top level of nesting (or can be hoisted up to the top level), a restriction we have to have anyway.

> Add an extra type of lambda
>
> Slam :: Arrays a => PreOpenFun acc env (aenv, a) t -> PreOpenFun acc env aenv (a -> t)
>
> I don't think that would work. A stencil function needs to return an array (which must be of the same extent as the argument array).

Oh right, I misunderstood. I thought we wanted stencil functions which took an array argument, but still returned scalar values, like our stencil functions do now. If we want functions from arrays to arrays, that's going to be harder. I don't immediately see an obvious way to do that. Ultimately, we would need scalar functions that could manipulate arrays in some way - i.e. a scalar map.

kdmadej commented 10 years ago

Referring to

> I don't think that would work. A stencil function needs to return an array (which must be of the same extent as the argument array).

and

> Oh right, I misunderstood. I thought we wanted stencil functions which took an array argument, but still returned scalar values, like our stencil functions do now.

Isn't that the case? As far as my understanding goes, the stencil operation takes an array and returns an array, but the stencil function used inside it should, in my opinion, return a scalar value (I can't think of a situation where I'd take the neighbourhood of a pixel and return a whole new neighbourhood instead of just a new value for that single pixel). Please correct me if I'm mistaken or have misunderstood something.

On a slightly different note: I came across another use case that is probably impossible to implement without support for array stencils. When trying to scale an image and then apply a re-sampling filter, the base filtering window should get resized according to the scaling factor. For instance, a Catmull-Rom filter that takes 2 pixels from each side of the central one should, after scaling the image up 2 times, take 4 from each side in the resulting image (unless, of course, I understood nothing from the materials I've found on the oh-so-wonderful Internet). That requires handling a dynamic kernel size for the stencil function, which would in my opinion be a real pain in the... brain... to implement using tuples.

This is probably a good example of a situation where sparse array stencils would come in handy. After scaling up more than 2 times, groups of consecutive pixels hold the exact same value, so for some filters it might be sufficient to consider just every n-th pixel rather than all of them.

mchakravarty commented 10 years ago

@zgredzik You are right about the stencil function. I got confused here.

How important is this functionality to you guys? I'm asking because @robeverest's extensions are in an experimental branch of Accelerate, and it will take a while before they are ready for general use. If you need array-based stencil functions really badly, we should think about whether there is a way to graft them onto the current main branch.

kdmadej commented 10 years ago

It seems that this functionality is more important to us than we anticipated at first. Even though I probably made a mistake regarding the kernels used for re-sampling when scaling an image, another use case came up.

Some solutions/algorithms allow the user to define a filter with an image. The problem here is that the size of the image, and thus the size of the stencil, remains unknown until runtime. Additionally, it can most likely vary quite a lot, so hard-coding fragments of code to handle different tuple sizes might not be an option. Trying to solve this with nested tuples and loading images into them seems like overkill to me (actually, it probably wouldn't even work here).

In general, the deeper we dig into image processing, the more we realise that functionality like array-valued stencils is something we'll probably have a hard time working without.

Furthermore, when I tried to generate the code for sorting tuples of different sizes, I realised that the number of comparisons that would have to be generated in order to sort a tuple representing a larger stencil (e.g. 30x30, i.e. 900 values, which for Batcher-style networks already means tens of thousands of compare-and-swap operations) is so outrageous that I'm not even sure such a large program would fit into the resources of the graphics card.

mchakravarty commented 10 years ago

@zgredzik Ok, I fully accept that we need array-valued stencils. So, let's try to figure out what functionality is needed and how to realise it.

In particular, one design decision is whether the size of stencils is statically known. Even with an array-valued stencil, we could require its size to be statically known. At the very least, this would allow us to optimise its implementation.

In your example, you are saying that when you define a filter with an image, you won't know the size of the stencil until runtime. Which runtime do you mean? After all, Accelerate programs are dynamically compiled at Haskell runtime. Do you mean that you don't know the image size when the Accelerate program is (runtime) compiled?

(Please excuse me if this is a naive question, but I don't know much about how images are filtered with stencils.)

kdmadej commented 10 years ago

My bad, it slipped my mind that there are actually two runtimes (the runtime of the pure Haskell code and that of the Accelerate-generated CUDA code, am I right?).

If my understanding of the way Accelerate works is correct, then I don't know the image size at the time the Accelerate program is being compiled. I'm using the accelerate-io package to read the image, so I guess the size is only discovered when the final Accelerate computation is run.

mchakravarty commented 10 years ago

@zgredzik That would still make it an Accelerate compile-time value, because you first read the image and only then use it.

We haven't defined the interface for array-based stencils yet, but for statically sized stencils, the stencil definition would have to include the stencil size as Haskell values (not as Exp values). That means that after you read your image, you would have to get its size and then pass those values as normal Haskell values to the stencil function. This all seems feasible. The major trade-off is that, because the stencil function gets the size as Haskell values that can only be computed after the image has been read, the stencil function (and the Accelerate computation it is embedded in) would be newly compiled each time you read a new image (or at least an image of a size not used before).

On one hand, that causes some overhead, but on the other hand, the resulting GPU code would probably run faster, too. So, I wonder: (1) are these filters that use images as stencils expensive computations, and (2) would the filters often be run with filter images of varying sizes?

tmcdonell commented 10 years ago

I should also point out that if the stencil size changes per image, and you are thus generating a new Acc program each time, then you won't be able to use the run1 execution style. Even if the kernels are cached because they've been used before, there will still be a big performance hit.
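
(For reference, a minimal sketch of the run1 style referred to here, assuming the CUDA backend and a placeholder computation:)

import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.CUDA as CUDA
import Data.Array.Accelerate ( Acc, Array, DIM2 )

-- 'run1' compiles the Acc program once and reuses the compiled kernels for
-- every subsequent input array, avoiding per-image compilation overhead as
-- long as the program itself does not change.
applyFilter :: Array DIM2 Float -> Array DIM2 Float
applyFilter = CUDA.run1 myFilter
  where
    -- Placeholder for the real filter computation.
    myFilter :: Acc (Array DIM2 Float) -> Acc (Array DIM2 Float)
    myFilter = A.map (* 2)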

mchakravarty commented 10 years ago

@tmcdonell It depends on whether the size changes for every filtered image. If you load one filter image to apply as a filter to many other images, you can still use run1 once for each filter image.

So, it really depends how much work there is for every loaded image used as a filter.