IntelLabs / ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs
BSD 2-Clause "Simplified" License
294 stars 32 forks source link

Comprehensions don't fuse #76

Closed ehsantn closed 8 years ago

ehsantn commented 8 years ago

The outer loops in the following code can be fused but ParallelAccelerator doesn't fuse them. Also, arr[:,i] is being copied to an intermediate array which is unnecessary.

using ParallelAccelerator

@acc function fuseme(x,y)
    arr = Float64[ i+j+0.3 for j in 1:x, i in 1:y]
    res = Int[ indmin(arr[:,i]) for i in 1:y]
end

fuseme(3,4)
DrTodd13 commented 8 years ago

@ninegua pointed out that if the indmin line had arr[i,:] instead of arr[:,i] then you couldn't fuse even though the iterations spaces are the same. Of course, if x and y are different then arr[i,:] would be an incorrect program. Even if they were the same (as below) then you still have this issue that presumably you'd need to detect and not fuse in this case. @ninegua thinks the analysis is going to get complicated fast (and I agree) and on top of that the complication of splitting the 2D parfor in order to fuse it and the figuring out when the nests of the second parfor is identical to some subset of the nests of the first parfor makes this complicated. Still, this sounds like something we could do long term but I doubt by 4/10.

@acc function fuseme(N) arr = Float64[ i+j+0.3 for j in 1:N, i in 1:N] res = Int[ indmin(arr[i,:]) for i in 1:N] end

ninegua commented 8 years ago

I talked to @ehsantn just now, there does seem to be a solution to this. If we translate indmin into a parfor loop, then the second comprehension becomes a two-level nested parfor, and the first comprehension is a single parfor with two loopNest. If we attempt to break the first parfor into a nested one, it could in theory fuse with the second, and then we have an outer parfor with a body consisting of two adjacent parfors that potentially could fuse.

Of course there is a lot of details involved, like how the intermediate array (arr in the above program) should be handled, and how we determine if the access pattern of the second loop allows the two outer loops to fuse, and what happens when the inner loops don't fuse (in which case I suspect we cannot get rid of arr and hence make the entire thing less profitable).

Still lots of unknowns, but at least fusing parfor with loopNests with a nested parfor sounds like an interesting topic to explore, if it helps to get rid of intermediate allocation.

ninegua commented 8 years ago

Another thought just came across my mind, maybe it is even possible to turn nested parfor into a single parfor with two loopNest? We may have to extend the reduction variable from scalar into vector, but doesn't look so far fetched.

ehsantn commented 8 years ago

@ninegua I was thinking about handling this in mmapInline() phase since arr is not used anywhere else. Is it possible?

ehsantn commented 8 years ago

As a short-term solution, I tried the code below. However, ParallelAccelerator chokes on some :boundscheck nodes ($(Expr(:boundscheck, false)) and $(Expr(:boundscheck, :(Main.pop)))). I tried @inbounds but it didn't help.

using ParallelAccelerator

@acc function fuseme(x,y)
    arr = [ Float64[i+j+0.3 for j in 1:x] for i in 1:y]
    res = Int[ indmin(arr[i]) for i in 1:y]
end

fuseme(3,4)
ehsantn commented 8 years ago

I hacked domain-ir to remove those nodes to see what happens. It still doesn't fuse :(

DrTodd13 commented 8 years ago

Fixed in 0ce4ad6.

Fusion does three kinds of substitutions in the second parfor's body in the process of creating the fused parfor. 1) Replace the second parfor's indices with the firsts'. 2) If array X is created in the first parfor, replace arrayref(X,...) with an element instance variable corresponding to the current iteration's value of that element of X. 3) Replace references to array names as a whole in certain other circumstances.

The problem here was with the 2nd kind of substitution. It was only looking for arrayref(X,...) and unsafe_arrayref(X,...). In this particular case, it was ParallelAccelerator.API.getindex instance instead. I've added that as another case to look for and this example now seems to work.

Close if you are satisfied.

ehsantn commented 8 years ago

It works. Thank you, Todd.