IntelLabs / ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs
BSD 2-Clause "Simplified" License
294 stars 32 forks source link

bitarray selection is still broken #52

Closed ninegua closed 8 years ago

ninegua commented 8 years ago

The following program shows what is wrong:


@acc function sum1(A::Array{Float64,1})
    sum(A[A .> 0])
end

srand(0)
A = randn(10)
println(sum1(A))
println(@noacc sum1(A))

Basically something like A[C] where C is a bitarray is being translated to mmap in DomainIR, but this operation by itself _is not parallelizable_. So the subsequent fusion with the reduction in ParallelIR turned out to be wrong too.

This is only parallel when combined with a reduction such as sum(A[C]), or with an assignment such as A[C] = B[C]. We used to handle this just right, but the recent changes to support both bitarrays and ranges selection are too aggressive.

I'd suggest we leave :select as a DomainIR operation for A[C] instead of using :mmap, and treat :select properly in ParallelIR, including a sequential translation for A[C] when it cannot be parallelized.

ninegua commented 8 years ago

BTW, this is what is causing nengo to produce wrong output.

ehsantn commented 8 years ago

A[C] is semantically parallel I think. It could be translated to :mmap and reduce since it implies concatenation of the selected elements into a new array. Later in ParallelIR sum can be fused so the reduction is optimized. What do you think?

ninegua commented 8 years ago

How is bitarray selection parallel when A=[1,2,3,4,5] and A[A .< 3]==[1,2]?

ninegua commented 8 years ago

I see what you mean. But unless we support dynamically growing j2c_arrays, mapping it to parallel operation is going to be rather inefficient.

ehsantn commented 8 years ago

The backend implementation is a different issue. We can make it sequential for now if the operation wasn't removed in optimizations after going through the ParallelIR pipeline. Keeping DomainIR and ParallelIR general is useful I think.

ninegua commented 8 years ago

A deeper issue is that translating A[C] to mmap and reduce is not composable. What I mean is by this definition, it is hard to compose the translation of sum, and the translation of A[C]. Suppose we follow this translation like you proposed:

sum(x) ===> reduce(0, (a,b)->a + b, x)
A[C] ===> reduce([], (a, b) -> cat(a, b), mmap((a, c) -> c ? [a] : [], A, C))

Then as an outcome we have:

sum(A[C]) ===> reduce(z0, f0, reduce(z1, f1, mmap(f2, A, C)))
where 
   z0 = 0
   f0(a,b)=a+b
   z1=[]
   f1(a,b)=cat(a,b)
   f2(a,c)=c? [a] : []

I don't know a general rule to transform reduce-reduce for arbitrary z0, z1, f0, f1, f2 at DomainIR level. If we just go ahead and lower each reduce to Parallel IR, I doubt we'll be able to fuse these two loops since their loop counts are different.

This is why I'd prefer to keep A[C] as in select(A, C) in DomainIR, and lower reduce(z, f, select(A, C)) as a special case of lowering reduce in ParallelIR.

ehsantn commented 8 years ago

You are right. A separate select seems to be better.

ninegua commented 8 years ago

This is fixed in c408907152dfdd122e8f0fc3e75929c96afce7de. Test cases are added in e3796ada19a5b06e2f920b7b212b6a9e13441a96.