JuliaLabs / ShallowWaterBench


Lazy kernel representation for nested parallelism on GPUs #22

Open vchuravy opened 5 years ago

vchuravy commented 5 years ago

Yesterday on the call the question came up of how to extend this so that we can exploit the within-kernel parallelism that is present in

    overelems(mesh, h, bathymetry, U⃗, Δh, ΔU⃗) do elem, mesh, h, bathymetry, U⃗, Δh, ΔU⃗
        @inbounds begin
            hbₑ = bathymetry[elem]
            htₑ = h[elem] + hbₑ
            U⃗ₑ  = U⃗[elem]

            Δh[elem]  += ∫∇Ψ(dX⃗ * U⃗ₑ * J)
            ΔU⃗[elem]  += ∫∇Ψ(dX⃗ * (U⃗ₑ * U⃗ₑ' / htₑ + g * (htₑ^2 - hbₑ^2)/2 * I) * J)
        end
    end

and in particular how we can move from one element per GPU thread to multiple GPU threads per element, in order to scale to higher dimensions and higher approximation orders. (I suspect that the current code is really only efficient for small orders, since we are already seeing high register pressure.) Peter and I had a brief discussion about this yesterday and we came to the conclusion that one could represent the kernel above in a lazy fashion (taking the structure of broadcast as an example) and then compile it differently on the CPU and GPU, where the GPU version could load the elements into shared memory and let the threads cooperatively work on the data.
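For concreteness, here is a minimal sketch of what such a lazy representation could look like, loosely modelled on `Base.Broadcast.Broadcasted`. Every name in it (`LazyKernel`, `lazy_overelems`, `materialize!`, the backend types) is hypothetical and only illustrates the CPU/GPU split, not an existing API in this repository:

    # Hypothetical sketch: record the element kernel and its arguments
    # without running them, analogously to Base.Broadcast.Broadcasted.
    struct LazyKernel{F, Args<:Tuple}
        f::F          # the do-block body passed to overelems
        args::Args    # mesh, h, bathymetry, U⃗, Δh, ΔU⃗, ...
    end

    # Instead of executing eagerly, overelems would return the lazy object.
    lazy_overelems(f, args...) = LazyKernel(f, args)

    struct CPUBackend end
    struct GPUBackend end

    # CPU lowering: one element per loop iteration, as overelems does today.
    function materialize!(::CPUBackend, k::LazyKernel, elems)
        for elem in elems
            k.f(elem, k.args...)
        end
        return nothing
    end

    # GPU lowering: one block per element, so the threads of a block can
    # stage that element's data into shared memory and work on it
    # cooperatively, e.g. `@cuda blocks=length(elems) threads=npoints ...`.
    function materialize!(::GPUBackend, k::LazyKernel, elems)
        error("GPU lowering intentionally not sketched here")
    end

The point is that the do-block is only captured at the call site; the decision of how to map elements to threads is deferred to `materialize!`.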

There is an unrelated question: I would like to understand where the register pressure is coming from and whether that is something we can optimise further, or whether it is due to the Julia features we are using, but I haven't found any tool that lets us properly reason about the register allocation patterns.
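The closest thing I know of is dumping the generated device code and looking for spills by hand: CUDAnative's reflection macros print the PTX/SASS for a launch, and spills show up there as local loads/stores. A minimal, self-contained example (the `saxpy!` kernel is only a stand-in for the real kernels):

    using CUDAnative, CuArrays

    function saxpy!(y, x, a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] = a * x[i] + y[i]
        end
        return nothing
    end

    x = cu(ones(Float32, 1024))
    y = cu(fill(2.0f0, 1024))

    # Print the compiled device code for this launch; spilled registers
    # appear as local loads/stores in the SASS listing.
    @device_code_ptx  @cuda threads=256 blocks=4 saxpy!(y, x, 2.0f0)
    @device_code_sass @cuda threads=256 blocks=4 saxpy!(y, x, 2.0f0)

That still falls short of explaining why the register allocator makes a given choice, but it at least makes the spills and the leftover overflow-check calls visible.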

ali-ramadhan commented 5 years ago

@vchuravy I don't know much about GPU profiling but I was wondering if something like VTune Amplifier would give some idea as to which LoC/features are spending the most time allocating/copying memory (if that's a proxy for register pressure?). I don't even know if you can profile Julia GPU code with VTune yet.

lcw commented 5 years ago

Thanks @vchuravy for summarizing the discussion we had yesterday. I look forward to seeing what can come out of this space.

Look here and here for other people using the one-thread-per-element approach, and here for how performance results are typically reported these days (comparing against a memcpy).
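(For reference, the memcpy comparison is essentially: time a device-to-device copy of the same data volume the kernel touches and quote the kernel's achieved bandwidth as a fraction of that. A rough sketch, assuming a CuArrays/CUDAdrv setup:

    using CuArrays, CUDAdrv

    n = 10^7
    a = cu(rand(Float32, n)); b = similar(a)

    copyto!(b, a); CUDAdrv.synchronize()      # warm-up
    t_copy = Base.@elapsed begin
        copyto!(b, a)                         # device-to-device copy
        CUDAdrv.synchronize()
    end

    bytes = 2 * sizeof(a)                     # read a, write b
    println("copy bandwidth ≈ ", bytes / t_copy / 1e9, " GB/s")

A kernel that moves the same number of bytes can then be reported as a percentage of this copy bandwidth.)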

For reference, here is how the DOE is approaching performance portability for their atmospheric code. It is a similar, though not identical, numerical method to the one we will be implementing.

And here is an attempt that I tried in the past, although it never made it into a production code. What I like about this approach is that it gives the developer knobs to tune the code by splitting the algorithm from the data layout and the schedule of execution.
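A rough sketch of what those knobs could look like in Julia; everything here is made up for illustration (the real attempt linked above is more sophisticated), but it shows the shape of the separation: the numerics are written once, while layout and schedule are types the driver dispatches on.

    # Illustrative only: separate "what to compute" from "how data is laid
    # out" and "how the work is scheduled".
    abstract type Layout end
    struct SoA <: Layout end                 # struct-of-arrays storage
    struct AoS <: Layout end                 # array-of-structs storage

    abstract type Schedule end
    struct ElementPerThread <: Schedule end  # current one-thread-per-element mapping
    struct ElementPerBlock  <: Schedule end  # one block per element, cooperative threads

    # The numerical kernel only sees an abstract per-element field view ...
    volume_rhs!(Δh, elem_fields, elem) = nothing   # placeholder for the real update

    # ... and the driver dispatches on (Schedule, Layout) to decide how to
    # iterate and how to hand data to the kernel.
    function run!(::ElementPerThread, layout::Layout, Δh, fields, elems)
        for elem in elems                    # or: one GPU thread per elem
            volume_rhs!(Δh, element_view(layout, fields, elem), elem)
        end
    end

    element_view(::SoA, fields, elem) = map(f -> view(f, :, elem), fields)
    element_view(::AoS, fields, elem) = view(fields, :, elem)

Swapping `SoA` for `AoS`, or `ElementPerThread` for `ElementPerBlock`, then never touches `volume_rhs!`.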

vchuravy commented 5 years ago

Splitting the algorithm from the data layout and the schedule of execution.

Absolutely! That would be my goal in the end.

Regarding performance measurements, there is a lot of headroom still to address and we are a far cry from having the same behaviour as memcpy. I spent a little more time on profiling today and 80% of global memory traffic is caused by spills (flux integral). I have several ideas for how to address this, but I won't have time to play with this code for a while. I started prototyping one of the earlier ideas (https://github.com/JuliaLabs/ShallowWaterBench/tree/vc/facedix), where the idea was to force specialisation on the face so that the index being accessed is statically known and a bunch of code can be statically eliminated. The getindex seems to be the primary culprit, and that code can be simplified a lot at https://github.com/JuliaLabs/ShallowWaterBench/blob/18b0d42033ac8f12df688bd0b9f2e42dd50d72da/src/TotallyNotApproxFun/src/TotallyNotApproxFun.jl#L177.

Another issue is that, since Julia is a safer language than C, we still have the remnants of overflow checks etc. in the code. They are never triggered, but they manifest as `CALL.REL.NOINC($ptxcall_kernelf_2$ptx_throw_overflowerr_binaryop);` in the SASS and might be causing extra register pressure.
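Roughly, the vc/facedix idea looks like the following sketch (helper names such as `surface_term` are made up here, and the branch itself may differ): dispatching on `Val(face)` turns the face index into a compile-time constant, so the per-face branches and index arithmetic inside getindex can be folded away instead of occupying registers.

    # Placeholder standing in for the real per-face surface contribution.
    surface_term(h, elem, ::Val{face}) where {face} = zero(eltype(h))

    # Entry point: lift the runtime face index into the type domain.
    face_flux!(Δh, h, elem, face::Int) = face_flux!(Δh, h, elem, Val(face))

    @inline function face_flux!(Δh, h, elem, ::Val{face}) where {face}
        # `face` is a literal in the generated code, so a lookup like
        # `neighbours(elem)[face]` can resolve at compile time.
        @inbounds Δh[elem] += surface_term(h, elem, Val(face))
        return nothing
    end

    # Unrolling over the faces keeps every call site statically specialised.
    @inline function all_faces!(Δh, h, elem, ::Val{nfaces}) where {nfaces}
        ntuple(f -> face_flux!(Δh, h, elem, Val(f)), Val(nfaces))
        return nothing
    end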

jkozdon commented 5 years ago

Another thing you may want to keep in mind is that once the geometric metric terms are non-constant, the register pressure will be even higher, since they introduce another 5 fields in 2-D and 10 fields in 3-D into the discretization (the Jacobian determinant as well as all the metric derivatives, i.e. 1 + 4 in 2-D and 1 + 9 in 3-D).