Hi @liuyuchuncn,
Thank you for profiling this kernel and for aiming to improve its performance, this is super welcome! :rocket: :sparkles:
Let me also loop in my colleague @WeiqunZhang, who last year reduced the register pressure somewhat by doing compile-time splitting of this kernel in #3402 and #3696.
There can be more done for sure and we are happy to support you in optimizing this! Thanks a lot for your interest already!
@liuyuchuncn Could you try this PR? https://github.com/ECP-WarpX/WarpX/pull/5217
@liuyuchuncn can you share the inputs file you used for benchmarking please? :) We just want to make sure we look at the same kernels.
Hello @liuyuchuncn, thanks for reporting this. I also agree that push/deposition has a lot of potential for improvement. Which GPU are you referring to with M200, do you mean AMD MI210 or something else?
The main issue I see is the computation of the shape factor. The code https://github.com/ECP-WarpX/WarpX/blob/5b34b84dfc588c8f6cf8088e94cd884e850344f7/Source/Particles/Gather/FieldGather.H#L162-L194 relies heavily on local arrays (e.g. `amrex::Real sz_node[depos_order + 1];`) and on swapping them when necessary. However, on GPU these local arrays do not really exist the way they do on CPU: when an array is indexed with a dynamic index (one not known at compile time), the full array has to be placed in thread-local memory (which resides in the GPU main memory and is therefore much slower than registers) before the indexed value can be retrieved. Ideally, the shape factor would be fully rewritten to not include local arrays (see https://github.com/Hi-PACE/hipace/blob/40fed7b01b1b985ecd6b417c37ec99d76ef2bc7a/src/particles/particles_utils/ShapeFactors.H#L123).
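To illustrate the idea, here is a minimal sketch of a shape factor computed into individual scalars instead of a local array, so each weight can live in its own register. This is a hypothetical example with illustrative names (`Shape2`, `compute_shape_2`), not the WarpX or Hi-PACE API, and it shows only the 2nd-order (TSC) case:

```cpp
#include <cmath>

// Hypothetical sketch, not the WarpX/Hi-PACE API: the three 2nd-order
// (TSC) weights are stored in named scalars rather than an array, so no
// dynamic indexing is needed and nothing spills to local memory.
struct Shape2 {
    double w0, w1, w2; // weights for cells i, i+1, i+2
    int    i;          // index of the leftmost contributing cell
};

inline Shape2 compute_shape_2 (double x)
{
    // nearest cell center to the particle position x (in cell units)
    const int ic = static_cast<int>(std::floor(x + 0.5));
    const double d = x - ic; // signed distance from that cell center
    Shape2 s;
    s.w0 = 0.5 * (0.5 - d) * (0.5 - d);
    s.w1 = 0.75 - d * d;
    s.w2 = 0.5 * (0.5 + d) * (0.5 + d);
    s.i  = ic - 1;
    return s;
}
```

Since every weight is a distinct variable, the consumer (gather/deposition) must also be written without a runtime index, e.g. as three explicit terms or a compile-time-unrolled loop.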
Another issue is that in the PushPX kernel, the shape order `nox` and `galerkin_interpolation` are runtime parameters, meaning that all possible shape factors have to be compiled into the same kernel. This could be fixed by extending the compile-time options (CTOs) to cover `nox` and `galerkin_interpolation`.
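The dispatch pattern this refers to can be sketched roughly as follows: the runtime value is mapped onto a template parameter once, so each shape order gets its own fully specialized instantiation. The names (`gather_kernel`, `dispatch_gather`) are illustrative only; the actual AMReX/WarpX CTO machinery is more general than this hand-written switch:

```cpp
#include <stdexcept>

// Hedged sketch of compile-time-option dispatch. A real kernel compiled
// this way would have all (nox+1)-sized loops with compile-time bounds;
// here the kernel just returns its order so the pattern is observable.
template <int nox>
int gather_kernel ()
{
    return nox; // stand-in for the specialized gather code
}

inline int dispatch_gather (int nox)
{
    // one instantiation per supported shape order
    switch (nox) {
        case 1: return gather_kernel<1>();
        case 2: return gather_kernel<2>();
        case 3: return gather_kernel<3>();
        default: throw std::invalid_argument("unsupported shape order");
    }
}
```

The trade-off is binary size and compile time: every supported combination of options is compiled separately, but each resulting kernel sees only constants.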
Finally, when looking at the assembly, keep in mind that instructions like add and move are much cheaper compared to loading and storing global (and thread local) memory.
Yes, that is correct. By a dynamic index I mean the index used where the array is read (not where it is defined), as in https://github.com/ECP-WarpX/WarpX/blob/5b34b84dfc588c8f6cf8088e94cd884e850344f7/Source/Particles/Gather/FieldGather.H#L369-L377, for example `ix` in `sx_ex[ix]`. This is a special case where the loop bounds are known at compile time, so if the compiler fully unrolls the loop, `ix` is also known at compile time and local memory can be avoided; however, unrolling the loop requires many more registers. I also suspect that because the array can be swapped at initialization (`(ex_type[zdir] == NODE) ? sz_node : sz_cell`), it has to use local memory anyway.
Hi @AlexanderSinn, I agree with you.
At run time only two of the arrays are actually used, e.g. `sz_node` and `sz_cell_v`. But at compile time the compiler does not know which pair will be selected, so it has to allocate space for all four arrays: `sz_node`, `sz_cell`, `sz_node_v`, `sz_cell_v`. Because there are not enough registers, some of these arrays can only be placed in local memory.
https://github.com/ECP-WarpX/WarpX/blob/5b34b84dfc588c8f6cf8088e94cd884e850344f7/Source/Particles/Gather/FieldGather.H#L162-L195
In addition, if the compiler knows the loop trip count, it will unroll the loop, which increases register usage and causes greater register pressure.
https://github.com/ECP-WarpX/WarpX/blob/5b34b84dfc588c8f6cf8088e94cd884e850344f7/Source/Particles/Gather/FieldGather.H#L369-L377
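The compile-time unrolling mentioned above can be sketched with a small helper in the spirit of a `constexpr_for` (illustrative, not the AMReX implementation): the loop body is expanded at compile time, so every element index is a constant and no local-memory indexing is needed, at the cost of holding more values live in registers at once:

```cpp
#include <type_traits>

// Illustrative compile-time loop: recursively instantiates the body for
// indices 0..N-1, passing each index as a std::integral_constant so it
// is a compile-time constant inside the callable.
template <int N, typename F>
constexpr void constexpr_for (F&& f)
{
    if constexpr (N > 0) {
        constexpr_for<N - 1>(f);
        f(std::integral_constant<int, N - 1>{});
    }
}
```

Usage: `constexpr_for<4>([&](auto i){ acc += w[i.value]; });` — here `w[i.value]` uses a compile-time index, so `w` never needs to be addressable.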
The input file I used is automated_test_1_uniform_rest_32ppc, with the cell configuration added as follows:
max_step = 100
amr.n_cell = 128 128 192
amr.max_grid_size = 256
amr.blocking_factor = 32
Local memory accesses can be reduced by putting the arrays into shared memory. The pseudocode is modified roughly as follows (assuming depos_order = 3, so each of the four arrays holds 4 entries, i.e. 16 reals per thread; threads_per_block must be a compile-time constant):
__shared__ amrex::Real sshared[(depos_order + 1) * 4 * threads_per_block];
amrex::Real * sz_node   = &sshared[threadIdx.x * 16 + 0];
amrex::Real * sz_cell   = &sshared[threadIdx.x * 16 + 4];
amrex::Real * sz_node_v = &sshared[threadIdx.x * 16 + 8];
amrex::Real * sz_cell_v = &sshared[threadIdx.x * 16 + 12];
The performance improvement is only 3%.
Still, shared memory was useful for reducing local memory usage, so I am closing this issue.
Kernel: PushPX generates local memory. We want to analyze the causes of this local memory usage and how to optimize it. As shown in the figure below, these arrays produce local memory:
amrex::Real sx_node_galerkin[depos_order + 1 - galerkin_interpolation] = {0._rt};
amrex::Real sx_cell_galerkin[depos_order + 1 - galerkin_interpolation] = {0._rt};
They are used to compute the shape factor: https://github.com/ECP-WarpX/WarpX/blob/5b34b84dfc588c8f6cf8088e94cd884e850344f7/Source/Particles/ShapeFactors.H#L55-L70