ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

profile and optimize PIConGPU #2280

Open psychocoderHPC opened 7 years ago

psychocoderHPC commented 7 years ago

It would be useful to profile PIConGPU to find the current bottleneck methods.

possible tool for CPU

This result was created on my laptop with an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz and PR #2275:

[attached image: profiling call graph (out)]

btw: our function names are too long (up to 10k characters :-), therefore we need to strip the names with gprof2dot.

BenjaminW3 commented 7 years ago

I think what can be seen in the attached image is that allocating memory with new is slow, but we all knew this before ;-) For each variable in block shared memory, a separate allocation is performed. As far as I can tell, CUDA compilers have an advantage here, because they can sum up the size of all block shared memory at compile time. I will create an optimization ticket in alpaka. Edit: #409
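The compile-time idea above can be illustrated with a minimal sketch (this is not the actual alpaka or CUDA implementation; the name `totalSharedBytes` is hypothetical): when the types of all block-shared variables are known at compile time, their total size is a compile-time constant, so a single allocation can back all of them. Note that this naive sum ignores alignment padding between variables.

```cpp
#include <cstddef>

// Hypothetical sketch: sum the sizes of all block-shared variables at
// compile time, so one allocation can serve the whole block. A real
// implementation would also account for per-variable alignment padding.
template<typename... Ts>
constexpr std::size_t totalSharedBytes = (std::size_t{0} + ... + sizeof(Ts));

// The sum is usable anywhere a constant expression is required.
static_assert(totalSharedBytes<int, double> == sizeof(int) + sizeof(double));
```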

psychocoderHPC commented 7 years ago

@BenjaminW3 You are right, but I am not sure this is really the bottleneck. I played around with it last week and changed the allocation to a single allocation of e.g. 64kb, then incremented only a byte pointer each time a shared variable is allocated. This did not increase the performance. But it was only a 5 minute hack and I need to validate it a bit better.
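The hack described above amounts to a bump allocator. A minimal sketch of the idea follows (the class name and interface are hypothetical, not the code actually used in the experiment): allocate one backing block up front, then serve each shared-variable request by advancing a byte offset instead of calling new.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bump allocator: one up-front allocation, then hand out
// aligned sub-ranges by advancing a byte offset. No per-request new/delete.
class BumpAllocator
{
    std::vector<std::byte> m_storage; // single backing block (e.g. 64 KiB)
    std::size_t m_offset = 0;         // next free byte in the block

public:
    explicit BumpAllocator(std::size_t bytes = 64 * 1024) : m_storage(bytes)
    {
    }

    // Return `bytes` bytes aligned to `alignment`, or nullptr if exhausted.
    void* allocate(std::size_t bytes, std::size_t alignment = alignof(std::max_align_t))
    {
        // Round the current offset up to the requested alignment.
        std::size_t const aligned = (m_offset + alignment - 1) / alignment * alignment;
        if(aligned + bytes > m_storage.size())
            return nullptr;
        m_offset = aligned + bytes;
        return m_storage.data() + aligned;
    }
};
```

Each allocation is just an offset bump, so the per-variable cost drops to a few arithmetic operations; the trade-off is that individual variables cannot be freed, which matches the lifetime of block-shared memory anyway.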

psychocoderHPC commented 7 years ago

After adding this patch I get a performance boost of 4% on my laptop.

Profile picture with patch:

[attached image: profiling call graph with patch (out)]

[edit: updated image (the previous image contained a pusher patch)]