psychocoderHPC opened this issue 7 years ago
I think what can be seen in the attached image is that allocating memory with `new` is slow, but we all knew that before ;-) For each block shared memory variable, a separate allocation is performed. As far as I can tell, CUDA compilers have an advantage here because they can sum up the sizes of all block shared memory variables at compile time. I will create an optimization ticket in alpaka. Edit: #409
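For illustration, a minimal CUDA sketch (a hypothetical kernel, not taken from PIConGPU) of the kind of static shared memory declarations a CUDA compiler can aggregate at compile time:

```cuda
// Hypothetical kernel, meant to be launched with 256 threads per block.
// Both __shared__ variables below have sizes known at compile time, so the
// compiler can sum them up and reserve one combined shared memory region
// per block -- no runtime allocation per variable is needed.
__global__ void staticSharedExample(float* out)
{
    __shared__ float cache[256]; // 1024 bytes, size fixed at compile time
    __shared__ int offset;       // 4 bytes, size fixed at compile time

    if(threadIdx.x == 0)
        offset = 1;
    cache[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();

    out[threadIdx.x] = cache[threadIdx.x] + static_cast<float>(offset);
}
```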
@BenjaminW3 You are right, but I am not sure this is really the bottleneck. I played around with it last week and changed the allocation to a single upfront allocation of e.g. 64 KB, then only incremented a byte pointer each time a shared variable is allocated. This did not increase the performance. But it was only a five-minute hack and I need to validate it a bit better.
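A minimal host-side sketch of what such a hack could look like; the `BumpAllocator` name and interface are hypothetical and not alpaka's actual API:

```cuda
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical sketch of the hack described above: one upfront 64 KB
// allocation; afterwards every "shared variable" only bumps a byte pointer.
class BumpAllocator
{
public:
    explicit BumpAllocator(std::size_t capacity = 64 * 1024)
        : m_pool(new std::uint8_t[capacity]), m_capacity(capacity)
    {
    }

    // Replaces the per-variable `new`: hand out the next chunk of the pool.
    // `alignment` must be a power of two.
    void* allocate(std::size_t numBytes, std::size_t alignment = alignof(std::max_align_t))
    {
        std::size_t const aligned = (m_offset + alignment - 1u) & ~(alignment - 1u);
        if(aligned + numBytes > m_capacity)
            return nullptr; // pool exhausted; real code would need a fallback
        m_offset = aligned + numBytes;
        return m_pool.get() + aligned;
    }

    // Reuse the same pool for the next block, without freeing the memory.
    void reset()
    {
        m_offset = 0;
    }

private:
    std::unique_ptr<std::uint8_t[]> m_pool;
    std::size_t m_capacity;
    std::size_t m_offset = 0;
};
```

Each shared variable request would then go through `allocate` instead of a separate `new`, and `reset` would be called once per block iteration.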
After adding this patch I get a performance boost of 4% on my laptop.
Profiling picture with the patch:
[edit: updated image (the previous image contained a pusher patch)]
It would be useful to profile PIConGPU to find the current bottleneck methods.
Possible tool for CPU profiling: gprof.

Compile with the `-pg` flag, then run:

```
./picongpu -d 1 1 1 -g 64 64 64 -s 50 --periodic 1 1 1 -p 1
gprof ./picongpu gmon.out > prof.txt
cat ./prof.txt | gprof2dot.py -s | dot -Tpng -o out.png
```
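Note: gprof requires the code to be both compiled and linked with `-pg`; assuming PIConGPU's usual CMake build, this could be passed in e.g. via `-DCMAKE_CXX_FLAGS="-pg"` and `-DCMAKE_EXE_LINKER_FLAGS="-pg"`.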
This result was created on my laptop with an
Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
and the PR #2275.

Btw: our function names are too long (up to 10k characters :-), therefore we need to strip the names with gprof2dot (the `-s` option above).