psychocoderHPC opened this issue 7 years ago
I think what can be seen in the attached image is that allocating memory with `new` is slow, but we all knew that before ;-) For each block shared memory variable, a separate allocation is performed. As far as I can tell, CUDA compilers have an advantage here because they can sum up the sizes of all block shared memory variables at compile time. I will create an optimization ticket in alpaka. Edit: #409
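For illustration, a minimal CUDA sketch (a hypothetical kernel, not taken from PIConGPU) of the kind of static shared memory declarations a CUDA compiler can aggregate at compile time:

```cuda
// Hypothetical kernel, meant to be launched with 256 threads per block.
// Both __shared__ variables below have sizes known at compile time, so the
// compiler can sum them up and reserve one combined shared memory region
// per block -- no runtime allocation per variable is needed.
__global__ void staticSharedExample(float* out)
{
    __shared__ float cache[256]; // 1024 bytes, size fixed at compile time
    __shared__ int offset;       // 4 bytes, size fixed at compile time

    if(threadIdx.x == 0)
        offset = 1;
    cache[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();

    out[threadIdx.x] = cache[threadIdx.x] + static_cast<float>(offset);
}
```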
@BenjaminW3 You are right, but I am not sure this is really the bottleneck. I played around with it last week and changed the allocation to a single upfront allocation of e.g. 64 KB, then only incremented a byte pointer each time a shared variable is allocated. This did not increase the performance. But it was only a five-minute hack and I need to validate it a bit better.
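A minimal host-side sketch of what such a hack could look like; the `BumpAllocator` name and interface are hypothetical and not alpaka's actual API:

```cuda
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical sketch of the hack described above: one upfront 64 KB
// allocation; afterwards every "shared variable" only bumps a byte pointer.
class BumpAllocator
{
public:
    explicit BumpAllocator(std::size_t capacity = 64 * 1024)
        : m_pool(new std::uint8_t[capacity]), m_capacity(capacity)
    {
    }

    // Replaces the per-variable `new`: hand out the next chunk of the pool.
    // `alignment` must be a power of two.
    void* allocate(std::size_t numBytes, std::size_t alignment = alignof(std::max_align_t))
    {
        std::size_t const aligned = (m_offset + alignment - 1u) & ~(alignment - 1u);
        if(aligned + numBytes > m_capacity)
            return nullptr; // pool exhausted; real code would need a fallback
        m_offset = aligned + numBytes;
        return m_pool.get() + aligned;
    }

    // Reuse the same pool for the next block, without freeing the memory.
    void reset()
    {
        m_offset = 0;
    }

private:
    std::unique_ptr<std::uint8_t[]> m_pool;
    std::size_t m_capacity;
    std::size_t m_offset = 0;
};
```

Each shared variable request would then go through `allocate` instead of a separate `new`, and `reset` would be called once per block iteration.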
After adding this patch I get a performance boost of 4% on my laptop.
Profiling picture with the patch:
[edit: updated image (the previous image contained a pusher patch)]
It would be useful to profile PIConGPU to find the current bottleneck methods.
Possible tool for CPU profiling: gprof.

Compile with the `-pg` flag, then run:

```
./picongpu -d 1 1 1 -g 64 64 64 -s 50 --periodic 1 1 1 -p 1
gprof ./picongpu gmon.out > prof.txt
cat ./prof.txt | gprof2dot.py -s | dot -Tpng -o out.png
```
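Note: gprof requires the code to be both compiled and linked with `-pg`; assuming PIConGPU's usual CMake build, this could be passed in e.g. via `-DCMAKE_CXX_FLAGS="-pg"` and `-DCMAKE_EXE_LINKER_FLAGS="-pg"`.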
This result was created on my laptop with an
Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
and the PR #2275.

Btw: our function names are too long (up to 10k characters :-), therefore we need to strip the names with gprof2dot (the `-s` option above).