Closed scarlehoff closed 5 years ago
it will need to be always hard-coded for OpenACC to be happy.
what do you mean exactly by that?
You cannot have a library that takes a pointer to a function (as in the OpenMP version); instead, you have to know which function you are integrating at compile time.
Furthermore, that function needs to be declared as a parallelizable routine with the appropriate OpenACC pragma (something along the lines of `#pragma acc routine <mode>`), where `<mode>` tells the compiler how the function should be parallelized: should it run entirely within each thread? Should it be given to a group of threads that break it down even further?
`<mode>` is a rabbit hole which I only understand very superficially, but I hope to understand it properly once I write the MC in pure CUDA and am forced to do everything by hand.
Now, I was thinking: OpenACC is probably not the fastest option on GPU, but as long as the integrands can be streamlined into one single big function with no external calls (and, up to LHAPDF, this is true for MCFM) it can be used easily. This means we could even benchmark things like MCFM.
This should not be a priority but it is something to keep in mind for the future.
First working version.
Right now the integrand is hard-coded within the loop. It can be taken outside (surrounded by the right #pragma calls) but it will always need to be hard-coded for OpenACC to be happy.
Provided there are no bugs, this (in its current version) will be the baseline for the CUDA comparison. As the CUDA version improves (for instance, by moving the generation of the random numbers or the refinement of the grid inside the GPU) I will modify this version so we can always compare hand-written CUDA to pragmatic* OpenACC.
My guess is that there will be no difference until we get to very specific features, such as libraries highly optimized for NVIDIA cards, that CUDA will be able to use while OpenACC will not.
*hehe