TO DO: Benchmarks

Here's a preliminary plan:

Set up a simple toolchain
1. familiarise ourselves with CUDA
2. familiarise ourselves with FeynArts/FormCalc
3. generate the expression for the simplest process, Drell-Yan at leading order (DY LO), using FeynArts/FormCalc
4. convert the expression into CUDA C++
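
As a rough idea of what step 4 could produce, here is a minimal sketch of a squared matrix element written so that the same function compiles as plain C++ (for CPU cross-checks) and as a host/device function under nvcc. It is not FormCalc output. It assumes massless fermions and photon exchange only (no Z), using the textbook spin- and colour-averaged result 2 e^4 Q_q^2 (t^2 + u^2) / (3 s^2). The names MCGPU_HD and dy_lo_me2_photon are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper macro: under nvcc the function is usable on host and
// device; with a plain C++ compiler the macro expands to nothing.
#ifdef __CUDACC__
#define MCGPU_HD __host__ __device__
#else
#define MCGPU_HD
#endif

// Spin- and colour-averaged |M|^2 for q qbar -> gamma* -> l+ l-.
// Assumptions: massless fermions, photon exchange only (no Z boson).
//   s, t, u       Mandelstam invariants (s > 0, s + t + u = 0)
//   quark_charge  quark charge in units of e (e.g. 2/3 for an up quark)
//   alpha         electromagnetic coupling
MCGPU_HD inline double dy_lo_me2_photon(double s, double t, double u,
                                        double quark_charge, double alpha)
{
    assert(s > 0.0);
    const double pi = 3.14159265358979323846;
    const double e2 = 4.0 * pi * alpha;  // e^2 = 4 pi alpha
    const double e4q2 = e2 * e2 * quark_charge * quark_charge;
    return 2.0 * e4q2 * (t * t + u * u) / (3.0 * s * s);
}
```

Keeping the expression in a function that builds for both CPU and GPU would let the same source serve as the reference when checking the GPU values later on.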
Test the toolchain
1. generate a few phase space points (1, 10, 100, 1000, ...)
2. load them onto the GPU
3. calculate in parallel the squared matrix elements
4. make sure the values are correct
5. check how efficient this is: how many points should be processed in parallel to be efficient, can we make full use of all cores, what's the memory limitation, what's the penalty for the CPU <-> GPU communication, etc.
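
Steps 1, 2 and 5 above could start from something like the following host-side sketch. It generates a batch of massless 2 -> 2 final-state momenta in the partonic centre-of-mass frame, stores them as a struct of arrays (one array per component, which is the layout that usually lets neighbouring GPU threads read neighbouring addresses), and reports how many bytes would have to be copied to the device. The fixed sqrt(s), the flat sampling in cos(theta) and phi, and the names MomentumBatch, generate_batch and batch_bytes are all assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Struct-of-arrays batch: entry i of each array belongs to point i.
struct MomentumBatch {
    std::vector<double> e, px, py, pz;
};

// Generate n massless outgoing momenta p3 at fixed sqrt(s), flat in
// cos(theta) and phi. The second outgoing momentum is p4 = (E, -p3).
MomentumBatch generate_batch(std::size_t n, double sqrt_s, unsigned seed)
{
    const double pi = 3.14159265358979323846;
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> flat(0.0, 1.0);

    MomentumBatch b;
    b.e.resize(n);
    b.px.resize(n);
    b.py.resize(n);
    b.pz.resize(n);

    const double energy = 0.5 * sqrt_s;  // each massless particle carries sqrt(s)/2
    for (std::size_t i = 0; i < n; ++i) {
        const double cos_theta = 2.0 * flat(rng) - 1.0;
        const double sin_theta = std::sqrt(1.0 - cos_theta * cos_theta);
        const double phi = 2.0 * pi * flat(rng);
        b.e[i] = energy;
        b.px[i] = energy * sin_theta * std::cos(phi);
        b.py[i] = energy * sin_theta * std::sin(phi);
        b.pz[i] = energy * cos_theta;
    }
    return b;
}

// Number of bytes a host-to-device copy of this batch would transfer.
std::size_t batch_bytes(const MomentumBatch& b)
{
    return sizeof(double) * (b.e.size() + b.px.size() + b.py.size() + b.pz.size());
}
```

With this layout one point costs 4 doubles (32 bytes), so scanning the batch size over 1, 10, 100, 1000, ... gives the transfer volume directly, and it can be compared against measured copy and kernel times.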
Build the remaining parts for a full Monte Carlo
1. decide whether to generate the phase-space (PS) points on the GPU and apply the cuts there, or to communicate only the PS points that passed the cuts
2. write the rest of the Monte Carlo
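
For the option of communicating only the points that passed, the selection step might look like the sketch below. It applies a single transverse-momentum cut and returns the indices of the surviving points, so that only those would need to be transferred. The cut itself (pT > pt_min on one lepton) and the name passing_indices are assumptions for illustration; the real set of cuts is not fixed in this plan.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Return the indices i for which sqrt(px[i]^2 + py[i]^2) > pt_min.
// px and py hold the transverse components of one lepton per point.
std::vector<std::size_t> passing_indices(const std::vector<double>& px,
                                         const std::vector<double>& py,
                                         double pt_min)
{
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < px.size() && i < py.size(); ++i) {
        if (std::hypot(px[i], py[i]) > pt_min) {
            kept.push_back(i);
        }
    }
    return kept;
}
```

The ratio kept.size() / px.size() is the cut efficiency; if it is small, filtering before the transfer saves bandwidth, while a value close to one would favour generating and cutting directly on the GPU.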
Improve the toolchain
1. Try virtual matrix elements; this will require a CUDA loop library
2. ... ?

N3PDF / mcgpu