generate the expression for the simplest process (DY LO) using FeynArts/FormCalc
convert the expression into CUDA C++
Test the toolchain
generate a few phase space points (1, 10, 100, 1000, ...)
load them onto the GPU
calculate in parallel the squared matrix elements
make sure the values are correct
check how efficient this is: how many points should be processed in parallel to be efficient, can we make full use of all cores, what's the memory limitation, what's the penalty for the CPU <-> GPU communication, etc.
Build the remaining parts for a full Monte Carlo
generate phase space on the GPU and cut them there or communicate the PS points that passed?
write the rest of the Monte Carlo
Improve the toolchain
Try virtual matrix elements; this will require a CUDA loop library