With our new emphasis on incorporating grain size to ensure that more meaningful computation takes place in the parallel_for(), we are essentially using the serial version here (just over a smaller set of trapezoids). Unfortunately, this duplicates code/logic, which could come back to haunt us if we ever want to change it (even though our algorithm is correct and stable, given its relatively straightforward nature). It could well be that the compiler needs to see all of this code (twice) to generate different versions of host and accelerator-specific code.
I wonder if we can try a couple of quick experiments (at some point) to gain a better understanding of why we need to do this.
Experiment 1 would be to try using a function for the sequential version (within the same translation unit, main.cpp) and see whether it can be called from the parallel_for(). The second (slightly less desirable) option is to use a macro. I suspect that this would work because the compiler could statically analyze the code after macro expansion and generate the specific targets independently.
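To make Experiment 1 concrete, here is a rough sketch of the shape I have in mind. It is plain C++ rather than SYCL: the `for` loop over chunks only stands in for the body of the parallel_for(), and all names (`f`, `trapezoid_range`, `integrate_serial`, `integrate_chunked`, `grain`) are hypothetical, not what is in main.cpp today. My understanding is that SYCL's single-source model lets a kernel call an ordinary function defined in the same translation unit, but this sketch does not test that; that is exactly what the experiment would confirm.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical integrand; the real one in main.cpp may differ.
inline double f(double x) { return x * x; }

// The one shared piece of logic: the sequential trapezoid rule over
// trapezoids [first, last), each of width dx, with the interval starting at a.
inline double trapezoid_range(double a, double dx,
                              std::size_t first, std::size_t last) {
    double area = 0.0;
    for (std::size_t i = first; i < last; ++i) {
        const double x0 = a + static_cast<double>(i) * dx;
        const double x1 = x0 + dx;
        area += 0.5 * dx * (f(x0) + f(x1));
    }
    return area;
}

// Serial version: a single call covering all n trapezoids.
inline double integrate_serial(double a, double b, std::size_t n) {
    const double dx = (b - a) / static_cast<double>(n);
    return trapezoid_range(a, dx, 0, n);
}

// Chunked version: each chunk handles up to `grain` trapezoids (grain > 0 assumed).
// In the SYCL program, the loop over c would be the parallel_for(), and the
// call to trapezoid_range() would sit inside the kernel lambda.
inline double integrate_chunked(double a, double b,
                                std::size_t n, std::size_t grain) {
    const double dx = (b - a) / static_cast<double>(n);
    const std::size_t chunks = (n + grain - 1) / grain;  // round up
    std::vector<double> partial(chunks, 0.0);
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t first = c * grain;
        const std::size_t last = (first + grain < n) ? first + grain : n;
        partial[c] = trapezoid_range(a, dx, first, last);
    }
    double total = 0.0;
    for (double p : partial) total += p;  // host-side reduction of partial sums
    return total;
}
```

If the SYCL compiler accepts the equivalent call inside the kernel, the serial and parallel paths would share `trapezoid_range()` and the duplication goes away without needing the macro fallback.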
We can live with things the way they are, so this is not urgent. I think knowing the answer will also help us explain to readers how SYCL is really doing its thing.