I'm opening this issue because I was thinking that, given that OpenCL is the only option that works on all devices, we should drop CUDA altogether.
However, according to this paper, https://arxiv.org/pdf/1005.2581.pdf, CUDA performs notably better even when using basically the same code. So I would suggest focusing on OpenCL until everything is working, and then porting the final version to CUDA for comparison.
(opening this as an issue mainly to have a reference and not forget about it)