Open tkoskela opened 2 years ago
TODO: Rewrite the memory movement using buffers and accessors, which could be more performant.
The memory movement is most likely caused by high register pressure, with registers spilling into local memory. NCU profiling shows that the CUDA code uses 48 registers per thread, while the SYCL code uses 116 when compiled with DPC++ and 168 when compiled with hipSYCL.
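To put those register counts in perspective, here is a rough sketch of how many threads could be resident on one streaming multiprocessor if registers were the only limit. The hardware limits (65,536 32-bit registers and 2,048 threads per SM) are assumptions that are not taken from this issue and should be checked against the actual device. The name `max_resident_threads` is hypothetical. The estimate ignores register allocation granularity, shared memory and block size, and it does not model spilling itself, so treat it as an upper bound rather than a profiler result.

```cpp
#include <algorithm>

// ASSUMED hardware limits (not from the issue); check the target GPU's specs.
constexpr int kRegistersPerSM  = 65536;  // 32-bit registers per SM (assumption)
constexpr int kMaxThreadsPerSM = 2048;   // resident-thread cap per SM (assumption)

// Upper bound on resident threads per SM when only the register file limits
// occupancy. Ignores allocation granularity, shared memory and block size.
constexpr int max_resident_threads(int regs_per_thread) {
    return std::min(kMaxThreadsPerSM, kRegistersPerSM / regs_per_thread);
}

// Register counts reported by NCU in the comment above.
constexpr int kCudaRegs    = 48;   // original CUDA build
constexpr int kDpcppRegs   = 116;  // SYCL built with DPC++
constexpr int kHipsyclRegs = 168;  // SYCL built with hipSYCL

// Compile-time checks of the resulting bounds under the assumed limits.
static_assert(max_resident_threads(kCudaRegs)    == 1365, "48 regs/thread");
static_assert(max_resident_threads(kDpcppRegs)   == 564,  "116 regs/thread");
static_assert(max_resident_threads(kHipsyclRegs) == 390,  "168 regs/thread");
```

Under these assumptions, the DPC++ build could keep roughly 40% as many threads resident per SM as the CUDA build, and the hipSYCL build under 30%. That would be consistent with register pressure being worth investigating alongside the memory copies.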
As noted in UCL/openqcd-oneapi#14, there appears to be unnecessary memory movement in the ported SYCL code. It should be possible to match the memory copies in the original CUDA implementation.
List of tasks