UCL / openqcd-oneapi

GNU General Public License v2.0

Optimize memory movement #1

Open tkoskela opened 2 years ago

tkoskela commented 2 years ago

As noted in UCL/openqcd-oneapi#14, there appears to be unnecessary memory movement in the ported SYCL code. It should be possible to match the memory copies in the original CUDA implementation.

List of tasks

tkoskela commented 2 years ago

TODO: Rewriting the memory movement with buffers and accessors could be more performant.

tkoskela commented 2 years ago

The memory movement is most likely caused by high register pressure, with registers spilling into local memory. NCU profiling shows the CUDA code using 48 registers per thread, while the dpcpp-compiled SYCL code uses 116 and the hipsycl-compiled SYCL code uses 168.
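
To put the three register counts in rough perspective, here is a back-of-the-envelope sketch of how registers per thread bound the number of threads that can be resident on one SM. The hardware limits are assumptions (65536 32-bit registers and 2048 resident threads per SM, as on several NVIDIA data-center GPUs). The GPU used for the profiling is not stated here. The helper `max_resident_threads` is hypothetical, and the estimate ignores register allocation granularity and block-size rounding, so real occupancy would be lower. It also says nothing about spilling itself, only about how much register budget each variant consumes.

```cpp
#include <algorithm>
#include <cassert>

// Assumed hardware limits, not taken from the profiling output.
constexpr int kRegistersPerSM = 65536;  // 32-bit registers per SM (assumption)
constexpr int kMaxThreadsPerSM = 2048;  // resident thread limit per SM (assumption)

// Hypothetical helper: upper bound on resident threads per SM when every
// thread needs `regs_per_thread` registers. Ignores allocation granularity.
int max_resident_threads(int regs_per_thread) {
    assert(regs_per_thread > 0);
    return std::min(kMaxThreadsPerSM, kRegistersPerSM / regs_per_thread);
}
```

Under these assumptions, 48 registers per thread allows at most 1365 resident threads per SM, 116 allows 564, and 168 allows 390.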