eyalroz / cuda-kat

CUDA kernel author's tools
BSD 3-Clause "New" or "Revised" License

Specialize functions with many reads/writes for sub-4-byte element types #38

Open eyalroz opened 4 years ago

eyalroz commented 4 years ago

We have many templated functions which make a (potentially) large number of reads from or writes to memory, and which therefore benefit from coalescing their memory operations. However, most of them, if not all, are not specialized for element types smaller than 4 bytes, so each thread moves fewer than 4 bytes per memory access and the functions are slower than they could be. Examples include copying, filling, and appending to global memory.

We should add specializations for these cases.
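
To make the idea concrete, here is a rough sketch of what such a specialization might look like for a fill. It is a host-side C++ model of the per-thread indexing only, not cuda-kat code: `replicate_into_word`, `fill_as_one_thread` and the `thread_index` / `num_threads` parameters are hypothetical names standing in for a kernel's grid-stride loop. It assumes a trivially-copyable element type of 1 or 2 bytes, and writes the 4-byte-aligned middle of the buffer one 32-bit word at a time, leaving the unaligned head and the leftover tail to element-wise writes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not part of cuda-kat): repeat a 1- or 2-byte value
// until it fills a 32-bit word.
template <typename T>
std::uint32_t replicate_into_word(T value)
{
    static_assert(sizeof(T) == 1 || sizeof(T) == 2, "only 1- or 2-byte types");
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    std::uint32_t word;
    unsigned char* bytes = reinterpret_cast<unsigned char*>(&word);
    for (std::size_t i = 0; i < sizeof(word); i += sizeof(T)) {
        std::memcpy(bytes + i, &value, sizeof(T));
    }
    return word;
}

// Hypothetical model of the work one thread would do in a grid-stride fill.
// Assumes `data` is at least aligned to sizeof(T).
template <typename T>
void fill_as_one_thread(T* data, std::size_t length, T value,
                        std::size_t thread_index, std::size_t num_threads)
{
    assert(num_threads > 0 && thread_index < num_threads);
    constexpr std::size_t word_size = sizeof(std::uint32_t);
    constexpr std::size_t elements_per_word = word_size / sizeof(T);

    // Elements before the first 4-byte-aligned address form the "head".
    std::size_t misalignment = reinterpret_cast<std::uintptr_t>(data) % word_size;
    std::size_t head = (misalignment == 0) ? 0 : (word_size - misalignment) / sizeof(T);
    if (head > length) { head = length; }

    std::size_t num_words = (length - head) / elements_per_word;
    std::size_t tail_start = head + num_words * elements_per_word;

    // Aligned bulk: one full word per iteration. memcpy keeps this sketch free
    // of aliasing problems; in a kernel this would ideally be one 32-bit store.
    std::uint32_t word = replicate_into_word(value);
    unsigned char* bulk = reinterpret_cast<unsigned char*>(data + head);
    for (std::size_t w = thread_index; w < num_words; w += num_threads) {
        std::memcpy(bulk + w * word_size, &word, word_size);
    }

    // Head and tail: at most a few elements each, written element-wise.
    for (std::size_t i = thread_index; i < head; i += num_threads) {
        data[i] = value;
    }
    for (std::size_t i = tail_start + thread_index; i < length; i += num_threads) {
        data[i] = value;
    }
}
```

Copying is harder than filling, since the source and destination may be misaligned by different amounts, so this sketch should not be read as covering that case.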