eyalroz / cuda-kat

CUDA kernel author's tools
BSD 3-Clause "New" or "Revised" License

Add ld.XY wrapper to builtins and PTX where necessary #62

Open eyalroz opened 4 years ago

eyalroz commented 4 years ago

We currently have the builtin load_global_with_non_coherent_cache(), which translates to the CUDA intrinsic __ldg(), which in turn translates into one of the PTX instructions starting with ld.global.nc (which one depends on the size of the loaded type). However, there are other ld.global instructions with different cache policies, and these merit a presence in builtins as well. Some of them already have CUDA-supplied "intrinsics"; others may need functions under the ptx/ subdirectory.
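To make the proposal a bit more concrete, here is a rough sketch of what such wrappers might look like. The wrapper names and the `sketch` namespace are hypothetical (only load_global_with_non_coherent_cache() exists in cuda-kat today). The intrinsics `__ldcg()`, `__ldca()` and `__ldcs()` are CUDA-supplied and correspond to the PTX cache operators `.cg`, `.ca` and `.cs`; newer toolkits also document `__ldlu()` and `__ldcv()`. The minimum architecture and toolkit version for each should be checked against the CUDA Programming Guide. The last function shows the inline-PTX route for a single 32-bit case, assuming the pointer refers to global memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical names, modeled on load_global_with_non_coherent_cache();
// none of these wrappers are part of cuda-kat as of this issue.
namespace sketch {

// ld.global.nc: load through the non-coherent (read-only) cache.
// Works for the types for which CUDA provides an __ldg() overload.
template <typename T>
__device__ __forceinline__ T load_global_with_non_coherent_cache(const T* ptr)
{
    return __ldg(ptr);
}

// ld.global.cg: cache at the global level (L2 and below, bypassing L1).
template <typename T>
__device__ __forceinline__ T load_global_cached_at_global_level(const T* ptr)
{
    return __ldcg(ptr);
}

// ld.global.ca: cache at all levels.
template <typename T>
__device__ __forceinline__ T load_global_cached_at_all_levels(const T* ptr)
{
    return __ldca(ptr);
}

// ld.global.cs: streaming load, for data likely to be accessed only once.
template <typename T>
__device__ __forceinline__ T load_global_streaming(const T* ptr)
{
    return __ldcs(ptr);
}

// ld.global.cv via inline PTX: do not use a cached copy, fetch again.
// Assumption: ptr points into global memory; only the 32-bit unsigned
// case is shown, other sizes would need their own type suffix and
// register constraint.
__device__ __forceinline__ unsigned load_global_uncached(const unsigned* ptr)
{
    unsigned result;
    asm volatile("ld.global.cv.u32 %0, [%1];"
                 : "=r"(result)
                 : "l"(ptr)
                 : "memory");
    return result;
}

} // namespace sketch

// Illustrative kernel: copies input to output using a streaming load,
// since each input element is read exactly once.
__global__ void copy_streaming(const float* __restrict__ in,
                               float* __restrict__ out,
                               int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sketch::load_global_streaming(in + i);
    }
}
```

The templates simply forward to the overloaded intrinsics, so they only compile for types CUDA already supports there; a real implementation under ptx/ would probably need one inline-asm function per size and type, as with the `.cv` example.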