We currently have the builtin `load_global_with_non_coherent_cache()`, which translates to the CUDA intrinsic `__ldg()`, which in turn compiles to one of the PTX instructions starting with `ld.global.nc` (depending on the size of the loaded value). However, there are other `ld.global` instructions with different cache policies, which merit a presence in `builtins` as well. Some of them already have CUDA-supplied "intrinsics"; others may need wrapper functions under the `ptx/` subdirectory.
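As a rough sketch of what such additions might look like: the first two wrappers below rely on the CUDA-supplied cache-hint intrinsics `__ldcg()` and `__ldcs()`, and the third shows how an inline-PTX fallback could be written where no intrinsic is available. The namespace and all function names are hypothetical, chosen only to mirror the existing builtin's naming, and only the `float` case is shown. The PTX variant assumes 64-bit pointers and that `ptr` refers to global memory.

```cuda
#include <cuda_runtime.h>

namespace sketch { // hypothetical namespace, not part of the library

// Cache at L2 only, bypassing L1: ld.global.cg
__device__ __forceinline__ float load_global_with_l2_only_cache(const float* ptr)
{
    return __ldcg(ptr);
}

// Streaming load, data likely to be accessed once: ld.global.cs
__device__ __forceinline__ float load_global_streaming(const float* ptr)
{
    return __ldcs(ptr);
}

// Inline-PTX form of the same kind of load, for policies lacking an intrinsic.
// Assumes 64-bit addressing ("l" constraint) and a global-memory pointer.
__device__ __forceinline__ float load_global_cg_via_ptx(const float* ptr)
{
    float value;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(value) : "l"(ptr));
    return value;
}

} // namespace sketch

// Usage: copy a buffer that is read exactly once, using the streaming policy.
__global__ void copy_streaming(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { out[i] = sketch::load_global_streaming(in + i); }
}
```

A full version would presumably cover the other element sizes and the remaining policies (`ld.global.ca`, `ld.global.lu`, `ld.global.cv`), for which recent CUDA versions also supply `__ldca()`, `__ldlu()` and `__ldcv()`.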