Add ld.XY wrapper to builtins and PTX where necessary

We currently have the builtin load_global_with_non_coherent_cache(), which translates to the CUDA intrinsic __ldg(), which in turn translates into one of the PTX instructions starting with ld.global.nc (which one depends on the size of the loaded type). But there are other ld.global instructions with different cache policies, and these merit a presence in builtins as well. Some of them already have CUDA-supplied "intrinsics"; others may need functions under the ptx/ subdirectory.
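To make the request concrete, here is a rough sketch of what such wrappers might look like for a single 32-bit type. The namespace and function names (sketch, load_global_cg, etc.) are hypothetical and are not the library's existing or planned API. The sketch assumes a 64-bit build and that the pointer refers to global memory. The cache operators used (.cg, .cs) are taken from the PTX ISA; a real implementation would have to cover the other operators (.ca, .lu, .cv) and other sizes, and could use intrinsics such as __ldcg() or __ldcs() where the CUDA version in use supplies them.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical namespace and names, for illustration only.
namespace sketch {

// Existing route: __ldg() becomes an ld.global.nc.* instruction
// (needs compute capability 3.5 or later).
__device__ __forceinline__ int load_global_nc(const int* ptr)
{
    return __ldg(ptr);
}

// ld.global.cg: cache at the global level (L2 and below, not L1).
// Assumes ptr points into global memory and pointers are 64 bits wide.
__device__ __forceinline__ int load_global_cg(const int* ptr)
{
    int value;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(value) : "l"(ptr) : "memory");
    return value;
}

// ld.global.cs: cache streaming, for data likely to be accessed only once.
// Same assumptions as above.
__device__ __forceinline__ int load_global_cs(const int* ptr)
{
    int value;
    asm volatile("ld.global.cs.s32 %0, [%1];" : "=r"(value) : "l"(ptr) : "memory");
    return value;
}

} // namespace sketch

// Loads each element through all three wrappers; writes the value if they
// agree, and -1 otherwise.
__global__ void copy_with_policies(const int* in, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) { return; }
    int a = sketch::load_global_nc(in + i);
    int b = sketch::load_global_cg(in + i);
    int c = sketch::load_global_cs(in + i);
    out[i] = (a == b && b == c) ? a : -1;
}

// Host-side check: returns true if the device output matches the input.
bool run_demo()
{
    const int n = 256;
    const size_t bytes = n * sizeof(int);
    std::vector<int> host_in(n), host_out(n, 0);
    for (int i = 0; i < n; ++i) { host_in[i] = 3 * i + 1; }

    int* dev_in = nullptr;
    int* dev_out = nullptr;
    if (cudaMalloc(&dev_in, bytes) != cudaSuccess) { return false; }
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) { cudaFree(dev_in); return false; }

    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);
    copy_with_policies<<<(n + 127) / 128, 128>>>(dev_in, dev_out, n);
    // This copy also waits for the kernel and reports any launch failure.
    cudaError_t status = cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return status == cudaSuccess && host_out == host_in;
}
```

Calling run_demo() from a main() compiled with nvcc should be enough to try this out. All three loads return the same data; they differ only in the caching hint given to the hardware, so any benefit would show up in profiling rather than in results.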

eyalroz / cuda-kat #62