AdaptiveCpp / AdaptiveCpp

Implementation of SYCL and C++ standard parallelism for CPUs and GPUs from all vendors: The independent, community-driven compiler for C++-based heterogeneous programming models. Lets applications adapt themselves to all the hardware in the system - even at runtime!
https://adaptivecpp.github.io/
BSD 2-Clause "Simplified" License

Ensure adaptivecpp sycl works with any spirv vulkan target #1097

Open hgkamath opened 11 months ago

hgkamath commented 11 months ago

Describe the motivation for the feature request
AdaptiveCpp SYCL should be able to adaptively compile a feature subset of SYCL C++ to any given Vulkan version level. It is desirable to have a graceful reduction of C++ features by Vulkan version level. One could also introduce less performant emulation of missing features, such as shared memory. AdaptiveCpp would then be able to target lower Vulkan versions such as 1.1 and 1.0.
This will allow AdaptiveCpp SYCL to be used on a much broader range of hardware that may not have all the most recent tech-specs.

I ask this because

  1. to find out whether, on Windows, AdaptiveCpp could use a SPIR-V approach on an NVIDIA card without using CUDA
  2. to find out whether, on Linux, a nouveau/nvk/vulkan-1.1/SPIR-V target is possible (in the near future)

AFAICT, the following two merges will happen in the near future for the Linux kernel and the Mesa project, respectively.

I could be wrong.

I read
https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/compilation.md
In it, it seems like the below path is documented as only for Intel GPUs: clang-based-flow -> input sycl code -> clang-sycl-pass-experimental -> SPIR-V -> output binary

Is the SPIR-V pathway specified only for Intel because Intel is a newcomer to the GPU space without custom middleware,
and is SPIR-V the base case for any GPU without special AMD/NVIDIA middleware?
Am I correct in predicting that it should work for any accelerator that provides a Vulkan/SPIR-V target?
Or is it the case that code/features need to be added to AdaptiveCpp to make this work?

Describe the solution you'd like
Make Open SYCL work on any Vulkan/SPIR-V target.

This way CUDA drivers don't need to be present on either OS, one does not have to code in CUDA/nvcc, and the resulting binary would work on GPUs from any vendor.

If applicable, describe alternatives you've considered
N/A

Additional context
On Windows 10, NVIDIA has declared end of support, with the last NVIDIA driver version being 425.31. So for the Kepler mobile GT 740M on Windows, the final Vulkan version is 1.1.97, per vulkaninfo output. On Linux, when nvkm lands, the claimed final Vulkan version for Kepler-era cards may also be limited to 1.2, though later GPUs will have higher versions supported [1].

Ref

Please let me know what you think, and fill in the gaps in my understanding.

illuhad commented 11 months ago

The SPIR-V we generate is in principle compatible with any valid SPIR-V compute environment. For example, we could also use it to target OpenCL implementations that support SPIR-V (and unified shared memory) if there is a use case.

Vulkan uses a different SPIR-V dialect, namely the SPIR-V shader model. It is not directly compatible with the compute model that OpenCL uses. Generating shader SPIR-V is in principle possible to some extent (see e.g. the Sylkan project) but non-trivial and some SYCL functionality may not work.
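The split between the two dialects is visible right in the module header. A rough illustration in `spirv-dis`-style disassembly (the entry-point names here are invented for the example):

```
; OpenCL-flavored "kernel" SPIR-V -- what compute runtimes consume
OpCapability Kernel
OpCapability Addresses
OpMemoryModel Physical64 OpenCL
OpEntryPoint Kernel %my_kernel "my_kernel"

; Vulkan-flavored "shader" SPIR-V -- what Vulkan consumes, even for compute
OpCapability Shader
OpMemoryModel Logical GLSL450
OpEntryPoint GLCompute %main "main"
OpExecutionMode %main LocalSize 64 1 1
```

A Vulkan driver will reject a module declaring the Kernel capability and OpenCL memory model, which is why translating compute SPIR-V to Vulkan (as Sylkan attempts) is real translation work, not a simple re-targeting.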

Kepler-era GPUs additionally are so old that they don't really support many core SYCL features well, such as unified shared memory. Even on Linux with the CUDA backend, there are already limitations.

This way CUDA drivers don't need to be present on either OS, and one does not have to code in CUDA-nvcc-lang.

We already do not require nvcc.

In it, it seems like the below path is documented as only for intel-GPU: clang-based-flow -> input sycl code -> clang-sycl-pass-experimental -> SPIR-V -> output binary

Our production SPIR-V/Intel support does not go through any clang experimental SYCL passes, but through our own generic single-pass compiler (--opensycl-targets=generic).
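As a concrete sketch of that flow, the generic single-pass compiler is selected purely with the flag named above; `vecadd.cpp` is a placeholder source file, and the exact driver name depends on the release (the toolchain of this era shipped `syclcc`; newer AdaptiveCpp releases rename the driver to `acpp` and the flag to `--acpp-targets`):

```sh
# Hypothetical build invocation: compile the SYCL source once with the
# generic single-pass compiler; the runtime then lowers to the backend
# available on the machine at execution time.
syclcc -O2 --opensycl-targets=generic -o vecadd vecadd.cpp
./vecadd
```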

hgkamath commented 11 months ago

Firstly, thank you for your reply and to developers who contribute to AdaptiveCpp.

The below are just some collected info regarding Kepler and unified-shared-memory.

This llvm bug [1] seems to imply unified-shared-memory is possible and already does exist for kepler (sm_30, sm_35, sm_37), and also for fermi (sm_20).

There is a chance that this SPIR-V solution might work on Linux.
For the Windows driver situation, the chances seem similar.

It would also be acceptable, if possible, to use a restricted form of coding that avoids program patterns whose SPIR-V output cannot be executed.

On Windows, vulkaninfo (full text file attached in [7]) reports the following.
The 2 GB of GPU memory does have VK_MEMORY_HEAP_DEVICE_LOCAL_BIT (as suggested to check in [5]).

```
:
        maxComputeSharedMemorySize              = 0xc000
:
:
VkPhysicalDeviceMemoryProperties:
=================================
    memoryHeapCount       = 2
    memoryHeaps[0] :
        size          = 2107179008 (0x7d990000) (1.96 GiB)
        flags:
            VK_MEMORY_HEAP_DEVICE_LOCAL_BIT
    memoryHeaps[1] :
        size          = 8554295296 (0x1fde03000) (7.97 GiB)
        flags:
            None
    memoryTypeCount       = 11
    memoryTypes[0] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[1] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[2] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[3] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[4] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[5] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[6] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[7] :
        heapIndex     = 0
        propertyFlags = 0x1:
            VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
    memoryTypes[8] :
        heapIndex     = 0
        propertyFlags = 0x1:
            VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
    memoryTypes[9] :
        heapIndex     = 1
        propertyFlags = 0x6:
            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
            VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
    memoryTypes[10] :
        heapIndex     = 1
        propertyFlags = 0xe:
            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
            VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
            VK_MEMORY_PROPERTY_HOST_CACHED_BIT
```

Ref:

  1. 20210421 [Clang][OpenMP] Allow unified_shared_memory for Pascal-generation GPUs. https://reviews.llvm.org/D101595
  2. Developing a Linux Kernel Module using GPUDirect RDMA Section 4.1. Basics of UVA CUDA Memory Management https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#basics-of-uva-cuda-memory-management
  3. NVidia GPU microarchitecture generations: fermi -> kepler -> maxwell ->pascal -> turing -> ampere -> ada https://en.wikipedia.org/wiki/Kepler_(microarchitecture)
  4. Matching sm architectures arch and gencode for various nvidia cards https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
  5. 201706 stackoverflow - vulkan on devices that share host memory https://stackoverflow.com/questions/44179130/vulkan-on-devices-that-share-host-memory
  6. techpowerup.com specs on GT 740M https://www.techpowerup.com/gpu-specs/geforce-gt-740m.c2299
  7. vulkaninfo full-info file using vulkaninfo tool for GT740m on Win10 20230726_vulkaninfo.txt
  8. OpenCL full-info file using AMD clinfo tool on Win10 (243736 bytes, md5:d483d667eb915ddb54491843e0a214ce link )
    NVIDIA: OpenCL 1.2 CUDA 10.1.131, INTEL: OpenCL 1.2
    20230802_clinfo_amd.txt
illuhad commented 11 months ago

This llvm bug [1] seems to imply unified-shared-memory is possible and already does exist for kepler (sm_30, sm_35, sm_37), and also for fermi (sm_20).

No. The bug report says that these have unified virtual addressing (UVA), which is a different thing. The hardware lacks page-faulting support, so there will not be any fine-grained automatic memory migration. While it is possible to "emulate" unified memory by migrating entire allocations (this is also how USM is implemented on Windows) as mentioned in the bug, this will be inefficient and not a solution for practical programs. There are other limitations in Kepler-era hardware, such as more limited support for atomic operations.
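The cost of whole-allocation migration can be pictured with a small sketch in plain C++; `MigratedBuffer` and both migrate hooks are hypothetical names invented for illustration (not AdaptiveCpp or CUDA API), and `memcpy` stands in for the host/device transfers:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Sketch of "emulated" unified memory without page-faulting hardware:
// the runtime cannot migrate individual pages on access, so it must copy
// the ENTIRE allocation whenever ownership switches sides.
struct MigratedBuffer {
    std::vector<unsigned char> host;    // host copy
    std::vector<unsigned char> device;  // stand-in for a device allocation
    bool on_device = false;

    explicit MigratedBuffer(std::size_t bytes) : host(bytes), device(bytes) {}

    // Called by the runtime before any device kernel touches the buffer.
    void migrate_to_device() {
        if (!on_device) {
            std::memcpy(device.data(), host.data(), host.size()); // full copy
            on_device = true;
        }
    }

    // Called by the runtime before any host code touches the buffer.
    void migrate_to_host() {
        if (on_device) {
            std::memcpy(host.data(), device.data(), device.size()); // full copy
            on_device = false;
        }
    }
};
```

Even touching one byte on the other side pays for a full-allocation transfer, which is why this scheme is impractical for real programs compared to hardware page migration.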

On Vulkan, the situation is even worse, because for a long time Vulkan only had opaque (logical) pointers, which completely breaks sharing any data structure that contains pointers between host and device. This only changed when the bufferDeviceAddress feature became mandatory in Vulkan 1.3. So unless these Vulkan implementations move to newer Vulkan versions, it is not a realistic target.

And there are other requirements for Vulkan SPIR-V, such as structured control flow, which additionally limits the constructs that can be expressed and can be limiting for SYCL.

Again, Vulkan shader SPIR-V is not the same thing as SPIR-V for compute environments like OpenCL or Level Zero. SPIR-V actually defines two different execution models: The shader model and the kernel model. Vulkan only supports the shader model, not the kernel model that we need.

hgkamath commented 11 months ago

It seems, from what you tell me, that the SYCL-C++-SPIR-V route is very likely closed to me, despite being a good portable idea, and that I should not pin all my hopes on it.
So, I'll

Let me know, if you can think of anything else.

This feature request issue could still be of use to the AdaptiveCpp SYCL project, to ensure that the AdaptiveCpp SYCL SPIR-V route works for later-generation NVIDIA GPUs, once the nouveau/nvkm driver lands and subsequently catches up to the minimum required Vulkan version.

Ref

  1. Mesa zink driver: Gallium driver that emits Vulkan API calls instead of targeting a specific GPU architecture link
  2. 20230726 nouveau/mesa Karol Herbst: an open merge request that runs Mesa tests for VK_EXT_conditional_rendering, a Vulkan 1.1.80 feature link
  3. 20230802 (today!!!) nouveau/mesa Faith Ekstrand - nvk: Advertise Vulkan 1.1 link
  4. 2022 nouveau/mesa David Arlie - Add compute support link
  5. 202306 'OpenCL C 1.2 Language on Vulkan' from the Chromium project link github
  6. 202301 llvm spirv backend opaque pointers link
  7. 20170619 Mark Harris Unified Memory for CUDA Beginners, NVidia dev blog link
  8. 20161214 Nikolay Sakharnykh Beyond GPU Memory Limits with Unified Memory on Pascal, NVidia dev blog link
  9. 20204020 Khronos Offline Compilation of OpenCL Kernels into SPIR-V Using Open Source Tooling link
  10. AMD's HIP programming language user guide link
  11. 20211220 Phoronix LLVM's HIPSPV Coming Together For AMD HIP To SPIR-V For OpenCL Execution link
  12. cpc/hipcl: a library that allows applications using the HIP API to be run on devices which support OpenCL and SPIR-V github
  13. CHIP-SPV/chipstar: compiling and running HIP/CUDA to SPIR-V and run via OpenCL or Level Zero API github
  14. starpu-runtime/starpu: a heterogeneous framework for scheduling and offloading OpenCL-C github
  15. halide-lang: an embedded C++ language for fast, portable data-parallel computation link, github
  16. haskell-halide: haskell bindings to halide-lang link, github
  17. rusticl Mesa replacement for clover that compiles OpenCL to spirv and runs using vulkan link
  18. 20230213 Phoronix Mesa's Rusticl Lands Support For SPIR-V Programs link
  19. pocl: portable opencl has under development a vulkan backend link
  20. clspv: A prototype compiler for a subset of OpenCL C to Vulkan compute shaders github
  21. 20221004 Faith Ekstrand - Collabora.com Introducing nvk link
illuhad commented 11 months ago

As you say, vulkan-compute kernels are different from vulkan-shader-compute

Almost. Even compute shaders in Vulkan use the shader model to my knowledge. The SPIR-V kernel model is not supported by Vulkan, even for Vulkan compute shaders. You need OpenCL for that.

The route SYCL->SPIR-V->rusticl is much more realistic than Vulkan and could potentially be supported in near to mid-term future.

investigate other routes, Futhark-OpenCL-C route, OpenCL-C-spir-v-route [9]

There are already SYCL implementations that support OpenCL, so no need to look elsewhere if this solves your problem. As I've said we could add an OpenCL backend in the near future if there is a use case.

AMD HIP [10], but ensuring resulting binary can work with Intel or NVidia GPUs (even be it CUDA) without recompilation. Ex HIPSPV [11], cpc/hipcl [12], CHIPSPV/Chipstar [13]

HIP has exactly the same problems as SYCL, and then some. As you say, creating a portable binary is not something it was built to do. HIP's NVIDIA support will also go through CUDA.

starpu-runtime/starpu [14]

To my knowledge, starpu is a runtime system for automatic work distribution. It's not a compiler, so I don't see how it would solve your issue.

al42and commented 11 months ago

There are already SYCL implementations that support OpenCL, so no need to look elsewhere if this solves your problem.

If you're talking about IntelLLVM/DPC++, here's the (almost) current state of things: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9061. A lot of features are not supported due to Rusticl limitations, and you have to apply some hacks because IntelLLVM generates SPIR-V that is not fully standard-compliant, but in the end you can get it to run some examples.

illuhad commented 11 months ago

@al42and Yes, I was referring to DPC++. Thanks for the pointer! I had assumed it was further along than that. USM problems are a killer for us, because we rely on it heavily (even for buffers, which use USM device allocations; shared allocations are not so critical).

(And I always get very sad when people assume that SYCL == DPC++ as in this post :/ )