Ensure adaptivecpp sycl works with any spirv vulkan target

Describe the motivation for the feature request AdaptiveCpp SYCL should be able to adaptively compile a feature subset of SYCL CPP to any given vulkan version level. It is desirable to have a graceful reduction of cpp features, by vulkan version level. One could also introduce less performant emulation of missing features, like shared memory. Thus AdaptiveCpp will be able to target lower vulkan versions such as 1.1 and 1.0 .
This will allow AdaptiveCpp SYCL to be used on a much broader range of hardware that may not have all the most recent tech-specs.

I ask this because

to find out if on windows, AdaptiveCpp could do a SPIRV approach on nvidia card. without using CUDA
to find out if on linux, a nouveau/nvk/vulkan-1.1/spirv target is possible ( in near future)

AFAICT, the following two merges will happen in near future for linux-kernel and mesa-project respectively.

A linux kernel feature patch (GPUVA VM_BIND UAPI) that enables GPU virtual addresses in kernel nouveau driver This is needed for mesa-nvkm 20230720 Danilo Krummrich - [PATCH drm-misc-next v8 00/12] DRM GPUVA Manager & Nouveau VM_BIND UAPI
https://lore.kernel.org/dri-devel/20230720001443.2380-1-dakr@redhat.com/T/#m88d67c231b63b7340b22dc2a4d1102eed840df00
A mesa-project feature patch (mesa-nvkm) that enables vulkan/rusticl via the nouveau driver Dave Airlie, Draft: enable support for new uapi features https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/150/

I could be wrong.

I read
https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/compilation.md
In it, it seems like the below path is documented as only for intel-GPU

clang-based-flow
-> input sycl code
-> clang-sycl-pass-experimental
-> SPIR-V
-> output binary

Is it that the SPIRV pathway is specified only for Intel, as intel is a new comer into the GPU space without custom middleware?
and that it SPIR-V is the base-case for any GPU without any special AMD/NVIDIA middleware?
Am I correct in predicting that it should work for any accelerator that provides a vulkan/SPIRV target?
Or is it the case that code/features needs to be added to make this work in AdaptiveCpp?

Describe the solution you'd like make open-sycl work on

On windows, make binary work via SPIRV on kepler-era nvidia GT740m using vulkan drivers without CUDA
On linux, make binary work via SPIRV on kepler-era nvidia GT740m using nouveau/nvk/vulkan/spirv

This way CUDA drivers don't need to be present on either OS., and one does not have to code in CUDA-nvcc-lang, and resulting binary would work on GPU from any vendor.

If applicable, describe alternatives you've considered NA

Additional context On win10, NVIDIA has declared End of support, with last nvidia driver version being 425.31. So the kepler mobile GT-740m windows final vulkan version is 1.1.97, as per vulkaninfo cmd output. On linux, when nvkm lands, claimed final vulkan version for the kepler-era cards may also be limited to 1.2, though later GPUs will have higher versions supported [1]

Ref

[1] 202209 mesa-project merge request Karol Herbst - support Kepler, Maxwell and Pascal
https://gitlab.freedesktop.org/nouveau/mesa/-/merge_requests/92#note_1545458
[2] 20230726 phoronix - NVK Merge Request Opened For Landing Open-Source NVIDIA Vulkan Driver In Mesa
https://www.phoronix.com/news/NVK-Merge-Request-Mesa

Please let know what you think, and fill gaps in my understanding.

The SPIR-V we generate is in principle compatible with any valid SPIR-V compute environment. For example, we could also use it to target OpenCL implementations that support SPIR-V (and unified shared memory) if there is a use case.

Vulkan uses a different SPIR-V dialect, namely the SPIR-V shader model. It is not directly compatible with the compute model that OpenCL uses. Generating shader SPIR-V is in principle possible to some extent (see e.g. the Sylkan project) but non-trivial and some SYCL functionality may not work.

Kepler-era GPUs additionally are so old that they don't really support many core SYCL features well, such as unified shared memory. Even on Linux with the CUDA backend, there are already limitations.

This way CUDA drivers don't need to be present on either OS., and one does not have to code in CUDA-nvcc-lang.

We already do not require nvcc.

n it, it seems like the below path is documented as only for intel-GPU clang-based-flow -> input sycl code -> cylang-sycl-pass-experimental -> SPIR-V -> output binary

Our production SPIR-V/Intel support does not go through any clang experimental SYCL passes, but through our own generic single-pass compiler (--opensycl-targets=generic).

Firstly, thank you for your reply and to developers who contribute to AdaptiveCpp.

The below are just some collected info regarding Kepler and unified-shared-memory.

This llvm bug [1] seems to imply unified-shared-memory is possible and already does exist for kepler (sm_30, sm_35, sm_37), and also for fermi (sm_20).

There is the chance that this spirv solution might work on Linux.
For the windows driver situation, the chances seems similar.

Its also okay, to do, if possible, a restricted form of coding that avoids program-patterns whose spirv output cannot be executed.

On windows, vulkaninfo (full text file is attached in [7]) has the following info.
The 2GB GPU memory does have VK_MEMORY_HEAP_DEVICE_LOCAL_BIT (suggested to check in [5])

:
        maxComputeSharedMemorySize              = 0xc000
:
:
VkPhysicalDeviceMemoryProperties:
=================================
    memoryHeapCount       = 2
    memoryHeaps[0] :
        size          = 2107179008 (0x7d990000) (1.96 GiB)
        flags:
            VK_MEMORY_HEAP_DEVICE_LOCAL_BIT
    memoryHeaps[1] :
        size          = 8554295296 (0x1fde03000) (7.97 GiB)
        flags:
            None
    memoryTypeCount       = 11
    memoryTypes[0] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[1] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[2] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[3] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[4] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[5] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[6] :
        heapIndex     = 1
        propertyFlags = 0x0:
    memoryTypes[7] :
        heapIndex     = 0
        propertyFlags = 0x1:
            VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
    memoryTypes[8] :
        heapIndex     = 0
        propertyFlags = 0x1:
            VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
    memoryTypes[9] :
        heapIndex     = 1
        propertyFlags = 0x6:
            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
            VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
    memoryTypes[10] :
        heapIndex     = 1
        propertyFlags = 0xe:
            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
            VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
            VK_MEMORY_PROPERTY_HOST_CACHED_BIT

Ref:

20210421 [Clang][OpenMP] Allow unified_shared_memory for Pascal-generation GPUs. https://reviews.llvm.org/D101595
Developing a Linux Kernel Module using GPUDirect RDMA Section 4.1. Basics of UVA CUDA Memory Management https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#basics-of-uva-cuda-memory-management
NVidia GPU microarchitecture generations: fermi -> kepler -> maxwell ->pascal -> turing -> ampere -> ada https://en.wikipedia.org/wiki/Kepler_(microarchitecture)
Matching sm architectures arch and gencode for various nvidia cards https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
201706 stackoverflow - vulkan on devices that share host memory https://stackoverflow.com/questions/44179130/vulkan-on-devices-that-share-host-memory
techpowerup,com specs on GT740m https://www.techpowerup.com/gpu-specs/geforce-gt-740m.c2299
vulkaninfo full-info file using vulkaninfo tool for GT740m on Win10 20230726_vulkaninfo.txt
OpenCL full-info file using AMD clinfo tool on Win10 (243736 bytes, md5:d483d667eb915ddb54491843e0a214ce link )
NVIDIA: OpenCL 1.2 CUDA 10.1.131, INTEL: OpenCL 1.2
20230802_clinfo_amd.txt

This llvm bug [1] seems to imply unified-shared-memory is possible and already does exist for kepler (sm_30, sm_35, sm_37), and also for fermi (sm_20).

No. The bug report says that these have unified virtual addressing (UVA) which is a different thing. The hardware lacks page-faulting support, so there will not be any fine-grained automatic memory migration. While it is possible to "emulate" unified memory by migrating entire allocations (this is also how USM is implemented on Windows) as mentioned in the bug, this will be inefficient and not a solution for practical programs. There are other limitations in Kepler-era hardware, such as a more limited support for atomic operations.

On Vulkan, the situation is even worse because for a long time Vulkan only had opaque pointers, which completely breaks sharing any data structure that contains pointers between host and device. This has only changed in Vulkan 1.4. So unless these Vulkan implementations move to newer Vulkan versions, it is not a realistic target.

And there are other requirements for Vulkan SPIR-V, such as structured control flow, which additionally limits the constructs that can be expressed and can be limiting for SYCL.

Again, Vulkan shader SPIR-V is not the same thing as SPIR-V for compute environments like OpenCL or Level Zero. SPIR-V actually defines two different execution models: The shader model and the kernel model. Vulkan only supports the shader model, not the kernel model that we need.

It seems like from what you tell me, the SYCL-C++-spir-v-route v is very likely closed to me, despite being a good portable idea, and not to pin all my hopes on it.
So, I'll

investigate other routes,
- plod through the 2+ language problem,
- Futhark-OpenCL-C route, OpenCL-C-spir-v-route [9] [20]
- AMD HIP [10], but ensuring resulting binary can work with Intel or NVidia GPUs (even be it CUDA) without recompilation. Ex HIPSPV [11], cpc/hipcl [12], CHIPSPV/Chipstar [13]
- starpu-runtime/starpu [14]
- halide-lang [15] and haskell-halide [16] has openCL, vulkan-1.0, vulkan-1.2 targets
- OpenCL via rusticl over vulkan [17] [18]
- OpenCL via pocl over vulkan 1.1 [19]
for sycl-C++-spir-v-route, now a bit dissuaded, but after nvkm lands, If I get to trying it, I'll just confirm what you said and document what I find.

Let me know, if you can think of anything else.

As you say, vulkan-compute kernels are different from vulkan-shader-compute and its possible that Mesa/Nouveau are first going to prioritize the shader kernels for graphics use-cases such as games and zink backend for opengl emulation [1] . Mesa/Nouveau is probably chasing their test suite [2] and will advertise/claim vulkan-1.1 on first-release [3]
So what I understood is that in most cases, when Vulkan is mentioned, the implicit context is vulkan-shaders, What I need is vulkan-compute, which is actually opencl that accepts a spirv-dialect. I wonder if compute support is coming included with mesa/nvk. [4]
Vulkan 1.4 isn't going to happen until 2024, as Khronos vulkan is presently on a 2 year cycle, and 1.3 was just released in 2022.
You're right, I misread the llvm bug. Only pascal got past the fence. Should've also read the articles linked in. [7][8] This can only work if the SYCL implementation also supports using the suggested "migration of entire allocations on kernel startup" prefetch method, thereby avoiding memory management and page-faults altogether,i.e. a SYCL implementation sans unified-shared-memory.
With regard to GPUs being able to operate at their rated top performance, amongst the older cards, the open source nouveau kernel driver can only reclock Fermi and Kepler GPUs. Later GPUs require secret proprietary driver code that nvidia does not release. The recently opensourced Nvidia gpuopen drivers resolves the situation for Turing+ GPUs by having a GSP chip, which is initialized at boot-time using a firmware blob, and the secret reclocking functionality is moved into the blob. Nouveau is presently learning from the Nvidia gpuopen drivers and will be adding support of GSP and reclocking for Turing+ [21]. Even so, that leaves Maxwell and Pascal GPUs in the lurch as without the ability to reclock, and so they can only function at less than 10% of max possible performance. So, Fermi and Kepler are the the only older GPUs that have Vulkan-1.1, have UVA and have reclock-ability, which makes them special and so by supporting them one has impact by reaching those users.

This feature req issue could still be of purpose to the AdaptiveCpp SYCL project, in order to ensure AdaptiveCpp SYCL spirv route works for later gen nvidia GPUs, when the nouveau/nvkm driver lands and subsequently catches up to the minimum required vulkan version.

Ref

Mesa zink driver: Gallium driver that emits Vulkan API calls instead of targeting a specific GPU architecture link
20230726 nouveau/mesa Karol Herbst An example of a open merge request that does runs mesa tests: VK_EXT_conditional_rendering, a vulkan 1.1.80 feature link
20230802 (today!!!) nouveau/mesa Faith-Extrand - nvk: Advertise Vulkan 1.1 link
2022 nouveau/mesa David Arlie - Add compute support link
202306 'OpenCL C 1.2 Language on Vulkan' from the Chromium project link github
202301 llvm spirv backend opaque pointers link
20170619 Mark Harris Unified Memory for CUDA Beginners, NVidia dev blog link
20161214 Nikolay Sakharnykh Beyond GPU Memory Limits with Unified Memory on Pascal, NVidia dev blog link
20204020 Kronos Offline Compilation of OpenCL Kernels into SPIR-V Using Open Source Tooling link
AMD's HIP programming language user guide link
20211220 Phoronix LLVM's HIPSPV Coming Together For AMD HIP To SPIR-V For OpenCL Execution link
cpc/hipcl: a library that allows applications using the HIP API to be run on devices which support OpenCL and SPIR-V github
CHIP-SPV/chipstar: compiling and running HIP/CUDA to SPIR-V and run via OpenCL or Level Zero API github
starpu-runtime/starpu: a heterogenous framework for scheduling and offloading OpenCL-C github
halide-lang: a embedded C++ language for fast, portable data-parallel computation link, github
haskell-halide: haskell bindings to halide-lang link, github
rusticl Mesa replacement for clover that compiles OpenCL to spirv and runs using vulkan link
20230213 Phoronix Mesa's Mesa's Rusticl Lands Support For SPIR-V Programs link
pocl: portable opencl has under development a vulkan backend link
clspv: A prototype compiler for a subset of OpenCL C to Vulkan compute shaders github
20221004 Faith Ekstrand - Collobora.com Introducing nvk link

As you say, vulkan-compute kernels are different from vulkan-shader-compute

Almost. Even compute shaders in Vulkan use the shader model to my knowledge. The SPIR-V kernel model is not supported by Vulkan, even for Vulkan compute shaders. You need OpenCL for that.

The route SYCL->SPIR-V->rusticl is much more realistic than Vulkan and could potentially be supported in near to mid-term future.

investigate other routes, Futhark-OpenCL-C route, OpenCL-C-spir-v-route [9]

There are already SYCL implementations that support OpenCL, so no need to look elsewhere if this solves your problem. As I've said we could add an OpenCL backend in the near future if there is a use case.

AMD HIP [10], but ensuring resulting binary can work with Intel or NVidia GPUs (even be it CUDA) without recompilation. Ex HIPSPV [11], cpc/hipcl [12], CHIPSPV/Chipstar [13]

HIP has exactly the same and much more problems than SYCL. As you say, creating a portable binary is not something it was built to do. HIP NVIDIA support will also go through CUDA.

starpu-runtime/starpu [14]

To my knowledge, starpu is a runtime system for automatic work distribution. It's not a compiler, so I don't see how it would solve your issue.

There are already SYCL implementations that support OpenCL, so no need to look elsewhere if this solves your problem.

If you're talking about IntelLLVM/DPC++, here's the (almost) current state of things: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9061. A lot of features are not supported due to Rusticl limitations, and you have to apply some hacks because IntelLLVM generates not fully standard-compliant SPIR-V, but in the end, you can get it to run some examples.

@al42and Yes, I was referring to DPC++. Thanks for the pointer! I had assumed it was farther than that. USM problems are a killer for us, because we are heavily relying on it (even for buffers, which uses USM device allocations. Shared allocations are not so critical).

(And I always get very sad when people assume that SYCL == DPC++ as in this post :/ )

AdaptiveCpp / AdaptiveCpp

Ensure adaptivecpp sycl works with any spirv vulkan target #1097

Ref:

Ref