AdaptiveCpp / AdaptiveCpp

Implementation of SYCL and C++ standard parallelism for CPUs and GPUs from all vendors: The independent, community-driven compiler for C++-based heterogeneous programming models. Lets applications adapt themselves to all the hardware in the system - even at runtime!
https://adaptivecpp.github.io/

JIT compilation abilities #940

Open philipturner opened 1 year ago

philipturner commented 1 year ago

In https://github.com/openmm/openmm/issues/3937, I was discussing a hypothetical hipSYCL backend for OpenMM. The framework uses JIT compilation extensively, mostly to implement custom force fields (injecting code blobs into otherwise already-known shaders). It also uses JIT to insert macros or compile-time constants at application run-time. With generic LLVM SSCP, is it possible to modify the generic LLVM IR before the backend translates it into device code? Or perhaps create new functionality for JIT compilation?

illuhad commented 1 year ago

In principle we can perform arbitrary transformations on the generic IR at runtime. However, this is currently mostly not exposed to users (and indeed SYCL itself does not define an API for things like inserting blobs or generic compiler transformations). What is already exposed is injecting runtime values as constants into IR.
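For illustration, a minimal sketch of what that exposed mechanism can look like, assuming the hipSYCL sycl::specialized<T> extension (the wrapper name and exact semantics are an assumption here, not a verified API reference):

#include <cstddef>
#include <sycl/sycl.hpp>

// Sketch (assumed extension API): wrapping a runtime value in
// sycl::specialized<T> marks it for injection as a constant into the
// generic IR, so the JIT can fold branches and arithmetic that use it.
void apply_flag(sycl::queue &q, float *forces, std::size_t n, int runtime_flag) {
  sycl::specialized<int> flag = runtime_flag; // becomes a JIT-time constant
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    if (flag != 0) // dead branch can be eliminated during JIT optimization
      forces[i] *= 0.5f;
  });
}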

philipturner commented 1 year ago

Is it possible to bundle syclcc within the client application, then invoke it to compile code at runtime?

illuhad commented 1 year ago

I mean sure, in theory you can embed an entire LLVM stack into an application ;) Let's maybe take a step back: what is the exact use case for this, and what do you hope to accomplish with it?

philipturner commented 1 year ago

@peastman would you be willing to explain the needs here? I'm not saying we'll actually implement the backend; rather, the answers could serve as guidance for the community to develop it externally.

peastman commented 1 year ago

OpenMM is a C++ library that is usually accessed through a Python wrapper. It consists of a public API combined with multiple implementations that are provided by plugins. We currently include implementations based on CUDA and OpenCL, as well as a CPU implementation. Other people have created plugins that provide HIP and Metal implementations.

We make extensive use of runtime kernel compilation to allow users to customize what the library computes. For example, a user might have a particular mathematical function they want to use for the interactions between atoms in a molecular dynamics simulation. They might write something like this:

force = CustomNonbondedForce('4*epsilon*((sigma/r)^12-(sigma/r)^6)')

That defines the interaction energy of each pair of atoms as a function of the distance r between them. At runtime we parse that expression to create an abstract syntax tree. Next we analytically differentiate it, since a simulation requires forces which are the derivative of the energy. Then we generate CUDA or OpenCL code to compute the force and energy, and insert that code into a larger kernel that loops over all pairs of atoms and accumulates the force on each one.
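To make that last step concrete, here is a hedged sketch of how the generated expressions might be spliced into a kernel skeleton (the skeleton and all names are illustrative, not OpenMM's actual templates):

#include <string>

// Sketch: splice runtime-generated energy/force expressions into a fixed
// OpenCL kernel skeleton; the result is handed to the runtime compiler.
std::string build_kernel_source(const std::string &energy_expr,
                                const std::string &force_expr) {
  return
    "__kernel void pair_interaction(__global const float *r,\n"
    "                               __global float *energy,\n"
    "                               __global float *force,\n"
    "                               float sigma, float epsilon) {\n"
    "  int i = get_global_id(0);\n"
    "  float rij = r[i];\n"
    "  energy[i] = " + energy_expr + ";\n"
    "  force[i] = " + force_expr + ";\n"
    "}\n";
}

// For the Lennard-Jones example above, the differentiation step might emit:
//   energy: 4*epsilon*(pown(sigma/rij,12) - pown(sigma/rij,6))
//   force:  4*epsilon*(12*pown(sigma/rij,12) - 6*pown(sigma/rij,6))/rij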

SYCL isn't really designed for this sort of thing, since it's based on ahead-of-time compilation rather than runtime compilation. @philipturner is asking whether there's some way it could be made to work.

illuhad commented 1 year ago

Thanks for those details :-)

SYCL isn't really designed for this sort of thing, since it's based on ahead-of-time compilation rather than runtime compilation. @philipturner is asking whether there's some way it could be made to work.

I think that's not entirely correct, strictly speaking. All SYCL implementations employ runtime compilation/JIT to various extents. The very first SYCL specifications were actually very prescriptive about that: SYCL would compile to SPIR or SPIR-V IR (I think OpenCL C was allowed too), which would then be JITted by an OpenCL implementation. Modern SYCL specifications are more flexible and allow more approaches to how things can work, particularly backends other than OpenCL.

Please correct me if I misunderstood, but I believe what you are actually asking for is a separate source model, where host and device code are kept separate such that device source code can be generated or modified at runtime. This is indeed a little tricky in SYCL, because SYCL is based on a single-source model, where device and host code share the same AST such that e.g. C++ templates work seamlessly across the host-device boundary.

If the output format of your program that should be runtime-compiled really has to be a high-level language like C++, I think one option to make it work could be the following:

- Declare the customizable functionality as a SYCL_EXTERNAL function that your precompiled kernels call.
- At runtime, generate the source implementing that function, write it to a temporary file, and invoke the compiler (which you would ship with your application) on that file to obtain its device IR.
- Link that IR into the kernel, so that the JIT stage can inline and optimize everything together.

I realize that it's maybe not the most pleasant thing compared to just invoking, say, hipRTC or nvRTC, but it should be something that is possible today. There might be some rough edges around the kernel cache and whether it recognizes that kernels need to be recompiled when a different file is linked into them, but that can all be solved.
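As a rough illustration of the idea, the runtime-generated file could look something like this (the function name and signature are made up for this sketch; the precompiled kernel would declare and call the same SYCL_EXTERNAL function):

// generated_force.cpp -- written to a temporary file at runtime, compiled
// with e.g. "syclcc -O0 -c", and its device IR linked into the kernel.
#include <sycl/sycl.hpp>

// The precompiled kernel side would contain the matching declaration:
//   SYCL_EXTERNAL float pair_energy(float r, float sigma, float epsilon);
SYCL_EXTERNAL float pair_energy(float r, float sigma, float epsilon) {
  float sr6 = sycl::pown(sigma / r, 6);
  return 4.0f * epsilon * (sr6 * sr6 - sr6); // 4*eps*((s/r)^12 - (s/r)^6)
}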

peastman commented 1 year ago

That is correct. Kernel source gets generated at runtime.

What you describe with writing the source code to a temp file is exactly what we used to do for CUDA before nvRTC was introduced.

illuhad commented 1 year ago

Ok. I guess if there is some demand for it we could look into creating some (hip)SYCLrtc library to make it more convenient.

One thing we would need to watch out for is compile times. Those tend to be quite long with SYCL (because of modern C++ and the single huge sycl.hpp header). I'm not sure what amount of runtime compile time your application can tolerate. If it's a problem, we could look into splitting up sycl.hpp to provide more modular component headers as an extension.

ex-rzr commented 1 year ago

OpenMM caches compiled binaries so compile time is a problem only for the first time when kernels for a particular simulation are being compiled.

philipturner commented 1 year ago

force = CustomNonbondedForce('4*epsilon*((sigma/r)^12-(sigma/r)^6)')

Another example might be from AlphaFold, which I'm hoping to port from Nvidia-only JAX to Apple Metal. Here we have a force and multiple labeled parameters. Not sure how far the scope of JIT-ed text extends beyond this.

  force = openmm.CustomExternalForce(
      "0.5 * k * ((x-x0)^2 + (y-y0)^2 + (z-z0)^2)")
  force.addGlobalParameter("k", stiffness)
  for p in ["x0", "y0", "z0"]:
    force.addPerParticleParameter(p)

Side note: this use case doesn't require mixed or double precision, yet performs energy minimization.

illuhad commented 1 year ago

OpenMM caches compiled binaries so compile time is a problem only for the first time when kernels for a particular simulation are being compiled.

hipSYCL does this already by itself, so if we built this, you would not have to do that yourself anymore ;) The question is if there's a limit to what amount of startup overhead is acceptable.

philipturner commented 1 year ago

The question is if there's a limit to what amount of startup overhead is acceptable.

Say you're running identical simulations in a loop. It might be okay if the first simulation ever performed lags 1 second, and the next few lag 1 millisecond.

illuhad commented 1 year ago

Got it. My point is that SYCL runtime compilation could easily take double the time of HIP or CUDA because the sycl.hpp header is so complex. So, would 2s initially still be acceptable? 3s? etc.

philipturner commented 1 year ago

So, would 2s initially still be acceptable? 3s? etc.

If the only penalty is a constant scaling factor, that's awesome! Do you have benchmarks of e.g. an empty file? By your statement that would take 0 × 2 = 0 seconds to compile.

peastman commented 1 year ago

Don't worry about it too much. A few seconds of extra startup time isn't a big deal for a simulation that will run for hours or days. There are situations where it's a problem, but that would just mean the SYCL backend wouldn't be the best choice for those particular situations. And caching helps a lot, since many kernels don't change from one simulation to the next.

philipturner commented 1 year ago

I think if I were to invest significant time into any OpenMM plugin (besides small OpenCL patches), going straight to SYCL would be more worthwhile. That would be real-world validation of hipSYCL's capabilities, and it would catch bugs in e.g. a syclcc distributed inside the client application. Just like GROMACS would be a proving ground for the hipSYCL Metal backend.

The only downside is that it wouldn't run on Intel Macs. But this is a long-term (years) prospect, by which point most people would be using M1/M2/etc.

illuhad commented 1 year ago

If the only penalty is a constant scaling factor, that's awesome! Do you have benchmarks of e.g. an empty file? By your statement that would take 0 × 2 = 0 seconds to compile.

I wish ;) There's going to be a constant offset because the sycl.hpp will always be there (assuming you need at least some SYCL functionality, such as the math functions).

Quick test: The following minimal file which exports a single, empty SYCL function

#include <sycl/sycl.hpp>

SYCL_EXTERNAL void myfunc() {}

takes several seconds (3-4) to compile[1] on my notebook. Of course, compile times will then increase only slowly if that function actually starts doing something useful. sycl.hpp pulls in tens of thousands of lines of code.

[1] with -O0 -c. -O0 is sufficient because the device code will be optimized at runtime during the JIT phase anyway.

peastman commented 1 year ago

For comparison, CUDA takes about a second per kernel. It looks like SYCL is significantly slower, but still workable.

sycl.hpp pulls in tens of thousands of lines of code.

Does the compiler support precompiled headers?

illuhad commented 1 year ago

Does the compiler support precompiled headers?

Just tried it, it seems to work. Roughly halves compile time for the example to 1.5-2s. Probably still spends a lot of time deducing that almost everything in that header is not needed. We don't use precompiled headers by default because hipSYCL supports many compilation models, which set different macros etc. -- but for a "SYCLrtc" library, we would probably only support the generic compiler anyway, so that is a reasonable approach.

EDIT: Looked at the timings; almost all of the time is still spent in the clang frontend, before it even reaches our SYCL code paths.

biergaizi commented 9 months ago

That is correct. Kernel source gets generated at runtime.

The separate-source model is a major strength of OpenCL from the perspective of quasi-JIT code generation. Because every OpenCL runtime has a compiler and a linker, it's possible to generate OpenCL code dynamically. In many kernels, a common situation is that some steps are needed while others are not, depending on the simulation setup. In OpenCL, the main program can simply generate a kernel on the fly and disable the unneeded calculations completely via an #ifdef flag, so there's zero runtime overhead.
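For concreteness, a minimal sketch of that pattern (host code trimmed to the essentials, error handling omitted, names illustrative):

#include <CL/cl.h>

// Sketch: the same kernel source is built with different -D options, so
// disabled steps are removed entirely by the OpenCL runtime compiler.
static const char *source = R"(
__kernel void step(__global float *field) {
  int i = get_global_id(0);
#ifdef ENABLE_DISPERSION
  field[i] += 0.1f; // only compiled in when the feature is requested
#endif
})";

cl_program build_step(cl_context ctx, cl_device_id dev, bool dispersion) {
  cl_int err;
  cl_program prog = clCreateProgramWithSource(ctx, 1, &source, nullptr, &err);
  clBuildProgram(prog, 1, &dev, dispersion ? "-DENABLE_DISPERSION" : "",
                 nullptr, nullptr);
  return prog;
}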

But in the single-source model of SYCL, this is no longer natively supported, making SYCL "less powerful" than OpenCL in a sense. JIT is still used, but it occurs only at the IR level. Implementing zero-overhead feature-toggling would mean generating IR instead of SYCL source code, which is beyond what most high-level developers are comfortable working with.

Currently, my planned workaround is to pre-compile all the possible feature combinations at compile time. This is practical for small kernels when the possible combinations number under 100 or 1000, but it's inelegant. If AdaptiveCpp could provide an official solution for runtime SYCL compilation, it would make SYCL much more powerful in many applications.

biergaizi commented 9 months ago

But in the single-source model of SYCL, this is no longer natively supported, making SYCL "less powerful" than OpenCL in a sense. [...] Implementing zero-overhead feature-toggling would mean generating IR instead of SYCL source code, which is beyond what most high-level developers are comfortable working with.

To be clear, in SYCL there's a special solution to this problem of enabling and disabling features, called specialization constants. They allow the program to declare code controlled by conditional variables, which can then be optimized out at runtime during JIT compilation. This way, features can be enabled or disabled without overhead.

The more general problem of code generation still requires runtime SYCL compilation.
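For reference, a minimal SYCL 2020 specialization-constant example (the API is standard; the toy kernel is made up):

#include <cstddef>
#include <sycl/sycl.hpp>

constexpr sycl::specialization_id<bool> use_dispersion;

void run_step(sycl::queue &q, float *data, std::size_t n, bool dispersion) {
  q.submit([&](sycl::handler &cgh) {
    // The runtime value becomes a constant during JIT compilation.
    cgh.set_specialization_constant<use_dispersion>(dispersion);
    cgh.parallel_for(sycl::range<1>{n},
                     [=](sycl::id<1> i, sycl::kernel_handler kh) {
      if (kh.get_specialization_constant<use_dispersion>())
        data[i] += 0.1f; // folded away at JIT time when the constant is false
    });
  });
}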

illuhad commented 9 months ago

To be clear, in SYCL there's a special solution to this problem of enabling and disabling features, called specialization constants. They allow the program to declare code controlled by conditional variables, which can then be optimized out at runtime during JIT compilation. This way, features can be enabled or disabled without overhead.

In the generic SSCP compiler, we can support this functionality independently of backend capabilities. The problem is that the SYCL API for it is overcomplicated, and detrimental in some cases.

I have some plans for runtime kernel construction in the generic SSCP compiler. The idea is to enable runtime composition of a kernel using codelets. It goes beyond what specialization constants provide. If you want separate source, the better idea might be to just use a separate source model like OpenCL.

fodinabor commented 9 months ago

I'll just add my two cents: what you're doing with the #ifdef at runtime is basically kernel specialization. You can achieve the same thing more elegantly in SYCL: use multiple template instantiations of your kernel, selecting the one to use at runtime with an if/else or switch statement. Inside the kernel, deactivate the parts you don't need by checking the (boolean) template arguments in an if constexpr.

Edit: Oops... I guess I only read half the issue ^^ Well, that's the way I'd go, but sure, there's a compile-time cost if you have a combinatorial explosion.
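A minimal sketch of that template approach (toy kernel, illustrative names):

#include <cstddef>
#include <sycl/sycl.hpp>

template <bool UseLJ, bool UseCoulomb>
void force_kernel(sycl::queue &q, float *f, std::size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    float acc = 0.0f;
    if constexpr (UseLJ) acc += 1.0f;      // stand-in for the LJ term
    if constexpr (UseCoulomb) acc += 2.0f; // stand-in for the Coulomb term
    f[i] = acc;
  });
}

// Runtime selection among the precompiled instantiations:
void dispatch(sycl::queue &q, float *f, std::size_t n, bool lj, bool coulomb) {
  if (lj && coulomb) force_kernel<true, true>(q, f, n);
  else if (lj)       force_kernel<true, false>(q, f, n);
  else if (coulomb)  force_kernel<false, true>(q, f, n);
  else               force_kernel<false, false>(q, f, n);
}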

biergaizi commented 9 months ago

Edit: Oops... I guess I only read half the issue ^^ Well, that's the way I'd go, but sure, there's a compile-time cost if you have a combinatorial explosion.

I've read that this kind of combinatorial explosion is somewhat unavoidable in the GPU world [1]; it's why many 3D engines take so long to compile shaders. So I'd accept this outcome, since it's not like there's another choice. Specialization constants can perhaps be used to reduce the size of the explosion somewhat.

[1] The Shader Permutation Problem - Part 1: How Did We Get Here? https://therealmjp.github.io/posts/shader-permutations-part1/

peastman commented 9 months ago

That's one of the reasons why runtime compilation is so important. A single file might have hundreds of possible combinations of flags that could be defined. But a single simulation only requires one of them, so it's fast and easy to compile the one version we need at runtime.

The other reason, as described above, is that we can generate completely arbitrary code based on user input. Even if we were OK with precompiling hundreds of versions of each kernel, it just isn't possible.

I don't understand why SYCL doesn't support runtime compilation. It's been a standard feature of GPU programming almost as long as GPUs have been programmable. OpenGL, DirectX, Vulkan, CUDA, OpenCL, and Metal all support runtime compilation. How could the designers of SYCL not have realized it's an important feature? (I don't expect an answer to that question. I'm just being grumpy!)

biergaizi commented 9 months ago

I don't understand why SYCL doesn't support runtime compilation. It's been a standard feature of GPU programming almost as long as GPUs have been programmable. [...] How could the designers of SYCL not have realized it's an important feature? (I don't expect an answer to that question. I'm just being grumpy!)

My understanding is that SYCL was originally meant to support runtime compilation via OpenCL, since the Khronos Group wanted SYCL to be a superset of OpenCL (SyCL originally meaning "System OpenCL") that runs cross-platform via SPIR-V bytecode (which is similar but not identical to Vulkan's SPIR-V) - this design can be seen in SYCL 1.2. Unfortunately, this path was eventually abandoned for multiple reasons. On Nvidia's side, OpenCL support has been lukewarm at best, with no support for OpenCL's SPIR-V extension, and it's often claimed that there are unresolved performance gaps between CUDA and OpenCL in certain applications. On AMD's side, AMD similarly refused to support OpenCL's SPIR-V extension, based on its technical decision to use native code (e.g. GCN assembly). Only Intel officially supported the OpenCL SPIR-V approach. In the free software world, Mesa has also recently added support for OpenCL SPIR-V on multiple GPU targets, but it's still highly experimental.

This situation forced the Khronos Group to change direction in SYCL 2020. SYCL is now seen as a programming framework in its own right, no longer based on OpenCL or any particular backend. It has become a wildcard, a free-for-all that assumes very little about the target platform, so a runtime compiler is no longer guaranteed to exist. It's now the compiler's responsibility to target each platform individually using ahead-of-time (AOT) compilation, similar to traditional C/C++ code on CPUs: on Nvidia it runs on top of CUDA, on AMD it runs on top of HIP, and so on. Thus it becomes rather difficult to do any runtime just-in-time (JIT) compilation, due to the lack of a standard intermediate bytecode or runtime - such as a SPIR-V compiler, or the compiler and linker that OpenCL traditionally requires an implementation to provide.

The fact that AdaptiveCpp is able to target multiple platforms via JIT (based on LLVM IR) in a single executable is already quite an accomplishment in itself - in fact, it's currently the only SYCL compiler in existence that is capable of doing so. In other SYCL compilers, each GPU target must be compiled separately, much as you would compile a CPU program for different CPU ISAs.

If both cross-platform support and runtime code generation are crucial, perhaps the only alternative to SYCL is Vulkan Compute. Vulkan was originally designed for 3D graphics workloads, but pure compute workloads are also supported, and it enjoys wide vendor-neutral hardware support.

Another theoretically possible solution is implementing SYCL on top of Vulkan Compute, which was explored in the academic paper "Sylkan: Towards a Vulkan Compute Target Platform for SYCL", but I don't think it's practical so far.

peastman commented 9 months ago

On Nvidia's side, OpenCL support has been lukewarm at best

Not even lukewarm. At least they support OpenCL 3.0, though they have very few optional extensions, so it's not much more than 1.2. Apple hasn't even bothered to do that. They're stuck at 1.2, and will probably never move beyond it. AMD supports more features, but their OpenCL runs at half the speed of HIP on the same hardware. And they also have little interest in doing anything about it.

Another theoretically possible solution is implementing SYCL on top of Vulkan Compute

POCL started creating an OpenCL built on top of Vulkan, but it's incomplete and now abandoned.

It feels like we have all the pieces needed for a cross-platform, high-quality OpenCL, if we could just fit them together. Mesa has an open-source implementation of the API, and (I think?) can compile OpenCL kernels to SPIR-V. LLVM can also compile them. And AdaptiveCpp provides a runtime for executing SPIR-V kernels.

illuhad commented 9 months ago

@biergaizi has summarized the situation and history quite well. We'd be in a different place if all vendors had pulled through with OpenCL, or rather, if OpenCL had not made some design errors in OpenCL 2.0 which limited adoption. On the other hand, conceiving OpenCL alone, without a single-source solution like SYCL to accompany it right from the start, was never going to be competitive with CUDA, as it left open a niche that a lot of programmers needed filled. Thus most people turned to CUDA.

AMD supports more features, but their OpenCL runs at half the speed of HIP on the same hardware. And they also have little interest in doing anything about it.

I hadn't heard about performance issues in AMD OpenCL before. Functionality-wise, the big stopper is that AMD OpenCL does not support SPIR-V, so there's no IR we could compile code to in order to run on AMD OpenCL. It also does not support a reasonably modern pointer-based memory management model like Intel's OpenCL USM extensions (which are supported not only by Intel; other implementations like pocl are also moving towards adopting them).

It feels like we have all the pieces needed for a cross-platform, high-quality OpenCL, if we could just fit them together. Mesa has an open-source implementation of the API, and (I think?) can compile OpenCL kernels to SPIR-V.

Yes, but I would not consider it production-ready yet.

The fact that AdaptiveCpp is able to target multiple platforms via JIT (based on LLVM IR) in a single executable is already quite an accomplishment in itself - in fact, it's currently the only SYCL compiler in existence that is capable of doing so. In other SYCL compilers, each GPU target must be compiled separately, much as you would compile a CPU program for different CPU ISAs.

Perhaps it might be possible to implement something like a SYCLrtc in the SSCP compiler. But the device code would have to look quite different from regular SYCL. Kernels would need to be expressed differently, local memory would need to be expressed differently, and probably accessors too, since those also come from the host side...

EDIT:

SyCL originally meaning "System OpenCL"

Where did you get that from? I'm not aware of any official statement anywhere that describes the origins of the SYCL name. We in the SYCL WG do not consider it to be an acronym. And while there was of course some discussion originally that ended at SYCL, to my knowledge that discussion did not involve "System OpenCL". Because of this, it's also definitely spelled SYCL, not SyCL.