Alternatives to using nvcc / cmake for codegen?

Balandat commented 3 years ago

We've been running into some trouble trying to use KeOps in heterogeneous / non-standard build environments, where we cannot expect or easily ensure that cmake are nvcc are available at runtime in all settings. This has come up a lot and really limits our ability to use KeOps in a less-than-ad-hoc way. Would it be possible to add support that uses alternative ways of setting up the build env?

For instance, one might be able to use libnvrtc for CUDA and NNC (with PyTorch backend) for CPU code?

Here are @ngimel's thoughts on this:

For cuda codegen, it would be great if they could use nvrtc instead of cmake/nvcc. Pytorch native gpu fuser is using this mechanism. Nvrtc is part of the cuda driver, thus is automatically present at runtime. I have not looked at Keops codegen in detail, my impression is they use a lot of template metaprogramming. nvrtc can be inconvenient to use with that, but nvidia provides a helper package, https://github.com/NVIDIA/jitify, to help compile templated code with nvrtc. I'm not sure what's the best path for cpu side codegen.

We would likely have to do some work on our end with NNC to make this work, but I'm curious what changes would be required on the KeOps side to enable this, how much work those would be, and whether you'd be willing to help out with making them happen. It would be awesome if this did "just work" in PyTorch without the user having to worry about their cmake/nvcc setups.

cc @dme65, @bertmaher

jeanfeydy commented 3 years ago

Hi @Balandat ,

Thanks for your remark, I believe that this a very good suggestion!

As noted by @ngimel, the current KeOps++ engine relies heavily on template meta-programming. This is a choice that made sense back in 2017 when we started working on a generic maths engine: it allowed us to express all the necessary "symbolic" computations with a C++-only code base in a way that is fairly well documented online and 100% independent from the scripting language (Python, Matlab, R, etc.).

Three years down the line, however, we are starting to feel the limitations of this method: the heavy reliance on nvcc/cmake and the long-ish compilation times are becoming less and less acceptable. Some of our hacky workarounds (e.g. the Pack.h headers to handle templated lists of variable with a C++11 compiler) have also become obsolete, with better solutions becoming supported by standard compilers.

With all of this in mind, we are targeting a full rewrite of the KeOps++ engine in the first half of 2021. @bcharlier and @joanglaunes (who are closer to compilation issues than I am) have started by streamlining the compilation process (esp. the PyBind11 linking) to reduce typical compilation times from 10s-20s to 3s-5s: their branch has just been merged into master.

Going further, we are indeed planning to replace the C++ templates by a "python-side" code generation. Nvcc isn't optimized to handle large recursive templates and we believe that switching to this new method would both accelerate compilation further and allow us to implement advanced optimizations (e.g. for Tensor cores) more easily. Once this is done, we'd be more than happy to see if we can ditch nvcc/cmake altogether and get genuine JIT compilation with modern tools!

Do you have first-hand experience with these new frameworks, or know anyone who would? To be honest, KeOps has been mostly motivated/developed in French academic circles and we haven't had the chance to meet any genuine expert in CUDA programming / JIT compilation just yet. We'd be very willing to work on these questions, but would need to have a good discussion about the capabilities and main limitations of these frameworks before implementing anything :-)

Best regards, Jean

Balandat commented 3 years ago

Hi @jeanfeydy,

Thanks for the comprehensive response, this all sounds very promising.

I am definitely not an expert on JIT or CUDA matters, but both @bertmaher and @ngimel either are or should at least be able to point us to the right folks to talk to.

Cheers, Max

bertmaher commented 3 years ago

If it's helpful to look at what we've been doing for codegen in the PyTorch JIT, check out the tensor expression JIT compiler (link). On the CUDA side we use nvrtc (see cuda_codegen.cpp), and on the CPU side we use LLVM (llvm_codegen.cpp). The intermediate representation used by those backends is a fairly straightforward AST so hopefully it's not too convoluted. Also on the CPU front, if it's desirable to use C++ as an IR, it should be possible to do so entirely in-process using libclang (which would circumvent some of the deployment headaches @Balandat mentions in the description).

bcharlier commented 3 years ago

Hi there,

thank you for your interesting post! I confess to be a bit jealous of the JIT compilation framework like numba or torch. I wish we can get rid of our good old compilation stack with cmake/nvcc.

It could save us a lot of energy to have a discussion with someone able to give us a JIT 101 talk...

Before that, to make keops work on your exotic environment: what about precompile the binaries a bit like in issue #85?

Balandat commented 3 years ago

@bcharlier that's an interesting approach, but it seems that this would only work for deploying models that use KeOps where you know all the "formulas" that you are going to encounter and can precompile them? Specifically, if this were to be used in an interactive fashion where the user implements new models and hence "formulas" this won't work right?

Balandat commented 3 years ago

cc @jacobrgardner, @gpleiss, @wjmaddox

joanglaunes commented 3 years ago

Hello, I have done some experiments along these lines and have a question for JIT experts ;) As @jeanfeydy explained, we are in the process of rewriting the formula and autodiff mechanism in Python, so that we can produce simple C++ code without any templates. This allows already to avoid cmake and speed up compilation times a lot. Then I have also managed to compile this code using nvrtc ; which produces a ptx file, which is then loaded as a cuda kernel and launched. So everything works great and the compilation is even faster than with nvcc, so far this is great. Now my problem is that when I call this c++ code from Python for a formula that has been already compiled, it will read the ptx file from the disk, then load it again as a Cuda kernel, and this operation takes around 0.1s, and of course it is not acceptable to have such a delay for every KeOps call. So what is the proper way to avoid this delay ? The only way I see is maybe to launch the main dll at the "import pykeops" statement, and then keep it alive in some way during all the session, so that it can keep in memory the cuda kernels, and then communicate with it from Python (I guess the way I explain this is very naive sorry, I don't know much in this area). This way the delay would occur only at the first use of the formula in the Python session, and of course it becomes completely ok since it is comparable with the delay for launching our dll in our current framework. Is that the way this is handled in PyTorch ? Or is there a simpler way to proceed ?

Balandat commented 3 years ago

This sounds like great progress, glad to hear this! Maybe @bertmaher or @ngimel can help with your follow-up question?

ngimel commented 3 years ago

I guess I don't understand why ptx file needs to be serialized on disk instead of being kept in RAM? Here's an example of how launching jit-compiled kernels is done in pytorch https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp. They have a CudaCodeGen object that corresponds to one kernel. CompileToNVRTC method produces function_ that later is and launched by call. There is another piece of code somewhere to look up correct CudaCodeGen object corresponding to the current compilation. It does have some overhead, but that's comparable to regular kernel launch, and definitely not 0.1 s.

joanglaunes commented 3 years ago

Thanks @ngimel for your reply. I think I understand roughly how things work in this cuda_codegen.cpp file, but my question is more about the interplay between Python and C++. What I do not understand is how these CudaCodeGen objects (or similar) can be kept alive in RAM between consecutive calls to the dll from the Python script. In what I have tried, the first time the KeOps dll is launched from the Python script, it performs the compilation and launches the kernel, then exits, so every object created in RAM is lost. This is why currently I need to save the ptx file to disk, so that when the dll is launched a second time from Python for the same formula, it can read the ptx file instead of recompiling.

ngimel commented 3 years ago

Nvidia announced python wrappers for driver and nvrtc apis, which also might be useful https://developer.nvidia.com/blog/unifying-the-cuda-python-ecosystem/

Balandat commented 2 years ago

@joanglaunes, @jeanfeydy, checking in here, were you able to make additional progress on this?

joanglaunes commented 2 years ago

Hello @Balandat , thanks for your message, yes actually we have made big progress on this point, I should have updated you earlier. We have been working on the branch python_engine, and it has been basically ready for merging in master and for doing a new release for a few weeks, but we need to do some more testing and lack time right now. Also, since it changes completely the compilation workflow (we eliminated cmake and nvcc compiler), we are a bit afraid that the update will break other projects, although there is no change in the user interface. Here is a short summary of what's new in this branch :

All the core code of KeOps, which was mainly done using C++ templates, is now done in Python, meaning that whenever a new formula needs to be compiled, Python is used for metaprogramming, outputing a c++/cuda code free of templates. This makes the complete workflow (metaprogramming plus compilation) way faster than it was previoulsy using cmake and templates, around one second or less sometimes, whereas it was more around the minute previously. This was our main motivation in fact, together with the fact that coding is cleaner and much easier using python than with templates.
For the Gpu part, the compilation is done using nvrtc, and I managed to solve the speed issues I had (cf previous comments).
For the Cpu part, I have not coded an equivalent way for doing JIT compilation like nvrtc, so the compilation is done via a call to a standard command line compiler. Since you told me you know how to do JIT compilation via NNC, I would be happy to have some advices or help on this !
Also, to do the interface between the python wrapper and the main dll, I use a package named cppyy. I had previously used Ctypes but it is slower and I could not keep the main program alive during consecutive calls, which was a key problem. I don't know if this cppyy requirement is a bottleneck, if it is then there may be other ways using Python extensions, but it requires more work.. To install the python_engine branch via pip, to use it as if it was the current release, you can do: pip install git+https://github.com/getkeops/keops.git@python_engine I think it would be great if we could have your feedback about this. Would yo have time to test the branch and let us know, first if it works, and also if you think it will help regarding your initial question ? Also, as I said earlier if you can help us for the JIT compilation in Cpu mode, it would be great!

Balandat commented 2 years ago

Hey @joanglaunes, thanks a lot for the update. Looks like you have indeed made a lot of progress, chapeau!

I looked a bit into cppyy, seems like it's itself made up of a few libraries. I'll see how hard it is to get those to build on our infra (unfortunately we can't just pip install stuff). But at least it seems that there is no runtime dependency on cmake or other build systems anymore (well except for the current CPU setup), which is great.

Balandat commented 2 years ago

Hmm I realize cppyy also requires https://github.com/wlav/cppyy-backend, which means that building this from source requires building LLVM. So unfortunately I don't think this setup will work for my use case (I might be wrong and will check to verify though).

joanglaunes commented 2 years ago

Ok I see... Actually I am not that happy to have this dependency on cppyy, it chose it more because I could get things done easily with it, but sometimes the pip installation of cppyy fails on some systems. Do you have an idea about which framework to do the interface would be lighter and ok in your case ? One possibility maybe would be to use the "Cuda Python" wrappers that @ngimel pointed out, although it is restricted to the Gpu case.

getkeops / keops

Alternatives to using nvcc / cmake for codegen? #130