Closed: ChrisRackauckas closed this issue 1 year ago.
Hi @ChrisRackauckas, I am not 100% sure what you mean by interacting with "the library": do you want to fuse Julia with MPGOS, or do you want to build up your own solver/library with Julia?
In any case, one should generate a single monolithic kernel; otherwise, the code has to load everything from global memory at every kernel launch. If you do that on every time step, the overhead is huge.
One of the things that was really beneficial for me was to load EVERYTHING into registers (even for a complex ODE like the Keller-Miksis equation, where I get some register spills). A register spill effectively means a load from global memory, but I have to load the variables from global memory anyway. In the case of the simple Lorenz system, all the data used fits into registers; however, if I do not pre-load these values explicitly into registers, the compiler does not do it automatically and cannot generate highly optimised code. The benefit of this technique was a huge simplification of the memory address calculations, and I saved tons of integer arithmetic. The performance increase was also significant, although I did not check whether the previous code was saturating the integer peak throughput. The code also became clearer. Later, I realised that adding the restrict qualifier to my pointers might have solved this issue, but by then I had already rewritten my code.
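A minimal sketch of the idea (not MPGOS code; the kernel name, memory layout, and the forward Euler stepper are placeholders of my own): the state is copied into registers once, all time stepping happens on the register copies, and global memory is touched only at the very start and the very end. The `__restrict__` qualifiers mark the pointers as non-aliasing, which can give the compiler a similar opportunity.

```cuda
__global__ void StepperSketch(double* __restrict__ X,
                              const double* __restrict__ P,
                              int NT, int nSteps, double dt)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= NT) return;

    // Load EVERYTHING this thread touches into registers once.
    double x = X[tid], y = X[tid + NT], z = X[tid + 2 * NT];
    double sigma = P[0], rho = P[1], beta = P[2];

    // Forward Euler on the Lorenz system, purely to illustrate the access pattern.
    for (int n = 0; n < nSteps; ++n)
    {
        double dx = sigma * (y - x);
        double dy = x * (rho - z) - y;
        double dz = x * y - beta * z;
        x += dt * dx;  y += dt * dy;  z += dt * dz;
    }

    // Touch global memory only once more, at the very end.
    X[tid] = x;  X[tid + NT] = y;  X[tid + 2 * NT] = z;
}
```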
To be able to do the above and allocate arrays in the device code, the array sizes have to be known at compile time. That is why I used template parameters. The consequence, however, is that you cannot have a static, precompiled driver code separate from the problem-dependent system code containing the ODE functions: they must be compiled together.
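A rough sketch of why the template parameter matters (the names here are illustrative, not the MPGOS interface): with the system dimension known at compile time, the per-thread state array can live in registers and the loops over it can be fully unrolled.

```cuda
template <int SD, class Precision>
__global__ void SolverKernelSketch(Precision* GlobalState, int NT)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= NT) return;

    // Compile-time size: the compiler can keep this array in registers
    // and fully unroll the loops over it.
    Precision X[SD];
    for (int i = 0; i < SD; ++i)
        X[i] = GlobalState[tid + i * NT];

    // ... time stepping on X, entirely in registers ...

    for (int i = 0; i < SD; ++i)
        GlobalState[tid + i * NT] = X[i];
}
// Because SD is baked in at compile time, the (otherwise generic) driver
// code and the problem-dependent system code must be compiled together.
```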
Some years ago, I tested a case where I had the driver code and the ODE functions in different C++ modules. Everybody says that compilers can now perform inter-module optimisations, but I still lost 30% of my performance even on the simple Duffing oscillator. This is another reason for compiling everything together.
I think playing with block sizes plays a minor role in the problems we are talking about. What is more important is to regulate the register spills and, with them, the occupancy. However, to use a different maximum register count, the code has to be recompiled. Since you want to metaprogram the solver from Julia, you can easily vary this as a parameter and suggest an optimised setup.
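For reference, these are the two standard CUDA knobs for capping register usage (the numbers below are arbitrary example values, not a recommendation); both require a recompile when changed, which is cheap if the kernel is generated and compiled from Julia anyway.

```cuda
// Option 1: a global per-compilation cap via the compiler flag, e.g.
//   nvcc -O3 --maxrregcount=64 Solver.cu
//
// Option 2: a per-kernel hint with __launch_bounds__; the compiler then
// limits register usage so that the requested occupancy is achievable.
__global__ void __launch_bounds__(64 /* maxThreadsPerBlock */,
                                  8  /* minBlocksPerMultiprocessor */)
SolverKernelWithBounds(double* X, int NT)
{
    // ... time stepping as in the sketches above ...
}
```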
You are already on this path, but I cannot stress enough that the asynchronous treatment of the systems is important, and not only from the performance point of view: you simply cannot solve certain problems without it.
Okay, this was a bit lengthy, but I hope it helps you choose a good direction.
What is the language of the code you generate in Julia?
> In any case, one should generate a single monolithic kernel; otherwise, the code has to load everything from global memory at every kernel launch. If you do that on every time step, the overhead is huge.
KernelAbstractions.jl already does that.
> What is the language of the code you generate in Julia?
It's direct to LLVM with its .ptx backend, though I could target C/C++ if necessary (with some more restrictions).
> I am not 100% sure what you mean by interacting with "the library": do you want to fuse Julia with MPGOS, or do you want to build up your own solver/library with Julia?
We're building an EnsembleAsyncGPU, but that's a different project. This issue is just for making EnsembleMPGOS be a dispatch that generates and solves an ensemble with MPGOS, which in theory should be possible.
Ok, I understand. Yes, theoretically it should be possible. If I can help you with MPGOS, just let me know.
Hi Chris, sorry for the occasionally long delays; November was terrible for me in terms of administration and online teaching duties.
Is it still an open topic for you to make MPGOS callable from Julia? If yes, it would be nice to build a kind of interface. However, we unfortunately have no experience (so far) with metaprogramming C++ from other languages. For instance, I cannot see how the full capabilities of MPGOS could be achieved this way (e.g. overlapping CPU-GPU computations). Therefore, if we decide to start this project, could we ask for some help at least on the Julia part, and discuss how we should modify the C++ code to get this whole project started?
Thanks a lot in advance.
Yes, I don't think we plan to tackle it immediately, but it's something to leave open until we find someone interested in it. We can metaprogram C++ from Julia with Cxx.jl, which allows for handling templates and all of that. Or we can use ModelingToolkit.jl to get the symbolic form of the equations (for a subset of problems) and use that to write the code for MPGOS. Either way, there's a way to make it work; it's not the easiest project, but I'll be looking for people (/students) interested in building such a bridge.
Okay, when you feel we should start this just let us know.
Not necessarily needed anymore, as the new EnsembleGPUKernel matches the performance by doing a similar kernel construction. See https://github.com/utkarsh530/GPUODEBenchmarks for details. Thanks for your benchmarks, they set us on the right path.
Hi Chris, nice work, I downloaded the arXiv paper yesterday, though I have not had time to read it yet. If you send me the raw data points, I will change the figure at www.gpuode.com.
@utkarsh530
Hi,
The data can be found here: https://github.com/utkarsh530/GPUODEBenchmarks/tree/main/paper_artifacts/data. You can use either the Tesla V100 or the RTX 5000 data points. The first column indicates the number of trajectories, with the second column giving the times in milliseconds.
@FerencHegedus @nnagyd have been putting a lot of work into tuning ensemble-parallel GPU CUDA-kernel explicit solvers for non-stiff ODEs, so we should probably make use of that. Our GPU stack can be used to generate device code from Julia functions, so that fixes what would probably be the most major hurdle. @FerencHegedus, could you suggest a way to wrap this, i.e. at what level would it be most beneficial for us to interact with the library? I think we can have a fixed driver code à la
https://github.com/FerencHegedus/Massively-Parallel-GPU-ODE-Solver/blob/master/Tutorials_SingleSystem_PerThread/T2_Lorenz/Lorenz.cu
that we pass things into, and then generate and compile the system code
https://github.com/FerencHegedus/Massively-Parallel-GPU-ODE-Solver/blob/master/Tutorials_SingleSystem_PerThread/T2_Lorenz/Lorenz_SystemDefinition.cuh
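For a sense of what the generated piece would look like, here is a hypothetical sketch of a Lorenz right-hand-side device function; the function name, template parameter, and argument layout are placeholders and not the actual MPGOS interface, which is defined in the linked Lorenz_SystemDefinition.cuh.

```cuda
// Hypothetical sketch of generated system code for the Lorenz system.
// Names and argument layout are illustrative only.
template <class Precision>
__forceinline__ __device__ void GeneratedOdeFunction(Precision* F,
                                                     const Precision* X,
                                                     Precision T,          // unused: autonomous system
                                                     const Precision* P)
{
    F[0] = P[0] * (X[1] - X[0]);         // sigma * (y - x)
    F[1] = X[0] * (P[1] - X[2]) - X[1];  // x * (rho - z) - y
    F[2] = X[0] * X[1] - P[2] * X[2];    // x * y - beta * z
}
```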
@vchuravy might need to help out here, and we might need to get MPGOS onto binary builder for this to work.