NVIDIA / jitify

A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC).

load_program() performance (with large include hierarchies) #90

Open Robadob opened 3 years ago

Robadob commented 3 years ago

When I include the main glm header within an RTC source, the runtime of load_program() increases to over 60 seconds (from what was already ~10 seconds), so I started investigating the problem.

I found that, to discover include files, Jitify calls nvrtcCompileProgram(), catches the failure, parses the missing include from the error log, and then repeats the process until compilation succeeds or fails for a different reason.
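To illustrate, here's a simplified sketch of that pattern as I understand it (not Jitify's actual code; the error-log wording and the naive load-from-disk step are assumptions on my part):

```cpp
#include <nvrtc.h>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Keep retrying compilation, learning one missing header per failed attempt.
bool compile_with_header_discovery(const std::string& src,
                                   std::vector<std::string>& names,
                                   std::vector<std::string>& contents) {
  for (;;) {
    std::vector<const char*> name_ptrs, content_ptrs;
    for (const auto& n : names) name_ptrs.push_back(n.c_str());
    for (const auto& c : contents) content_ptrs.push_back(c.c_str());

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "source.cu", (int)names.size(),
                       content_ptrs.data(), name_ptrs.data());
    nvrtcResult result = nvrtcCompileProgram(prog, 0, nullptr);

    size_t log_size = 0;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(prog, &log[0]);
    nvrtcDestroyProgram(&prog);

    if (result == NVRTC_SUCCESS) return true;

    // The exact log wording is an assumption; something like:
    //   catastrophic error: cannot open source file "glm/glm.hpp"
    const std::string tag = "cannot open source file \"";
    size_t begin = log.find(tag);
    if (begin == std::string::npos) return false;  // failed for another reason
    begin += tag.size();
    size_t end = log.find('"', begin);
    std::string missing = log.substr(begin, end - begin);

    // Load the newly-discovered header (simplified: treat its name as a path)
    // and go around again, re-parsing every header loaded so far.
    std::ifstream file(missing);
    names.push_back(missing);
    contents.emplace_back(std::istreambuf_iterator<char>(file),
                          std::istreambuf_iterator<char>());
  }
}
```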

GLM's include hierarchy adds ~109 separate glm header files, and we have ~20 internal library headers which are included too (although we recently pruned those ~20 down to ~14 dynamically, for a small improvement). (I calculated these values from pcpp's output, so they might be a little off, as I haven't tested the flattened file it produced.)

The problem is that each call to nvrtcCompileProgram() reparses the include hierarchy seen so far, so the cost grows from an initial ~100ms to ~600ms as each header is added. I logged 198 failed calls to nvrtcCompileProgram(). This is highly inefficient, leading to load_program() taking 60+ seconds, with the final successful NVRTC call taking around 1 second.

In comparison, pcpp was able to pre-process the full include hierarchy in ~3 seconds. So it's fair to assume the program could theoretically be compiled in ~4 seconds with appropriate include pre-processing, roughly 15x faster.

In our use-case, we perform multiple individual RTC compilations with the same include hierarchy, so we pay this unnecessary cost repeatedly. The worst case is our test suite: 85 minutes total with glm included in all agent functions, 25 minutes with glm only included where used, and 11 minutes with the glm tests and include disabled. Even in that last case, probably 10 minutes are spent doing RTC for the small number of RTC tests (most of our RTC tests are in our Python test suite).

I'd be interested to know your thoughts on whether Jitify can address this, or whether NVRTC could be extended by the team that develops it to load include files from disk itself (given that implementing a full pre-processor within Jitify is impractical), perhaps also mimicking Jitify's behaviour of ignoring missing includes.

The optimal performance in our case (FLAMEGPU2) would probably be achieved by pre-processing and flattening the header hierarchy we pass to all RTC sources (to avoid even the ~3s of header parsing/loading each time). However, the high compile cost of include processing might be scaring other users away from RTC if they find it via Jitify and build things with larger include hierarchies. (In our case, we've always had the ~20 library headers and assumed the ~10s expense was just an unfortunate byproduct of RTC for our non-trivial use-case.)

Irrespective of NVRTC, Jitify might be able to reduce the cost of multiple compiles by caching loaded headers and always passing them to NVRTC regardless of whether they're required. This would still leave the first compilation costly, but would make any subsequent compilations with the same headers much cheaper.
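At the NVRTC level, what I'm imagining is something along these lines (a rough sketch with illustrative names, not a concrete proposal): the discovery or flattening cost is paid once, and every later compile reuses the cached set in a single call.

```cpp
#include <nvrtc.h>
#include <string>
#include <vector>

struct HeaderCache {
  std::vector<std::string> names;     // include names, e.g. "glm/glm.hpp"
  std::vector<std::string> contents;  // matching file contents, loaded once
};

nvrtcResult compile_with_cached_headers(const std::string& src,
                                        const HeaderCache& cache,
                                        nvrtcProgram* prog_out) {
  std::vector<const char*> name_ptrs, content_ptrs;
  for (const auto& n : cache.names) name_ptrs.push_back(n.c_str());
  for (const auto& c : cache.contents) content_ptrs.push_back(c.c_str());

  nvrtcCreateProgram(prog_out, src.c_str(), "source.cu", (int)cache.names.size(),
                     content_ptrs.data(), name_ptrs.data());
  // A single compile call, whether or not every cached header is actually needed.
  return nvrtcCompileProgram(*prog_out, 0, nullptr);
}
```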


As a quick test, I was able to flatten the majority of our includes using pcpp, which reduces the number of failed calls to nvrtcCompileProgram() from 198 down to 32. I believe the remaining 32 are due to our dynamic header, which has a delayed include that can't be flattened in the same way, and the various system includes which I haven't flattened yet (there are around 60, but I think they're mostly repeats). But already the RTC time was down to ~8 seconds.

benbarsdell commented 3 years ago

Thanks for this feedback, it's an important issue.

The situation is a bit tricky. At its heart is the fact that having NVRTC load headers implicitly from the filesystem (which it does support) is inflexible in the following ways:

1. Headers can only be loaded from the filesystem, not from Jitify's builtin header strings, user-embedded strings, or callbacks like Jitify currently supports.
2. Headers cannot be hot-patched before they're compiled (Jitify does this to apply some workarounds).
3. Missing headers cause compiler errors (Jitify currently treats them as warnings instead, although we will probably change this in future).
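For concreteness, the implicit route looks roughly like this at the NVRTC API level (the directory paths are illustrative), and it gives up all of the flexibility listed above:

```cpp
#include <nvrtc.h>

// Sketch: pass no headers explicitly and let NVRTC's own preprocessor
// resolve #include directives from the given directories on disk.
nvrtcResult compile_with_implicit_headers(const char* src) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src, "source.cu", 0, nullptr, nullptr);
  const char* opts[] = {
      "--include-path=/path/to/our/headers",  // illustrative
      "--include-path=/path/to/glm",          // illustrative
  };
  // No hot-patching, no in-memory headers, and a missing header is a hard error.
  return nvrtcCompileProgram(prog, 2, opts);
}
```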

For these reasons, our current approach is to identify and pre-load all headers and pass them to NVRTC explicitly. This gives us full control over how headers are handled, but requires a means of identifying the required headers. As you observed, the current technique to do this (iteratively recompiling and parsing the error messages) is very inefficient.

There are two general ways we could go about solving this:

a) Implement a more efficient method for identifying header dependencies.
b) Abandon the explicit header management approach and just let NVRTC load them instead.

I've been thinking about (a) for a little while and there might be some things we can do, but it might take some time.

On the other hand, it may be time to reconsider using (b). E.g., we could:

- Use your suggestion of simply passing all of Jitify's internal headers to NVRTC regardless of whether they're needed (probably not a problem?),
- Abandon the load-from-callback feature (I'm OK with this),
- Abandon treating missing headers as warnings (I'm OK with this), and
- Abandon hot-patching headers (maybe doable).

We'll think about these ideas. Let us know if you have any additional feedback.

Another thing to consider is that we've been moving toward a new workflow where all sources and headers are pre-processed offline and shipped with the application. I'd be interested to hear if such a workflow would work well for your application or not. (It does require that all the headers have redistributable licenses).

Robadob commented 3 years ago

Thanks, I'm glad this is something you've thought about.

> a) Implement a more efficient method for identifying header dependencies.

This was my initial thought, but as I thought about it more it simply seems impractical. To correctly identify all header dependencies, you would need a full C pre-processor. Anything short of that creates edge cases where it's either too aggressive or too lax (finding unwanted headers, or missing wanted headers), which is bound to break a small number of the header libraries people may want to (try and) use (e.g. supporting GLM took a few changes to both libraries).

Writing your own C pre-processor seems like a bad idea: full support would include parsing and handling things like arithmetic and bitwise operators, and it creates a new complex thing to support and maintain. Similarly, leaning on a 3rd-party pre-processor would probably break from your current ethos of 'Jitify is a single-file header library'. Which is what led to my thought of leaning on NVRTC (your option (b)), which presumably uses a fork of Clang's pre-processor, similar to NVCC.


> b) Abandon the explicit header management approach and just let NVRTC load them instead.

> 1. Headers can only be loaded from the filesystem, not from Jitify's builtin header strings, user-embedded strings, or callbacks like Jitify currently supports.

We currently generate one dynamic header, which is not on the filesystem, so this would create a problem for us. However, I expect we could find a way to prepend that header to the user's provided source (similar to how we already prepend the required includes).

> - Use your suggestion of simply passing all of Jitify's internal headers to NVRTC regardless of whether they're needed (probably not a problem?),

Yes, I hope this is possible, even if it means storing them on disk and specifying the directory as an include dir. Worst case, you could potentially raise it with the NVRTC team to find a compromise, e.g. loading from both in-memory and disk, giving in-memory headers higher priority.

> - Abandon the load-from-callback feature (I'm OK with this),

I'm not familiar with this feature, so we've probably not had reason to use it in our use-case. But if it's what I'm assuming, then dropping support would move usage closer to a traditional compiler, which seems reasonable.

> - Abandon treating missing headers as warnings (I'm OK with this), and
> - Abandon hot-patching headers (maybe doable).

Are these not the same thing? e.g. Jitify sees a missing header, comments it out and tries again? Warnings/errors about missing headers then aren't visible (by default?) unless Jitify's compilation loop reaches a fail state.

This is possibly the difficult point, but I presume the existing behaviour could be maintained by instead providing NVRTC with an empty string/file for the missing header, and repeating compilation similar to the existing loop.

It could then also be useful to clearly document a method for a user to specify/dummy missing headers ahead of time, to allow users with a fixed header stack to streamline compilation. These dummied headers could also optionally be cached, so subsequent compilations during runtime attempt to use them. (e.g. in our use-case, we explicitly keep everything as separate compilation units, as it allows us to dynamically generate one header, which greatly improves performance, even over non-RTC builds.)
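At the NVRTC level I'm imagining something along these lines (a rough sketch; the header names are purely illustrative and this isn't how Jitify currently implements it):

```cpp
#include <nvrtc.h>

// Sketch: satisfy the preprocessor by registering known-missing headers with
// empty contents, and supply our runtime-generated header the same way.
void compile_with_stub_headers(const char* src, const char* dynamic_header_src) {
  const char* names[] = {
      "some_missing_header.h",  // known to be missing; stubbed with empty contents
      "dynamic_header.cuh",     // generated in memory at runtime, never on disk
  };
  const char* contents[] = {"", dynamic_header_src};
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src, "source.cu", 2, contents, names);
  nvrtcCompileProgram(prog, 0, nullptr);
  nvrtcDestroyProgram(&prog);
}
```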


> Another thing to consider is that we've been moving toward a new workflow where all sources and headers are pre-processed offline and shipped with the application. I'd be interested to hear if such a workflow would work well for your application or not. (It does require that all the headers have redistributable licenses).

Pre-processing source offline seems antithetical to the purpose of runtime compilation. Surely, if you're shipping a compiled binary, you might as well use NVCC, the only benefit being that the user doesn't additionally require gcc/Visual Studio?

In our main use case for RTC, users implement their own model via our Python interface by writing agent functions in CUDA. So they are writing/developing/running the CUDA source from within Python, rather than running any CUDA we explicitly provide to them; they simply use the API we provide, which is what's within the header hierarchy. So a pre-process stage seems redundant, because they would need to run it directly before running the model. We do already cache the RTC PTX(?) to disk, so repeated runs with unchanged source can bypass RTC, but in development recompilation is regularly required.

As I said in my original post, it's possible we could pre-process (or rather flatten) most of our includes. The only external (non-NV) headers we use are GLM's, which are under an MIT license. If pre-processing were to include compilation though, we'd probably avoid that, as our headers are heavily templated and ideally fully inlined at compile time.

Not sure this is relevant, but it's possible we will have users who wish to include additional header libraries, for some advanced maths functionality or similar. But that's probably an extreme edge case at this point (and they'd probably run into difficulties similar to those I had getting GLM to work).

benbarsdell commented 3 years ago

> We currently generate one dynamic header, which is not on the filesystem, so this would create a problem for us.

This will still be possible by providing the header explicitly.

> e.g. loading from both in-memory and disk, giving in-memory headers higher priority.

NVRTC already works like this, actually. There are some potential issues with name collisions, but apart from that it should work.

> if you're shipping a compiled binary

It doesn't ship a compiled binary; it just ships all of the source and headers along with (e.g., embedded in) the application. The offline preprocessing step just does what preprocessing currently does: it identifies and loads (and patches) all of the required headers and packages them together (similar to flattening) for shipping with the application. You can still add extra headers and instantiate templates at runtime.

> users who wish to include additional header libraries

We can probably support this but I'll have to test it to make sure.

leofang commented 3 years ago

Hi @Robadob, I also identified this performance issue in CuPy a while ago. If you're still on Jitify v1, and if the slowdown you're seeing happens every time a new kernel is being compiled but most of the headers it sees are identical (i.e. your headers can be cached and reused), then I figured out a workaround that @benbarsdell merged for me: #82.

Understanding our workaround requires a bit of context: CuPy already heavily used NVRTC, probably since before Jitify started, so all we need is Jitify's compile-fail-search loop and its (amazing) patches to the C++ std headers; we don't need the rest of its functionality, such as compilation and the kernel cache, as we already have all of that.

With this in mind, our workaround uses Jitify's internal function directly and caches all the headers that Jitify found, so that we only pay the search cost once. Future compilations just reuse the searched headers. In fact, based on your description, I think you could just preload all of your headers, which would allow you to avoid even the first-time cost.

If you're interested, the relevant code in CuPy is here: https://github.com/cupy/cupy/blob/master/cupy/cuda/jitify.pyx https://github.com/cupy/cupy/blob/7ae60a6139586cc1b12bf9674d0a7d6f10d127ed/cupy/cuda/compiler.py#L218 (Sorry if you're not a Python/Cython programmer. I am extremely overwhelmed this week and next, so I can't explain more right now, but I'm happy to chat offline after that.)

I've been wondering if the same strategy also applies to Jitify v2, but unfortunately I likely do not have time to look into this for now.

Robadob commented 3 years ago

@benbarsdell

> It doesn't ship a compiled binary; it just ships all of the source and headers along with (e.g., embedded in) the application. The offline preprocessing step just does what preprocessing currently does: it identifies and loads (and patches) all of the required headers and packages them together (similar to flattening) for shipping with the application. You can still add extra headers and instantiate templates at runtime.

This sounds like it could work for us then; it's just a case of working out how we'd fit it into our build system. (Not likely to be a problem, it would just mean extending the CMake scripts.)

> We can probably support this but I'll have to test it to make sure.

As I said, it's a possibility, but far from a priority to us.

@leofang

Your workaround makes sense, and it would probably be more straightforward for us to apply. However, as you identify, we would still have the ~60s compilation of the first unit. So this would be great for our test suite and for large models with many units to compile, but still somewhat prohibitive for smaller models with only a small number of units to compile.

We're still using Jitify 1, and interface with it via C++ (our Python interface is mostly generated by SWIG). But the code you've provided will be useful if we decide to go this route.


Thanks both of you, I'll discuss these options with my team in the coming weeks.

Robadob commented 2 years ago

So I started looking at this again today. I can't make much sense of #82, but from Leo's other sources it appears they're using Jitify differently to us (we construct a jitify::experimental::Program object).

However, I did find that if we specify the expected header file names (based on the keys from _sources within jitify::experimental::Program), it can significantly improve compilation times. As an example, I was simply running our test suite; here are the timing improvements for the first 3 RTC tests.

I've not looked at the tests in detail, but I would assume the first compiles a single kernel, and the second perhaps four.
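For reference, this is roughly what I mean, assuming I'm reading the headers argument of jitify::experimental::Program correctly (i.e. each entry is either a header name to pre-load, or a "name\ncontents" string); the header names below are illustrative:

```cpp
#include <string>
#include <vector>
#include "jitify.hpp"

// Rough illustration only: pre-seed the program with the header names we expect,
// taken from the _sources keys of a previous compilation, so Jitify can load them
// up front instead of discovering them one failed compile at a time.
jitify::experimental::Program make_program(const std::string& kernel_src,
                                           const std::string& dynamic_header_src) {
  std::vector<std::string> headers = {
      "flamegpu/runtime/DeviceAPI.cuh",             // illustrative names, copied from
      "flamegpu/runtime/messaging/MessageNone.h",   // a previous program's _sources keys
      "dynamic_header.cuh\n" + dynamic_header_src,  // in-memory header as "name\ncontents"
  };
  std::vector<std::string> options;  // our usual include dirs / compiler flags go here
  return jitify::experimental::Program(kernel_src, headers, options);
}
```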

The 3rd test, however, uses glm, the initial reason I identified this problem. Internally, GLM uses relative paths for its includes. This causes the above-mentioned _sources map to contain duplicates of a number of the headers, where they are included by files from different sub-directories. Similarly, I doubt I can specify the matching relative paths to optimise this load without specifying each of the subdirectories of the include hierarchy as an include dir, which seems like a bad idea.

Not sure I'll continue playing with this next week; it was just a quick test of something that came up in discussion with someone from my team.