Lightning-AI / lightning-thunder


`CUDAGraphExecutor` - limited to static graphs only #433

Open nikitaved opened 3 months ago

nikitaved commented 3 months ago

🐛 Bug

With https://github.com/Lightning-AI/lightning-thunder/pull/430 merged, we enable the use of `thunder.jit(..., use_cudagraphs=True)`. This wraps the forward callable in `thunder.cudagraphs.CUDAGraphExecutor`. This executor, however, assumes a static code structure. We might consider `torch.cuda.make_graphed_callables` as a safer option.
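A minimal toy sketch (plain Python, not Thunder's actual implementation) of why a capture-and-replay executor assumes a static graph: the ops executed on the first call are recorded to a tape and blindly replayed on later calls, so any data-dependent branch is frozen to whichever path the capture input took.

```python
# Toy capture/replay: record ops on the first call, replay them afterwards.
# All names here are illustrative, not Thunder's API.
def make_replayed(fn, example_input):
    tape = []  # ops recorded during the "capture" call

    class Recorder:
        def __init__(self, value):
            self.value = value
        def __add__(self, other):
            tape.append(lambda v: v + other)
            return Recorder(self.value + other)
        def __mul__(self, other):
            tape.append(lambda v: v * other)
            return Recorder(self.value * other)

    fn(Recorder(example_input))  # capture phase

    def replay(x):
        for op in tape:
            x = op(x)
        return x
    return replay

def static_fn(x):
    return x * 2 + 1

def dynamic_fn(x):
    # Data-dependent control flow: the tape only records the branch
    # taken during capture.
    return x * 2 if x.value > 0 else x + 100

graphed = make_replayed(static_fn, 3)
assert graphed(5) == 11          # replay matches eager: 5 * 2 + 1

graphed_dyn = make_replayed(dynamic_fn, 3)  # captures the `x * 2` branch
assert graphed_dyn(-4) == -8     # wrong branch replayed: eager would give 96
```

The last assertion is the failure mode at issue: replay silently computes the captured branch even when the input would have taken the other one.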

IvanYashchuk commented 2 months ago

It's an inherent property of CUDA Graphs to be restricted to static code. Newer CUDA Toolkit versions have a dynamic control flow feature (conditional nodes), but it is not yet usable from PyTorch: https://developer.nvidia.com/blog/dynamic-control-flow-in-cuda-graphs-with-conditional-nodes/. What additional safety does `torch.cuda.make_graphed_callables` provide?
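A toy sketch (plain Python, not the CUDA API) of the idea behind the conditional nodes mentioned above: both branches are captured into the graph up front, and a predicate evaluated at replay time selects which one runs, so control flow stays inside the captured graph.

```python
# Illustrative node types, not CUDA's actual graph node API.
class OpNode:
    def __init__(self, op):
        self.op = op
    def run(self, x):
        return self.op(x)

class CondNode:
    # Both branches are part of the captured graph; the predicate is
    # evaluated on every replay, unlike plain capture/replay which
    # freezes the branch taken at capture time.
    def __init__(self, pred, then_nodes, else_nodes):
        self.pred, self.then_nodes, self.else_nodes = pred, then_nodes, else_nodes
    def run(self, x):
        branch = self.then_nodes if self.pred(x) else self.else_nodes
        for node in branch:
            x = node.run(x)
        return x

def replay(nodes, x):
    for node in nodes:
        x = node.run(x)
    return x

# f(x) = x * 2 if x > 0 else x + 100, captured once, correct for any input:
graph = [CondNode(lambda v: v > 0,
                  [OpNode(lambda v: v * 2)],
                  [OpNode(lambda v: v + 100)])]
assert replay(graph, 5) == 10
assert replay(graph, -4) == 96
```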

nikitaved commented 2 months ago

No additional safety, I got it wrong. What I meant is that we could indeed make it a transform that decides which parts are safe to capture.

tfogal commented 1 month ago

triage review:

nikitaved commented 1 month ago

@tfogal, the executor is in good shape, just not complete. This means there is no "advanced" logic for handling data-dependent operations and fusion regions between graph breaks (with dynamic shapes, sometimes we can put a tensor into a fusion, sometimes we have to opt out). This bit was intentionally left in this state so that we could collect issues and shape our understanding of how to handle them in our use cases... To summarize, the original issue is still present, but we can fix it, at least partially, by modifying the fusion logic in the current executor.
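A hypothetical sketch of the partitioning described above: walk a trace and split it into capturable segments, inserting a graph break at each data-dependent op so that op runs eagerly outside the graph. The names (`partition`, `is_data_dependent`) and the string-based trace are illustrative, not Thunder's API.

```python
# Split a linear trace of ops into CUDA-graph-capturable segments,
# breaking at data-dependent ops that must run eagerly.
def partition(trace, is_data_dependent):
    segments, current = [], []
    for op in trace:
        if is_data_dependent(op):
            if current:
                segments.append(("capturable", current))
                current = []
            segments.append(("eager", [op]))  # graph break
        else:
            current.append(op)
    if current:
        segments.append(("capturable", current))
    return segments

# `nonzero` and `item` produce data-dependent shapes/values, so they
# force graph breaks in this toy example.
trace = ["matmul", "relu", "nonzero", "add", "item", "mul"]
segs = partition(trace, lambda op: op in {"nonzero", "item"})
assert segs == [
    ("capturable", ["matmul", "relu"]),
    ("eager", ["nonzero"]),
    ("capturable", ["add"]),
    ("eager", ["item"]),
    ("capturable", ["mul"]),
]
```

The "advanced" logic the comment alludes to would refine exactly this decision: with dynamic shapes, whether a tensor can join a capturable/fused segment or must fall out to an eager region can change from call to call.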