Lightning-AI / lightning-thunder


`CUDAGraphExecutor` - limited to static graphs only #433

Open nikitaved opened 3 months ago

nikitaved commented 3 months ago

🐛 Bug

With https://github.com/Lightning-AI/lightning-thunder/pull/430 merged, we enable the use of `thunder.jit(..., use_cudagraphs=True)`. This wraps the forward callable in `thunder.cudagraphs.CUDAGraphExecutor`. This executor, however, assumes a static code structure. We might consider `torch.cuda.make_graphed_callables` as a safer option.
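A minimal toy sketch (plain Python, not Thunder's actual implementation) of why a capture-and-replay executor assumes a static graph: the ops executed on the first call are recorded to a tape and blindly replayed on later calls, so any data-dependent branch is frozen to whichever path the capture input took.

```python
# Toy capture/replay: record ops on the first call, replay them afterwards.
# All names here are illustrative, not Thunder's API.
def make_replayed(fn, example_input):
    tape = []  # ops recorded during the "capture" call

    class Recorder:
        def __init__(self, value):
            self.value = value
        def __add__(self, other):
            tape.append(lambda v: v + other)
            return Recorder(self.value + other)
        def __mul__(self, other):
            tape.append(lambda v: v * other)
            return Recorder(self.value * other)

    fn(Recorder(example_input))  # capture phase

    def replay(x):
        for op in tape:
            x = op(x)
        return x
    return replay

def static_fn(x):
    return x * 2 + 1

def dynamic_fn(x):
    # Data-dependent control flow: the tape only records the branch
    # taken during capture.
    return x * 2 if x.value > 0 else x + 100

graphed = make_replayed(static_fn, 3)
assert graphed(5) == 11          # replay matches eager: 5 * 2 + 1

graphed_dyn = make_replayed(dynamic_fn, 3)  # captures the `x * 2` branch
assert graphed_dyn(-4) == -8     # wrong branch replayed: eager would give 96
```

The last assertion is the failure mode at issue: replay silently computes the captured branch even when the input would have taken the other one.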

IvanYashchuk commented 2 months ago

It's an inherent property of CUDA Graphs to be restricted to static code. Newer CUDA Toolkit versions have a dynamic control flow feature (conditional nodes), but it is not yet usable from PyTorch: https://developer.nvidia.com/blog/dynamic-control-flow-in-cuda-graphs-with-conditional-nodes/. What additional safety does `torch.cuda.make_graphed_callables` provide?
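A toy sketch (plain Python, not the CUDA API) of the idea behind the conditional nodes mentioned above: both branches are captured into the graph up front, and a predicate evaluated at replay time selects which one runs, so control flow stays inside the captured graph.

```python
# Illustrative node types, not CUDA's actual graph node API.
class OpNode:
    def __init__(self, op):
        self.op = op
    def run(self, x):
        return self.op(x)

class CondNode:
    # Both branches are part of the captured graph; the predicate is
    # evaluated on every replay, unlike plain capture/replay which
    # freezes the branch taken at capture time.
    def __init__(self, pred, then_nodes, else_nodes):
        self.pred, self.then_nodes, self.else_nodes = pred, then_nodes, else_nodes
    def run(self, x):
        branch = self.then_nodes if self.pred(x) else self.else_nodes
        for node in branch:
            x = node.run(x)
        return x

def replay(nodes, x):
    for node in nodes:
        x = node.run(x)
    return x

# f(x) = x * 2 if x > 0 else x + 100, captured once, correct for any input:
graph = [CondNode(lambda v: v > 0,
                  [OpNode(lambda v: v * 2)],
                  [OpNode(lambda v: v + 100)])]
assert replay(graph, 5) == 10
assert replay(graph, -4) == 96
```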

nikitaved commented 2 months ago

No additional safety, I got it wrong. What I meant is that we could indeed make it a transform that decides which parts are safe to capture.

tfogal commented 1 month ago

triage review:

nikitaved commented 1 month ago

@tfogal, the executor is in good shape, just not complete. This means there is no "advanced" logic for handling data-dependent operations and fusion regions between graph breaks (with dynamic shapes, sometimes we can put a tensor into a fusion, sometimes we have to opt out). This bit was intentionally left in this state so that we could collect issues and shape our understanding of how to handle them in our use cases... To summarize, the original issue is still present, but we can fix it, at least partially, by modifying the fusion logic in the current executor.
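A hypothetical sketch of the partitioning described above: walk a trace and split it into capturable segments, inserting a graph break at each data-dependent op so that op runs eagerly outside the graph. The names (`partition`, `is_data_dependent`) and the string-based trace are illustrative, not Thunder's API.

```python
# Split a linear trace of ops into CUDA-graph-capturable segments,
# breaking at data-dependent ops that must run eagerly.
def partition(trace, is_data_dependent):
    segments, current = [], []
    for op in trace:
        if is_data_dependent(op):
            if current:
                segments.append(("capturable", current))
                current = []
            segments.append(("eager", [op]))  # graph break
        else:
            current.append(op)
    if current:
        segments.append(("capturable", current))
    return segments

# `nonzero` and `item` produce data-dependent shapes/values, so they
# force graph breaks in this toy example.
trace = ["matmul", "relu", "nonzero", "add", "item", "mul"]
segs = partition(trace, lambda op: op in {"nonzero", "item"})
assert segs == [
    ("capturable", ["matmul", "relu"]),
    ("eager", ["nonzero"]),
    ("capturable", ["add"]),
    ("eager", ["item"]),
    ("capturable", ["mul"]),
]
```

The "advanced" logic the comment alludes to would refine exactly this decision: with dynamic shapes, whether a tensor can join a capturable/fused segment or must fall out to an eager region can change from call to call.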