Closed: nikitaved closed this 2 days ago
Nice! This change is needed to make my PR https://github.com/Lightning-AI/lightning-thunder/pull/214 work correctly with CUDA Graphs: there I put `torch.autograd.Function.apply` into the forward trace, but it should be executed outside of the CUDA Graph-captured region.
Let's merge and fix anything that needs fixing later.
As per title. Fixes https://github.com/Lightning-AI/lightning-thunder/issues/635.
Also, it fixes the following subtle bugs:
`CUDAGraphExecutor`
- does not properly update static buffers when the same graph is invoked on inputs whose meta-data matches a cached graph but whose underlying storage differs. The area of concern: training and the backward pass.

`horizontal_merge` in the fusion logic
- does not consider precedence between ops horizontally when grouping bound symbols. This is not an issue with nvFusions, but it could cause issues when deciding whether to place something like `del x` after `op(x)` in a custom `FusionExecutor`. The fix sorts the bsyms in each group by their trace position (the trace is expected to be toposorted prior to any fusions) and hence restores the inter-/intra-bsym-group topological order.
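The first bug can be illustrated with a toy stand-in for a captured graph (a hypothetical sketch, not the actual `CUDAGraphExecutor` code): a captured region always reads from the static input buffers it was recorded with, so replaying it on fresh tensors requires copying their data into those buffers first, even when shapes and dtypes match the cache key.

```python
class CachedGraph:
    """Toy stand-in for a captured CUDA graph (hypothetical sketch).

    The "captured" computation always reads from `static_inputs`, so a
    replay with new inputs must refresh those buffers in place; skipping
    the copy reuses stale data whenever the new inputs have matching
    meta-data but different storage.
    """

    def __init__(self, fn, static_inputs):
        self.fn = fn
        self.static_inputs = static_inputs          # buffers baked into the graph
        self.static_outputs = fn(*static_inputs)    # "capture" once

    def replay(self, *args):
        # The fix: copy the new inputs' data into the static buffers
        # before replaying, keeping the captured buffer identities.
        for buf, arg in zip(self.static_inputs, args):
            buf[:] = arg
        self.static_outputs[:] = self.fn(*self.static_inputs)
        return self.static_outputs
```

With the copy in place, a replay on tensors that merely share meta-data with the capture inputs still produces results for the new data.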
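The second fix can be sketched as a small helper (hypothetical names, not the actual `horizontal_merge` code): given fusion groups of bound symbols and the flat, already-toposorted trace, sorting each group by trace position restores the topological order inside every group.

```python
def restore_topological_order(groups, trace_bsyms):
    """Sort the bound symbols in each fusion group by their position in
    the (toposorted) trace, so intra-group order matches trace order.

    Hypothetical sketch: `groups` is a list of lists of bound symbols,
    `trace_bsyms` the flat trace they were grouped from.
    """
    position = {bsym: i for i, bsym in enumerate(trace_bsyms)}
    return [sorted(group, key=position.__getitem__) for group in groups]
```

For example, a group that ended up as `["del x", "op(x)"]` after horizontal merging is reordered so that `op(x)` precedes `del x`, matching the trace.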