Closed: nikitaved closed this 2 days ago
Nice! This change is needed to make my PR https://github.com/Lightning-AI/lightning-thunder/pull/214 work correctly with CUDA Graphs: there I put `torch.autograd.Function.apply` into the forward trace, but it should be executed outside of the CUDA Graph-captured region.
Let's merge and fix anything that needs fixing later.
As per title. Fixes https://github.com/Lightning-AI/lightning-thunder/issues/635.
Also, it fixes the following subtle bugs:
`CUDAGraphExecutor`
- does not properly update static buffers when the same graph is invoked on inputs whose meta-data matches a cached graph but whose underlying storage differs. The area of concern: training and the backward pass.

`horizontal_merge` in the fusion logic
- does not consider precedence between ops horizontally when grouping bound symbols. This is not an issue with nvFusions, but it could cause issues when deciding whether to place something like `del x` after `op(x)` in a custom `FusionExecutor`. The fix sorts the bsyms in each group by their trace position (the trace is expected to be toposorted prior to any fusions) and hence restores the inter-/intra-bsym-group topological order.
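The first bug can be illustrated with a toy stand-in for a captured graph (a hypothetical sketch, not the actual `CUDAGraphExecutor` code): a captured region always reads from the static input buffers it was recorded with, so replaying it on fresh tensors requires copying their data into those buffers first, even when shapes and dtypes match the cache key.

```python
class CachedGraph:
    """Toy stand-in for a captured CUDA graph (hypothetical sketch).

    The "captured" computation always reads from `static_inputs`, so a
    replay with new inputs must refresh those buffers in place; skipping
    the copy reuses stale data whenever the new inputs have matching
    meta-data but different storage.
    """

    def __init__(self, fn, static_inputs):
        self.fn = fn
        self.static_inputs = static_inputs          # buffers baked into the graph
        self.static_outputs = fn(*static_inputs)    # "capture" once

    def replay(self, *args):
        # The fix: copy the new inputs' data into the static buffers
        # before replaying, keeping the captured buffer identities.
        for buf, arg in zip(self.static_inputs, args):
            buf[:] = arg
        self.static_outputs[:] = self.fn(*self.static_inputs)
        return self.static_outputs
```

With the copy in place, a replay on tensors that merely share meta-data with the capture inputs still produces results for the new data.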
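The second fix can be sketched as a small helper (hypothetical names, not the actual `horizontal_merge` code): given fusion groups of bound symbols and the flat, already-toposorted trace, sorting each group by trace position restores the topological order inside every group.

```python
def restore_topological_order(groups, trace_bsyms):
    """Sort the bound symbols in each fusion group by their position in
    the (toposorted) trace, so intra-group order matches trace order.

    Hypothetical sketch: `groups` is a list of lists of bound symbols,
    `trace_bsyms` the flat trace they were grouped from.
    """
    position = {bsym: i for i, bsym in enumerate(trace_bsyms)}
    return [sorted(group, key=position.__getitem__) for group in groups]
```

For example, a group that ended up as `["del x", "op(x)"]` after horizontal merging is reordered so that `op(x)` precedes `del x`, matching the trace.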