Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

CUDA transform of a NeVA graph #1142

Closed: tfogal closed this issue 3 weeks ago

tfogal commented 1 month ago

🐛 Bug

Inside the CUDAGraphTransform's to_arg_descriptor code, zip(*map(extract_descriptor, args)) does not yield the four sequences it expects, which leads to a tuple-unpacking ValueError.

Traceback (most recent call last):
  File "/home/tfogal/dev/nemo/./examples/multimodal/multimodal_llm/neva/neva_pretrain.py", line 484, in thunder_graph_backend4
    fqn(*args) # run once to force compilation
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1748, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tfogal/dev/thunder/thunder/core/module.py", line 81, in forward
    res = self._forward_fn(*args, **kwargs)
  File "/home/tfogal/dev/thunder/thunder/__init__.py", line 787, in fn_
    result = cache_entry.computation_fn(*inps)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "thunder.computation_261", line 8, in computation
    [scale_t, getitem] = CUDAGraph1()
  File "/home/tfogal/dev/thunder/thunder/transforms/cudagraph.py", line 193, in callable
    return self.call_cuda_graph(fn_name, *args)
  File "/home/tfogal/dev/thunder/thunder/transforms/cudagraph.py", line 119, in call_cuda_graph
    cache_key = self.make_cache_key(fn_name, *args)
  File "/home/tfogal/dev/thunder/thunder/transforms/cudagraph.py", line 114, in make_cache_key
    return (fn_name, self.make_static_inputs_mask(fn_name, *args), to_arg_descriptor(*args))
  File "/home/tfogal/dev/thunder/thunder/transforms/cudagraph.py", line 36, in to_arg_descriptor
    dtypes, sizes, strides, non_tensor_args = zip(*map(extract_descriptor, args))
ValueError: not enough values to unpack (expected 4, got 0)
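
The failure mode is easy to reproduce in isolation: when args is empty, zip(*map(...)) produces nothing at all, so unpacking into four names fails. A minimal sketch, independent of Thunder (extract_descriptor here is a stand-in for the real helper):

```python
# Stand-in for the real extract_descriptor, which returns
# (dtype, size, stride, non_tensor_arg) for each argument.
def extract_descriptor(arg):
    return (None, None, None, arg)

args = ()  # an empty argument list, as produced for this graph
try:
    dtypes, sizes, strides, non_tensor_args = zip(*map(extract_descriptor, args))
except ValueError as e:
    # zip() over an empty map yields zero tuples, so the four-way
    # unpacking raises: not enough values to unpack (expected 4, got 0)
    print(e)
```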

To Reproduce

Run the NeVA model (#343) and apply the graph transform:

    if use_cuda_graph(thunder_graphs):
        xform = thunder.transforms.cudagraph.CUDAGraphTransform()
        fqn = thunder.jit(gm, transforms=[xform])
    else:
        fqn = thunder.jit(gm)
    fqn(*args)  # run once to force compilation

use_cuda_graph just takes a counter of which graph we're seeing, and it returns true only for graph 37 (at least for this bug).
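
For context, a hypothetical version of that gating helper might look like the following; the real implementation in the pretraining harness may differ:

```python
def use_cuda_graph(counter: int) -> bool:
    """Apply the CUDAGraphTransform only to graph 37, which triggers this bug."""
    return counter == 37
```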

Code sample

TomF to edit this in later

Expected behavior

Transform the program to use CUDA graph(s), or throw a reasonable error explaining why that's not possible.

Environment

Using thunder 4550f6bf231885cb6745fd8d705357c32ad0f2d0.

$ nvidia-smi | grep "CUDA V"
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |

Additional context

We're not yet sure launch latency is on the critical path for NeVA; I am just experimenting to see. Measuring the impact a CUDA graph has on launch latency could help confirm this.

t-vi commented 1 month ago

Thank you for the report, @tfogal

If args is empty, this will give an empty sequence: https://github.com/Lightning-AI/lightning-thunder/blob/50f587d6ff6c17a8f8392c57a0e6a73b0fe298fb/thunder/transforms/cudagraph.py#L36 and so the unpacking fails.

We should handle empty args by special-casing them and setting the four lhs items to empty tuples.

I'll send a PR.
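
A minimal sketch of that special case (hypothetical; the actual PR may differ, and extract_descriptor here is a stand-in for the real helper, which inspects tensor dtype/shape/stride):

```python
def to_arg_descriptor(*args):
    # Stand-in for the real descriptor extraction.
    def extract_descriptor(arg):
        return (type(arg).__name__, None, None, arg)

    # Special case: with no args, zip(*map(...)) yields nothing, and the
    # four-way unpacking below would raise
    # "not enough values to unpack (expected 4, got 0)".
    if not args:
        return (), (), (), ()

    dtypes, sizes, strides, non_tensor_args = zip(*map(extract_descriptor, args))
    return dtypes, sizes, strides, non_tensor_args
```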

t-vi commented 1 month ago

The other trouble we'll run into is that the autograd passes delete the marks of static inputs. This needs fixing and is on our list, but it is quite a refactor. #1138

t-vi commented 3 weeks ago

@tfogal with #1324 fixing the immediate issue, I would close this, or assign it back to you for an update. Sorry for taking so long.