Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

NotImplementedError: requires_grad=True is not yet supported within thunder.compile #582

Closed mpatel31415 closed 2 months ago

mpatel31415 commented 3 months ago

🐛 Bug

A NotImplementedError is raised when running the following models:

Llama-3-70B, falcon-180B, longchat-13b-16k, CodeLlama-34b-hf, vicuna-7b-v1.5-16k, Mixtral-8x7B-v0.1

with the thunder_cudnn and thunder_inductor_cat_cudnn executors.

An error occurred: NotImplementedError – requires_grad=True is not yet supported within thunder.compile
[rank0]: NotImplementedError: requires_grad=True is not yet supported within thunder.compile

To Reproduce

Steps to reproduce the behavior (example for one model):

mkdir -p output
docker run --pull=always --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864  -v $PWD/output:/output -it INTERNAL_IMAGE:pjnl-20240607

Run in the container:

torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name longchat-13b-16k --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2 

Expected behavior

The model should run, or we should get an OOM error.

Environment

As in the Docker image

tfogal commented 3 months ago

cc @kshitij12345 this sounds related to your recent comment on #332.

kshitij12345 commented 3 months ago

Hi,

I am unable to repro this with either the mentioned pjnl-20240607 or the latest pjnl-20240613 Docker image.

In the container, I ran the command mentioned above, and it failed with OOM on both images:

torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name longchat-13b-16k --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2

I also tried running the same model with fewer layers using the following command, and it worked fine on both images:

torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name longchat-13b-16k --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2 --n_layer=15

Output of the second command:

iter 40: loss 4.6250, iter time: 1200.55ms, t: 16384
iter 41: loss 4.6250, iter time: 1205.85ms, t: 16384
iter 42: loss 4.6250, iter time: 1217.30ms, t: 16384
iter 43: loss 4.6250, iter time: 1214.74ms, t: 16384
iter 44: loss 4.6250, iter time: 1210.70ms, t: 16384
Model name: longchat-13b-16k
Seq Length: 16384
Micro BS: 1
Global BS: 8
Number of Layers: 15
Number of parameters: 0.64B
Distributed Mode: fsdp
Sharding Mode: zero2
Bucketing: none
Compiler: thunder_cudnn
Average iter time: 1206.71 ms
Memory used: 71.28 GB
Tokens/s: 108582.29
Tokens/s/GPU: 13572.79
TFLOP/s: 4846.03

wprazuch commented 3 months ago

@kshitij12345 jumping in as a substitute for @mpatel31415 - I will verify the error on our side again, and let you know.

mruberry commented 3 months ago

triage review:

wprazuch commented 3 months ago

@mruberry Here is the full traceback of the error:

0: [rank0]: Traceback (most recent call last):
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 639, in <module>
0: [rank0]:     CLI(benchmark_main)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 96, in CLI
0: [rank0]:     return _run_component(components, cfg_init)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 196, in _run_component
0: [rank0]:     return component(**cfg)
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 584, in benchmark_main
0: [rank0]:     benchmark.train()
0: [rank0]:   File "/workspace/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 485, in train
0: [rank0]:     logits = self.model(input_ids)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1558, in _wrapped_call_impl
0: [rank0]:     return self._call_impl(*args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1567, in _call_impl
0: [rank0]:     return forward_call(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/module.py", line 60, in forward
0: [rank0]:     res = self._forward_fn(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/__init__.py", line 658, in fn_
0: [rank0]:     cache_entry, inps, pro_to_epi = get_computation_and_inputs(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/__init__.py", line 217, in cache_info_wrapper
0: [rank0]:     res = fn(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/__init__.py", line 496, in get_computation_and_inputs
0: [rank0]:     jit_results: TraceResults = interpreter(
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/__init__.py", line 205, in _general_frontend
0: [rank0]:     return thunder_general_jit(fn, args, kwargs, sharp_edges=sharp_edges, record_history=record_history)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/jit_ext.py", line 1581, in thunder_general_jit
0: [rank0]:     result = jfn(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6696, in fn_
0: [rank0]:     raise e
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6664, in fn_2
0: [rank0]:     return fn(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6061, in _impl
0: [rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1558, in _wrapped_call_impl
0: [rank0]:     return self._call_impl(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6061, in _impl
0: [rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1567, in _call_impl
0: [rank0]:     return forward_call(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6061, in _impl
0: [rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/litgpt/model.py", line 94, in forward
0: [rank0]:     x = block(x, cos, sin, mask, input_pos)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6061, in _impl
0: [rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1558, in _wrapped_call_impl
0: [rank0]:     return self._call_impl(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6061, in _impl
0: [rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1567, in _call_impl
0: [rank0]:     return forward_call(*args, **kwargs)
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6061, in _impl
0: [rank0]:     return fn.__func__(fn.__self__, *args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py", line 168, in forward
0: [rank0]:     return self.checkpoint_fn(  # type: ignore[misc]
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6173, in partial_call_impl
0: [rank0]:     return partial_function.func(*(partial_function.args + args), **(partial_function.keywords | kwargs))
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 30, in inner
0: [rank0]:     return disable_fn(*args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 599, in _fn
0: [rank0]:     return fn(*args, **kwargs)
0: [rank0]: NotImplementedError: requires_grad=True is not yet supported within thunder.compile

We are executing only through benchmark_litgpt.py's native compile function, which uses thunder.jit. I think the NotImplementedError message is just outdated and confusingly mentions thunder.compile instead of thunder.jit.
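
For reference, a minimal sketch of the entry point the benchmark goes through (the toy module below is hypothetical; benchmark_litgpt.py wraps the litgpt GPT model the same way via thunder.jit):

import torch
import torch.nn as nn
import thunder

# Hypothetical stand-in model; the benchmark uses litgpt's GPT instead.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))

jit_model = thunder.jit(model)       # the benchmark compiles via thunder.jit, not thunder.compile
out = jit_model(torch.randn(4, 16))  # forward pass runs through the Thunder-compiled module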

@kshitij12345 I am trying to reproduce the errors manually, but I now receive the same OOM errors as you, which is very strange. I have no explanation for why the errors differ from what we got inside our pipelines, as we use the same container to reproduce.

Maybe the full traceback will provide some value for you, but as of now we cannot reproduce the error, so from my side this can be closed. If we find this error again in the next round of regressions, which will happen next week, we will report it again; in that case we will make sure to reproduce it manually first to avoid needless work on your side. All the best, Voytec

wprazuch commented 3 months ago

@kshitij12345 with the new iteration cycle (pjnl-20240621), I was able to reproduce the issue:

torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name vicuna-7b-v1.5-16k --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2 

kshitij12345 commented 3 months ago

@wprazuch I tried to repro again with the new image (pjnl-20240621), but I see an OOM when I run the above command.

An error occurred: OutOfMemoryError – CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 72.00 MiB is free. Process 3354053 has 79.02 GiB memory in use. Of the allocated memory 77.42 GiB is allocated by PyTorch, and 182.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Also, looking a bit deeper at the stack trace from the above comment, I see some usage of checkpointing (which IIRC is not supported in Thunder).

AFAIK, benchmark_litgpt.py doesn't enable that, so I wonder how it is showing up in the stack trace:

0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py", line 168, in forward
0: [rank0]:     return self.checkpoint_fn(  # type: ignore[misc]
0: [rank0]:   File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6173, in partial_call_impl
0: [rank0]:     return partial_function.func(*(partial_function.args + args), **(partial_function.keywords | kwargs))
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 30, in inner
0: [rank0]:     return disable_fn(*args, **kwargs)
0: [rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 599, in _fn
0: [rank0]:     return fn(*args, **kwargs)
0: [rank0]: NotImplementedError: requires_grad=True is not yet supported within thunder.compile
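
For context, apply_activation_checkpointing wraps matching submodules in place, so their forward calls are routed through CheckpointWrapper.forward in checkpoint_wrapper.py, which is how the frames above end up in the call stack. A minimal sketch of that wrapping (the toy model is hypothetical, not the benchmark's code):

import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

# Hypothetical stand-in for the transformer blocks.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# Wraps every nn.Linear in place with a checkpoint wrapper, so each wrapped
# forward() then goes through checkpoint_wrapper.py, as seen in the traceback.
apply_activation_checkpointing(model, check_fn=lambda m: isinstance(m, nn.Linear))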

wprazuch commented 2 months ago

@kshitij12345 Finally we found the source of the discrepancy...

The proper command is:

torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name vicuna-7b-v1.5-16k --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2 --checkpoint_activations True

--checkpoint_activations True needs to be set in order to reproduce the error. This is quite embarrassing, because our reproduction problems came from human error: the necessary parameter was omitted.

We made the necessary improvements in our pipeline to make sure such events won't happen in the future. Starting now, the reproduction commands we report in the next iterations should be fully correct.

All the best, Voytec

wprazuch commented 2 months ago

Seems like it actually happens in any case where we use Thunder & activation checkpointing 🤔

kshitij12345 commented 2 months ago

Activation checkpointing not being supported in Thunder is a known limitation:

https://github.com/Lightning-AI/lightning-thunder/blob/a3e432f7174019b2eda85865890d5f7342a993c2/thunder/benchmarks/benchmark_litgpt.py#L214-L217

EDIT - I think we should do one of two things. Either fail loudly here, so that it is clear to the user that this is currently an invalid configuration (and we don't hit other errors because of it); I see that it used to be an error and was updated to a warning in https://github.com/Lightning-AI/lightning-thunder/pull/559 to be able to find these errors. Or apply the patch below, so that we actually disable activation checkpointing for Thunder even if it was passed:

if self.checkpoint_activations and "thunder" in self.compile:
    warnings.warn(
        "Activations checkpointing is configured, but Thunder does not support checkpointing. "
        "Checkpointing will be ignored."
    )
    self.checkpoint_activations = False
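
For completeness, the "fail loudly" alternative (the first option above) would look roughly like this; a sketch only, with an illustrative error message:

if self.checkpoint_activations and "thunder" in self.compile:
    raise RuntimeError(
        "Activation checkpointing is not supported with Thunder; "
        "rerun with --checkpoint_activations False."
    )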

@IvanYashchuk, as of now, do we have a plan to tackle activation checkpointing with thunder in the nearby future?

IvanYashchuk commented 2 months ago

No, there's no concrete plan yet.

Great find, Kshiteej and Wojciech! Yes, apply_activation_checkpointing shouldn't be used on an nn.Module that is passed to thunder.jit. Your patch should be added to the script. In general, the PyTorch nn.Module that Thunder is asked to accelerate should be a plain one without any wrappers.
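
A minimal sketch of the recommended setup under that guidance (the module below is a hypothetical stand-in):

import torch.nn as nn
import thunder

# Plain nn.Module: no apply_activation_checkpointing or other wrappers applied to it.
model = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 2))

# Hand the unwrapped module directly to thunder.jit.
jit_model = thunder.jit(model)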

kshitij12345 commented 2 months ago

Fixed in #769