Closed mjmikulski closed 3 days ago
I couldn't reproduce this error on an H100 80GB; instead I got an OOM (I removed the --save_logs_for_all_batches True flag).
container: pjnl-20240801
lightning-thunder 0.2.0.dev0 /opt/pytorch/lightning-thunder
nvfuser 0.2.8+git671171f /opt/pytorch/nvfuser
root@803c226ee238:/opt/pytorch/lightning-thunder# torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --distributed_mode fsdp --shard_mode zero2 --compile thunder_cudnn --checkpoint_activations False --low_precision_mode fp8-delayed-te --micro_batch_size 1
W0808 13:28:05.563000 931 torch/distributed/run.py:793]
W0808 13:28:05.563000 931 torch/distributed/run.py:793] *****************************************
W0808 13:28:05.563000 931 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0808 13:28:05.563000 931 torch/distributed/run.py:793] *****************************************
Loading model with {'name': 'Gemma-7b', 'hf_config': {'org': 'google', 'name': 'gemma-7b'}, 'scale_embeddings': True, 'block_size': 4096, 'vocab_size': 256000, 'padding_multiple': 64, 'padded_vocab_size': 256000, 'n_layer': 28, 'n_head': 16, 'head_size': 256, 'n_embd': 3072, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 16, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'GemmaMLP', 'gelu_approximate': 'tanh', 'intermediate_size': 24576, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 256}
Time to instantiate model: 0.02 seconds.
...
[rank6]: Traceback (most recent call last):
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 628, in <module>
[rank6]: CLI(benchmark_main)
[rank6]: File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 96, in CLI
[rank6]: return _run_component(components, init)
[rank6]: File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank6]: return component(**cfg)
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 583, in benchmark_main
[rank6]: benchmark.train()
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 495, in train
[rank6]: loss.backward()
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 522, in backward
[rank6]: torch.autograd.backward(
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 346, in backward
[rank6]: _engine_run_backward(
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 812, in _engine_run_backward
[rank6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 6 has a total capacity of 79.10 GiB of which 1.03 GiB is free. Process 2592558 has 78.06 GiB memory in use. Of the allocated memory 75.47 GiB is allocated by PyTorch, and 674.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
Thank you for the answer. How many GPUs did you use? AFAIK this was executed on a single node with 8 GPUs (H100).
Yes, I used 1 node with 8 H100(80G). Has anyone else tried it to see if it's reproducible?
We see this error in recent runs as well. I'm able to reproduce it on 8xNVIDIA H100 80GB HBM3. Maybe you could try adding the --n_layers 1 flag to reduce memory usage?
Here is the command I used, with the image from 20240814:
torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --distributed_mode fsdp --shard_mode zero2 --compile thunder_cudnn --checkpoint_activations False --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1
With n_layers 1
Can this be reproduced with 1 GPU? It's probably easier to find 1 H100 than a whole node with 8.
Yes, the same error is present when running:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --compile thunder_cudnn --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1
@kiya00, could you please take a look at this problem and tell us what's needed for a fix?
Per triage meeting on 8/26, moved to @t-vi and assigned priority as P2.
It can be reproduced in container pjnl-20240830-mixology_70d843cd but not in pjnl-20240830.
A minimal reproducer is:
import torch, thunder

def fun(x):
    # The scale is a 0-dim tensor created on the CPU, while x lives on the GPU
    x = x * torch.tensor(0.5, dtype=x.dtype)
    return x

x = torch.randn((2,2),dtype=torch.bfloat16).cuda()
# print(fun(x))
jfun=thunder.jit(fun)
jfun(x)
Torch can run cuda tensor * cpu scalar tensor, but Thunder can't.
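To illustrate the eager-mode half of that statement, here is a minimal sketch that uses only PyTorch (Thunder is deliberately left out, so this does not reproduce the failure itself). It assumes a default PyTorch install and falls back to the CPU when no GPU is visible; in that case the cross-device mix is not actually exercised.

```python
import torch

torch.manual_seed(0)
# Use a GPU when one is present; otherwise fall back to CPU so the sketch still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn((2, 2), dtype=torch.bfloat16, device=device)
# torch.tensor without a device argument creates the tensor on the CPU by default.
scale = torch.tensor(0.5, dtype=x.dtype)  # 0-dim ("scalar") tensor

# Eager PyTorch accepts a 0-dim CPU tensor as an operand next to a GPU tensor;
# the result stays on x's device.
y = x * scale
print(y.device, y.dtype)
```

The result inherits the device and dtype of `x`, which is why the LitGPT line works in eager mode even though the scale tensor never leaves the CPU.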
The problem can also be fixed by modifying the LitGPT code here https://github.com/Lightning-AI/litgpt/blob/1d37f9a99bb4ba2b7373bc7fc5b8c5a457af48df/litgpt/model.py#L95
+ x = x * torch.tensor(self.config.n_embd**0.5, dtype=x.dtype, device=x.device)
- x = x * torch.tensor(self.config.n_embd**0.5, dtype=x.dtype)
Two months ago this line used a Python scalar, which is why it was working:
x = x * (self.config.n_embd**0.5)
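As a rough sketch of the workaround discussed above, the scale can be created directly on the input's device so that no CPU tensor enters the computation. The helper name `scale_embeddings` is hypothetical and not part of LitGPT; the width 3072 is the `n_embd` value from the Gemma-7b config printed earlier in the thread. float32 is used here so the comparison against the Python-scalar form is not affected by bfloat16 rounding.

```python
import torch


def scale_embeddings(x: torch.Tensor, n_embd: int) -> torch.Tensor:
    """Hypothetical helper: multiply x by sqrt(n_embd) using a scale on x's device."""
    scale = torch.tensor(n_embd**0.5, dtype=x.dtype, device=x.device)
    return x * scale


torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(2, 3072, device=device)  # float32 activations
out = scale_embeddings(x, 3072)

# The older Python-scalar form mentioned above, for comparison.
reference = x * (3072**0.5)
print(torch.allclose(out, reference))
```

Either form avoids putting a CPU tensor into the traced function; the Python-scalar variant is the shorter of the two.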
🐛 Bug
When using benchmark_litgpt, the following error occurs:
To Reproduce
Start interactive job on a cluster:
Then execute:
Expected Behavior
When running the benchmark for the Gemma-7b model with the specified configurations, the benchmarking scripts should successfully execute without any errors.
Environment
Additional context
Comment: seems like a thunder bug somewhere; something is somehow ending up on the CPU even though it shouldn’t be.