Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Devices were expected to be the same, but got ... when Benchmarking Gemma-7b with Micro Batch Size 1 #941

Closed. mjmikulski closed this issue 3 days ago.

mjmikulski commented 1 month ago

🐛 Bug

When using benchmark_litgpt, the following error occurs:

Exception: Unexpected error occurred for {'--micro_batch_size': 1} due to [W803 20:39:12.165884696 socket.cpp:752] [c10d] The client socket has failed to connect to [eos0157.eos.clusters.nvidia.com]:59504 (errno: 22 - Invalid argument).
[rank4]: RuntimeError: Devices were expected to be the same, but got devices thunder.devices.Device(type='cuda', index=4) and thunder.devices.Device(type='cpu')!

To Reproduce

Start interactive job on a cluster:

srun -A YOUR_SLURM_ACCOUNT -J YOUR_SLURM_ACCOUNT-thunder.lit-gpt -N1 -p batch --container-image=INTERNAL_IMAGE:pjnl-20240801 --pty bash

Then execute:

torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name Gemma-7b \
    --distributed_mode fsdp \
    --shard_mode zero2 \
    --compile thunder_cudnn \
    --checkpoint_activations False \
    --low_precision_mode fp8-delayed-te \
    --save_logs_for_all_batches True \
    --micro_batch_size 1

Expected Behavior

When running the benchmark for the Gemma-7b model with the specified configuration, the benchmarking script should execute successfully without errors.

Environment

system.device_product_name                      DGXH100
system.gpu_driver_version                    535.129.03
libraries.cuda                               12.6.0.021
libraries.pip.lightning               2.4.0.dev20240728
libraries.pip.lightning-thunder              0.2.0.dev0
libraries.pip.lightning-utilities                0.11.6
libraries.pip.litgpt                              0.4.7
libraries.pip.nvfuser                  0.2.8+git671171f
libraries.pip.pytorch-lightning                   2.3.3
libraries.pip.torch                  2.5.0a0+gita94e507
libraries.pip.torchmetrics                  1.4.0.post0
libraries.pip.torchvision              0.19.0a0+d23a6e1

Additional context

Comment: this looks like a Thunder bug somewhere; some tensor is ending up on the CPU even though it shouldn't.
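As a quick sanity check (not part of the original report), a small helper like the hypothetical sketch below, built only on standard PyTorch APIs, can be run before training to see whether any module parameter or buffer is still on the CPU:

import torch

def report_cpu_tensors(model: torch.nn.Module) -> None:
    # Hypothetical debugging helper: print any parameters or buffers
    # that are not on a CUDA device before training starts.
    for name, param in model.named_parameters():
        if param.device.type != "cuda":
            print(f"parameter {name} is on {param.device}")
    for name, buf in model.named_buffers():
        if buf.device.type != "cuda":
            print(f"buffer {name} is on {buf.device}")

Note that this only inspects module state; a CPU tensor created inside the forward pass, which is what later comments in this thread point to, would not show up here.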

kiya00 commented 1 month ago

I couldn't reproduce this error on H100 80GB; instead I got an OOM (I removed --save_logs_for_all_batches True).

container: pjnl-20240801
lightning-thunder      0.2.0.dev0          /opt/pytorch/lightning-thunder
nvfuser                0.2.8+git671171f    /opt/pytorch/nvfuser
root@803c226ee238:/opt/pytorch/lightning-thunder# torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py     --model_name Gemma-7b     --distributed_mode fsdp     --shard_mode zero2     --compile thunder_cudnn     --checkpoint_activations False     --low_precision_mode fp8-delayed-te     --micro_batch_size 1
W0808 13:28:05.563000 931 torch/distributed/run.py:793]
W0808 13:28:05.563000 931 torch/distributed/run.py:793] *****************************************
W0808 13:28:05.563000 931 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0808 13:28:05.563000 931 torch/distributed/run.py:793] *****************************************
Loading model with {'name': 'Gemma-7b', 'hf_config': {'org': 'google', 'name': 'gemma-7b'}, 'scale_embeddings': True, 'block_size': 4096, 'vocab_size': 256000, 'padding_multiple': 64, 'padded_vocab_size': 256000, 'n_layer': 28, 'n_head': 16, 'head_size': 256, 'n_embd': 3072, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 16, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'GemmaMLP', 'gelu_approximate': 'tanh', 'intermediate_size': 24576, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 256}
Time to instantiate model: 0.02 seconds.
...
[rank6]: Traceback (most recent call last):
[rank6]:   File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 628, in <module>
[rank6]:     CLI(benchmark_main)
[rank6]:   File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 96, in CLI
[rank6]:     return _run_component(components, init)
[rank6]:   File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank6]:     return component(**cfg)
[rank6]:   File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 583, in benchmark_main
[rank6]:     benchmark.train()
[rank6]:   File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 495, in train
[rank6]:     loss.backward()
[rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 522, in backward
[rank6]:     torch.autograd.backward(
[rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 346, in backward
[rank6]:     _engine_run_backward(
[rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 812, in _engine_run_backward
[rank6]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 6 has a total capacity of 79.10 GiB of which 1.03 GiB is free. Process 2592558 has 78.06 GiB memory in use. Of the allocated memory 75.47 GiB is allocated by PyTorch, and 674.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
mjmikulski commented 1 month ago

Thank you for the answer. How many GPUs did you use? AFAIK this was executed on a single node with 8 GPUs (H100).

kiya00 commented 1 month ago

Yes, I used 1 node with 8 H100 (80 GB) GPUs. Has anyone else tried it to see if it's reproducible?

mpatel31415 commented 3 weeks ago

We see this error in recent runs as well. I'm able to reproduce it on 8x NVIDIA H100 80GB HBM3. Maybe you could try adding the --n_layers 1 flag to reduce memory usage? Here is the command I used with the image from 20240814:

torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py     --model_name Gemma-7b     --distributed_mode fsdp     --shard_mode zero2     --compile thunder_cudnn     --checkpoint_activations False     --low_precision_mode fp8-delayed-te     --micro_batch_size 1 --n_layers 1
tfogal commented 3 weeks ago

I'm able to reproduce it on 8xNVIDIA H100 80GB HBM3. Maybe you could try to add --n_layers 1 flag to reduce memory usage?

With --n_layers 1, can this be reproduced with 1 GPU? It's probably easier to find 1 H100 than a whole node with 8.

mpatel31415 commented 3 weeks ago

Yes, the same error is present when running:

python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --compile thunder_cudnn --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1

IvanYashchuk commented 2 weeks ago

@kiya00, could you please take a look at this problem and tell us what's needed for a fix?

nvMelissa commented 2 weeks ago

Per triage meeting on 8/26, moved to @t-vi and assigned priority as P2.

kiya00 commented 2 weeks ago

It can be reproduced in the container pjnl-20240830-mixology_70d843cd but not in pjnl-20240830.

kiya00 commented 1 week ago

A minimal reproduction is:

import torch
import thunder

def fun(x):
    # torch.tensor(0.5, ...) creates a 0-dim scalar tensor on the CPU,
    # which is then multiplied with the CUDA input tensor.
    x = x * torch.tensor(0.5, dtype=x.dtype)
    return x

x = torch.randn((2, 2), dtype=torch.bfloat16).cuda()
# print(fun(x))  # eager PyTorch runs this without error
jfun = thunder.jit(fun)
jfun(x)  # raises: Devices were expected to be the same, but got ...

PyTorch can multiply a CUDA tensor by a CPU scalar tensor, but Thunder can't.
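For comparison, here is a minimal eager-PyTorch sketch (no Thunder involved) of the same pattern; it works because eager PyTorch treats a 0-dim CPU tensor like a scalar in cross-device arithmetic and keeps the result on the CUDA device:

import torch

x = torch.randn((2, 2), dtype=torch.bfloat16, device="cuda")
scale = torch.tensor(0.5, dtype=x.dtype)  # 0-dim tensor, lives on the CPU

# Eager PyTorch allows mixing a CUDA tensor with a 0-dim CPU tensor here,
# so the multiply succeeds and the result stays on CUDA.
y = x * scale
print(y.device)  # cuda:0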

IvanYashchuk commented 1 week ago

The problem can also be fixed by modifying the LitGPT code here: https://github.com/Lightning-AI/litgpt/blob/1d37f9a99bb4ba2b7373bc7fc5b8c5a457af48df/litgpt/model.py#L95

- x = x * torch.tensor(self.config.n_embd**0.5, dtype=x.dtype)
+ x = x * torch.tensor(self.config.n_embd**0.5, dtype=x.dtype, device=x.device)

Two months ago this line used a Python scalar, which is why it worked:

x = x * (self.config.n_embd**0.5)
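To make the two fixes concrete, here is a minimal sketch (not the actual LitGPT source; the function names are placeholders) of the embedding-scaling step, once with the scalar tensor placed on the input's device and once with a plain Python scalar as in the older code:

import torch

def scale_embeddings(x: torch.Tensor, n_embd: int) -> torch.Tensor:
    # Option 1: materialize the scale as a tensor with the same device and
    # dtype as x, so no CPU/CUDA mismatch can arise in the Thunder trace.
    scale = torch.tensor(n_embd**0.5, dtype=x.dtype, device=x.device)
    return x * scale

def scale_embeddings_scalar(x: torch.Tensor, n_embd: int) -> torch.Tensor:
    # Option 2: use a plain Python scalar, as the older LitGPT code did;
    # no CPU tensor is created, so the device question never arises.
    return x * (n_embd**0.5)

Either variant avoids the "Devices were expected to be the same" error.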