Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

Different shapes, values of model weights and losses between FSDP training in Eager mode and with Thunder #866

Open mpatel31415 opened 2 months ago

mpatel31415 commented 2 months ago

🐛 Bug

After training Llama-3-8b on 8 A100s for 10 iterations, I printed the model weights with:

# added at line 584 of thunder/benchmarks/benchmark_litgpt.py
torch_dist.barrier()
weights_after_training = benchmark.model.lm_head.weight[:10].data.to(device="cpu", dtype=torch.float32).numpy()
if global_rank in [0, None]:
    print(f"WEIGHTS:\n{weights_after_training.shape}\n{weights_after_training}")

When not using Thunder I got:

WEIGHTS: (10,)
[ 0.01855469 0.00598145 0.01312256 0.01300049 0.00262451 0.0055542 -0.01104736 0.00076294 0.01202393 -0.00909424]

When using Thunder I got:

WEIGHTS: (10, 4096)
[[ 0.01281738 0.00582886 0.01342773 ... -0.01196289 -0.00369263 -0.01287842]
 [-0.00331116 0.01647949 -0.01452637 ... -0.01696777 -0.00650024 -0.00145721]
 [ 0.01867676 0.00334167 0.00133514 ... -0.00531006 -0.00744629 0.01147461]
 ...
 [-0.01019287 -0.00939941 0.00204468 ... 0.01184082 0.00201416 -0.01104736]
 [-0.00643921 0.00318909 0.01623535 ... -0.00148773 0.01153564 -0.01086426]
 [-0.00921631 -0.01452637 0.01586914 ... -0.01330566 0.00445557 0.00692749]]

So both the shape and the values differ. I checked that executing the training script multiple times gives consistent results, so it's not a problem with randomness.
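
For reference, a minimal sketch of how that consistency check can be made precise (the helper is hypothetical; benchmark.model and the surrounding script context are assumed, as in the snippet above):

import hashlib

import torch

def weight_digest(t: torch.Tensor) -> str:
    # copy to CPU float32 and hash the raw bytes, so two runs can be
    # compared at a glance from their log files
    data = t.detach().to(device="cpu", dtype=torch.float32).numpy()
    return hashlib.sha256(data.tobytes()).hexdigest()[:16]

# identical digests across runs => the mismatch is not randomness
print(weight_digest(benchmark.model.lm_head.weight[:10]))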

To Reproduce

  1. Start a container by running:

    docker run --pull=always --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it INTERNAL_IMAGE:pjnl-20240724
  2. Then, inside the container, create the training script from the file attached to this issue, or add the 4 lines of code from the bug description at line 584 of lightning-thunder/thunder/benchmarks/benchmark_litgpt.py.

  3. Assuming the newly created script is called benchmark_litgpt.py, run:

    • For Eager
      torchrun --standalone --max-restarts=0 --nproc-per-node=8  benchmark_litgpt.py  --model_name Llama-3-8B --max_iters 10 --warmup_iters 2 --distributed_mode fsdp --shard_mode zero3 --bucketing_mode block &> file_eager_1.txt
    • For Thunder
      torchrun --standalone --max-restarts=0 --nproc-per-node=8  benchmark_litgpt.py  --model_name Llama-3-8B --max_iters 10 --warmup_iters 2 --distributed_mode fsdp --shard_mode zero3 --bucketing_mode block --compile thunder &> file_thunder_1.txt

The results will be written to file_eager_1.txt and file_thunder_1.txt.

Expected behavior

The shapes and values of the model weights should be the same in Eager mode and with Thunder.

Environment

Output from nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   30C    P0             60W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:0F:00.0 Off |                    0 |
| N/A   29C    P0             58W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:47:00.0 Off |                    0 |
| N/A   29C    P0             58W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:4E:00.0 Off |                    0 |
| N/A   31C    P0             62W /  400W |    2757MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:87:00.0 Off |                    0 |
| N/A   33C    P0             59W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:90:00.0 Off |                    0 |
| N/A   33C    P0             62W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:B7:00.0 Off |                    0 |
| N/A   33C    P0             62W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0             60W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Python packages:

lightning 2.3.3
lightning-thunder 0.2.0.dev0
lightning-utilities 0.11.6
litgpt 0.4.5
nvfuser 0.2.8+gitfa2bedc
nvidia-cudnn-frontend 1.5.2
nvidia-pyprof 3.11.0
pytorch-lightning 2.3.3
torch 2.5.0a0+git16a2a1a
torchmetrics 1.4.0.post0
torchvision 0.19.0a0+d23a6e1

Additional context

(This is a .py file, but to attach it here I had to change the extension.) benchmark_litgpt.txt

cc @carmocca @crcrpar

t-vi commented 2 months ago

This is expected, as we leave the provided tensors alone (but we will change the recommended init scheme). Please use benchmark.model.get_parameter('lm_head.weight')[:10] to get the sharded weights.
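
For illustration, a minimal sketch of the two access paths (assuming model is the litgpt module passed to thunder.jit; the shapes in the comments reflect what this issue reports, not guaranteed behavior):

import thunder

tmodel = thunder.jit(model)  # a thunder.core.module.ThunderModule

# attribute access returns the tensor provided at construction, which
# Thunder leaves alone (full rows, e.g. (10, 4096) in this report)
print(tmodel.lm_head.weight[:10].shape)

# get_parameter() resolves the parameter Thunder actually uses, i.e. the
# sharded weights, per the recommendation above
print(tmodel.get_parameter('lm_head.weight')[:10].shape)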

mpatel31415 commented 2 months ago

I tested it, and benchmark.model.get_parameter('lm_head.weight')[:10] still gives shape [10, 4096] for Thunder and [10] for Eager. Also, is it expected that the parameter values differ between Thunder and Eager?

t-vi commented 2 months ago

Is model the original model or the thunder module?

mpatel31415 commented 2 months ago

In the case of Thunder it's the Thunder module, thunder.core.module.ThunderModule. In the case of Eager it's the original module.
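
A quick way to confirm which object is in hand (a sketch; benchmark.model as in the script):

from thunder.core.module import ThunderModule

print(type(benchmark.model))
# True when the model was compiled with --compile thunder, False in Eager
print(isinstance(benchmark.model, ThunderModule))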

mpatel31415 commented 2 months ago

The value of the loss also differs between Thunder and Eager.

Eager:

iter 0: loss 11.9375, iter time: 6618.87ms, t: 8192
iter 1: loss 9.8750, iter time: 1466.43ms, t: 8192
iter 2: loss 5.9375, iter time: 1097.02ms, t: 8192
iter 3: loss 4.8125, iter time: 1096.80ms, t: 8192
iter 4: loss 4.6875, iter time: 1093.69ms, t: 8192
iter 5: loss 4.6875, iter time: 1098.55ms, t: 8192
iter 6: loss 4.6562, iter time: 1096.84ms, t: 8192
iter 7: loss 4.6250, iter time: 1098.75ms, t: 8192
iter 8: loss 4.6562, iter time: 1186.29ms, t: 8192
iter 9: loss 4.6562, iter time: 1106.99ms, t: 8192

Thunder:

iter 0: loss 11.8750, iter time: 73451.35ms, t: 8192
iter 1: loss 9.5000, iter time: 993.84ms, t: 8192
iter 2: loss 5.7188, iter time: 1012.46ms, t: 8192
iter 3: loss 4.8125, iter time: 1013.92ms, t: 8192
iter 4: loss 4.6875, iter time: 1003.81ms, t: 8192
iter 5: loss 4.7188, iter time: 1015.76ms, t: 8192
iter 6: loss 4.6875, iter time: 1002.12ms, t: 8192
iter 7: loss 4.6562, iter time: 999.51ms, t: 8192
iter 8: loss 4.6562, iter time: 1006.47ms, t: 8192
iter 9: loss 4.6562, iter time: 1012.25ms, t: 8192

When I trained on some real data, the difference was much larger.

Eager:

12:30:08 | Iteration 0: loss 11.9375, time: 5652.83ms
12:30:09 | Iteration 1: loss 11.5000, time: 1446.79ms
12:30:10 | Iteration 2: loss 10.8125, time: 1078.02ms
12:30:11 | Iteration 3: loss 9.0625, time: 1082.69ms
12:30:12 | Iteration 4: loss 8.7500, time: 1084.27ms
12:30:13 | Iteration 5: loss 8.3125, time: 1083.96ms
12:30:14 | Iteration 6: loss 8.3750, time: 1080.23ms
12:30:15 | Iteration 7: loss 8.0625, time: 1081.94ms
12:30:17 | Iteration 8: loss 8.8750, time: 1079.18ms
12:30:18 | Iteration 9: loss 8.3750, time: 1081.14ms
12:30:19 | Iteration 10: loss 7.8438, time: 1084.09ms

Thunder:

12:43:51 | Iteration 0: loss 1.0312, time: 74820.20ms
12:43:52 | Iteration 1: loss 13.8750, time: 910.84ms
12:43:52 | Iteration 2: loss 27.2500, time: 920.43ms
12:43:53 | Iteration 3: loss 10.8125, time: 926.16ms
12:43:54 | Iteration 4: loss 9.3125, time: 929.09ms
12:43:55 | Iteration 5: loss 8.2500, time: 921.80ms
12:43:56 | Iteration 6: loss 7.7500, time: 920.07ms
12:43:57 | Iteration 7: loss 7.6562, time: 921.99ms
12:43:58 | Iteration 8: loss 8.3750, time: 923.90ms
12:43:59 | Iteration 9: loss 8.1250, time: 921.77ms
12:44:00 | Iteration 10: loss 7.7188, time: 928.60ms

When I change the seed for Eager mode I get different initial loss values; however, they still oscillate around 10, and the variability is not as large as for Thunder (loss = 1 or 27).
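
One way to narrow this down is to compare a single forward pass between Eager and Thunder on identical inputs; a debugging sketch, assuming model is the uncompiled litgpt module and batch is one input batch (neither name comes from the script):

import torch
import thunder

tmodel = thunder.jit(model)

with torch.no_grad():
    ref = model(batch)   # eager reference logits
    out = tmodel(batch)  # thunder logits for the same input

# if this already fails, the loss gap starts in the forward pass rather
# than in the optimizer or in FSDP bucketing
torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)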

mruberry commented 2 months ago

fyi @IvanYashchuk

IvanYashchuk commented 2 months ago

I will look into this.