mpatel31415 opened 2 months ago
This is expected, as we leave the provided tensors alone (but we will change the recommended init scheme). Please use `benchmark.model.get_parameter('lm_head.weight')[:10]` to get the sharded weights.
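For illustration, a minimal sketch of that call on a toy module (this is not the benchmark script; it only shows that `get_parameter` can be used the same way on a Thunder-jitted module as on the original `nn.Module`, as the issue itself confirms for `lm_head.weight`):

```python
import torch
import thunder

# Toy stand-in for the benchmark model; the real run inspects lm_head.weight.
model = torch.nn.Linear(8, 4, bias=False)
jmodel = thunder.jit(model)

print(type(jmodel))                        # thunder.core.module.ThunderModule
print(jmodel.get_parameter("weight")[:2])  # first rows of the (possibly sharded) weight
print(model.get_parameter("weight")[:2])   # same call on the original module
```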
I tested it, and `benchmark.model.get_parameter('lm_head.weight')[:10]` still gives shape [10, 4096] for Thunder and [10] for Eager. Also, is it expected that the values of the parameters differ between Thunder and Eager?
Is `model` the original model or the Thunder module?
In the case of Thunder it's the Thunder module (`thunder.core.module.ThunderModule`); in the case of Eager it's the original module.
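Since the two runs expose differently shaped local shards, one way to compare like with like is to gather the full tensor first. A minimal sketch, assuming `torch.distributed` is initialized and each rank holds an even dim-0 shard (the actual sharding layout may differ between the two setups; this helper is not part of benchmark_litgpt.py):

```python
import torch
import torch.distributed as dist

def gathered_lm_head(model) -> torch.Tensor:
    """Concatenate every rank's shard of lm_head.weight into one full tensor.
    Hypothetical helper for comparing eager and Thunder weights rank-by-rank."""
    local = model.get_parameter("lm_head.weight").detach().contiguous()
    shards = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, local)
    return torch.cat(shards, dim=0)
```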
The value of the loss also differs between Thunder and Eager:

Eager:
```
iter 0: loss 11.9375, iter time: 6618.87ms, t: 8192
iter 1: loss 9.8750, iter time: 1466.43ms, t: 8192
iter 2: loss 5.9375, iter time: 1097.02ms, t: 8192
iter 3: loss 4.8125, iter time: 1096.80ms, t: 8192
iter 4: loss 4.6875, iter time: 1093.69ms, t: 8192
iter 5: loss 4.6875, iter time: 1098.55ms, t: 8192
iter 6: loss 4.6562, iter time: 1096.84ms, t: 8192
iter 7: loss 4.6250, iter time: 1098.75ms, t: 8192
iter 8: loss 4.6562, iter time: 1186.29ms, t: 8192
iter 9: loss 4.6562, iter time: 1106.99ms, t: 8192
```
Thunder:
```
iter 0: loss 11.8750, iter time: 73451.35ms, t: 8192
iter 1: loss 9.5000, iter time: 993.84ms, t: 8192
iter 2: loss 5.7188, iter time: 1012.46ms, t: 8192
iter 3: loss 4.8125, iter time: 1013.92ms, t: 8192
iter 4: loss 4.6875, iter time: 1003.81ms, t: 8192
iter 5: loss 4.7188, iter time: 1015.76ms, t: 8192
iter 6: loss 4.6875, iter time: 1002.12ms, t: 8192
iter 7: loss 4.6562, iter time: 999.51ms, t: 8192
iter 8: loss 4.6562, iter time: 1006.47ms, t: 8192
iter 9: loss 4.6562, iter time: 1012.25ms, t: 8192
```
When I did training on some real data, the difference was much larger:

Eager:
```
12:30:08 | Iteration 0: loss 11.9375, time: 5652.83ms
12:30:09 | Iteration 1: loss 11.5000, time: 1446.79ms
12:30:10 | Iteration 2: loss 10.8125, time: 1078.02ms
12:30:11 | Iteration 3: loss 9.0625, time: 1082.69ms
12:30:12 | Iteration 4: loss 8.7500, time: 1084.27ms
12:30:13 | Iteration 5: loss 8.3125, time: 1083.96ms
12:30:14 | Iteration 6: loss 8.3750, time: 1080.23ms
12:30:15 | Iteration 7: loss 8.0625, time: 1081.94ms
12:30:17 | Iteration 8: loss 8.8750, time: 1079.18ms
12:30:18 | Iteration 9: loss 8.3750, time: 1081.14ms
12:30:19 | Iteration 10: loss 7.8438, time: 1084.09ms
```
Thunder:
```
12:43:51 | Iteration 0: loss 1.0312, time: 74820.20ms
12:43:52 | Iteration 1: loss 13.8750, time: 910.84ms
12:43:52 | Iteration 2: loss 27.2500, time: 920.43ms
12:43:53 | Iteration 3: loss 10.8125, time: 926.16ms
12:43:54 | Iteration 4: loss 9.3125, time: 929.09ms
12:43:55 | Iteration 5: loss 8.2500, time: 921.80ms
12:43:56 | Iteration 6: loss 7.7500, time: 920.07ms
12:43:57 | Iteration 7: loss 7.6562, time: 921.99ms
12:43:58 | Iteration 8: loss 8.3750, time: 923.90ms
12:43:59 | Iteration 9: loss 8.1250, time: 921.77ms
12:44:00 | Iteration 10: loss 7.7188, time: 928.60ms
```
When I change the seed for Eager mode I get different initial loss values; however, they still oscillate around 10, and the variability is not nearly as large as for Thunder (loss of 1 or 27).
fyi @IvanYashchuk
I will look into this.
🐛 Bug
After training Llama-3-8b on 8 A100 GPUs for 10 iterations (in eager mode and with Thunder) I printed the model weights:
when not using Thunder I got:
when using Thunder I got:
So the shape and the values are different. I checked that executing the training script multiple times gives consistent results (so it's not a problem with randomness).
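The exact 4 debug lines live in the attached script rather than in this issue text; a hypothetical equivalent (names and placement are assumptions, not the author's exact code) would be:

```python
import torch.distributed as dist

def dump_lm_head(model, n: int = 10) -> None:
    """Print the first n rows of lm_head.weight once (rank 0 only when distributed).
    Hypothetical debug helper mirroring the lines described in this issue."""
    if not dist.is_initialized() or dist.get_rank() == 0:
        w = model.get_parameter("lm_head.weight")
        print(f"lm_head.weight shape: {tuple(w.shape)}")
        print(w[:n])
```

Such a helper would be called on the benchmark's model object around line 584 of benchmark_litgpt.py, as described in the reproduction steps below.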
To Reproduce
Start the container by running:
Later, create in the container the training script from the file linked to this issue, or add the 4 lines of code from the bug description at line 584 of lightning-thunder/thunder/benchmarks/benchmark_litgpt.py. Assuming the newly created script is called benchmark_litgpt.py, call:

The results will be visible in file_eager_1.txt and file_thunder_1.txt.
Expected behavior
Shapes and values of the model weights should be the same.
Environment
Output from nvidia-smi
Python packages:
```
lightning             2.3.3
lightning-thunder     0.2.0.dev0
lightning-utilities   0.11.6
litgpt                0.4.5
nvfuser               0.2.8+gitfa2bedc
nvidia-cudnn-frontend 1.5.2
nvidia-pyprof         3.11.0
pytorch-lightning     2.3.3
torch                 2.5.0a0+git16a2a1a
torchmetrics          1.4.0.post0
torchvision           0.19.0a0+d23a6e1
```
Additional context
(This is a .py file, but to attach it here I had to change the extension.) benchmark_litgpt.txt
cc @carmocca @crcrpar