Closed: JerryDaHeLian closed this issue 10 months ago.

When I pretrained a large language model on Tesla V100S-PCIE-32GB GPUs with:

    lightning run model --node-rank=0 --main-address=10.142.6.35 --accelerator=cuda --devices="0,1,2,3,6,7" --num-nodes=1 pretrain/tinyllama.py --device_num 6 --train_data_dir data/slim_star_combined --val_data_dir data/slim_star_combined

it failed with:

    RuntimeError: Expected x1.dtype() == cos.dtype() to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

What's going on? Who can help me?
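For context, the failing check asserts that the activations and the cached rotary cos/sin tables share a dtype, which fused rotary-embedding kernels typically require rather than upcasting silently. Below is a minimal pure-PyTorch sketch of how the mismatch arises and the usual cast fix; apply_rope here is an illustrative stand-in, not the repo's actual kernel:

    import torch

    def apply_rope(x1, x2, cos, sin):
        # Mirrors the kernel's dtype check: no implicit upcasting.
        assert x1.dtype == cos.dtype, "Expected x1.dtype() == cos.dtype()"
        return x1 * cos - x2 * sin, x1 * sin + x2 * cos

    x1 = torch.randn(4, 8)                         # float32 activations
    x2 = torch.randn(4, 8)
    cos = torch.randn(4, 8, dtype=torch.bfloat16)  # cos/sin cached in bf16
    sin = torch.randn(4, 8, dtype=torch.bfloat16)

    # apply_rope(x1, x2, cos, sin)   # fails: float32 vs. bfloat16
    out = apply_rope(x1, x2, cos.to(x1.dtype), sin.to(x1.dtype))  # works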
Can you share the complete error stacktrace?
Thank you very much for your reply!
The stacktrace:
File "pretrain/tinyllama.py", line 533, in
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, *kwargs)
File "pretrain/tinyllama.py", line 410, in validate
logits = model(input_ids)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/lightning/fabric/wrappers.py", line 121, in forward
output = self._forward_module(*args, *kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 863, in forward
output = self._fsdp_wrapped_module(*args, *kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, kwargs)
File "/home/xxx/TinyLlama/litgpt/model.py", line 107, in forward
x, * = block(x, (cos, sin), max_seq_length)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(*args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 863, in forward
output = self._fsdp_wrapped_module(*args, *kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
return self._call_impl(args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1528, in _call_impl
return forward_call(*args, kwargs)
File "/home/xxx/TinyLlama/lit_gpt/model.py", line 172, in forward
h, new_kv_cache = self.attn(n_1, rope, max_seq_length, mask, input_pos, kv_cache)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1519, in _wrapped_call_impl
Traceback (most recent call last):
File "pretrain/tinyllama.py", line 533, in
The tinyllama.py originates from: https://github.com/jzhang38/TinyLlama/blob/main/pretrain/tinyllama.py
The development environment includes 8 V100S GPUs (Tesla V100S-PCIE-32GB). Prior to this, I was able to pretrain successfully with 8 A100-40G (NVIDIA A100-PCIE-40GB) cards.
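That hardware difference suggests a plausible cause: V100s (compute capability 7.0) have no native bfloat16 support, while A100s do, so a run configured for bf16 can end up mixing float32 and bfloat16 tensors on V100s. A minimal sketch of a hardware-aware precision pick, assuming the script configures precision through Lightning Fabric (the device count here is illustrative):

    import torch
    from lightning.fabric import Fabric

    # bfloat16 needs compute capability >= 8.0 (Ampere or newer);
    # V100 is sm_70, so fall back to fp16 mixed precision there.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        precision = "bf16-mixed"
    else:
        precision = "16-mixed"

    fabric = Fabric(accelerator="cuda", devices=6, precision=precision)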
Looks like you chose the wrong repository! You should re-open this on https://github.com/jzhang38/TinyLlama