DLYuanGod / TinyGPT-V

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
BSD 3-Clause "New" or "Revised" License

Attempting to unscale FP16 gradients #17

Closed · sunzhe09 closed this issue 9 months ago

sunzhe09 commented 9 months ago

Describe the bug
Stage-3 training crashes during `scaler.step(optimizer)` with `ValueError: Attempting to unscale FP16 gradients.` raised from `torch.cuda.amp.GradScaler`.

To Reproduce
Steps to reproduce the behavior:
1. Uncomment the code in base_model.py
2. Run `torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/tinygptv_stage3.yaml`

Expected behavior
Training completes successfully.

llizhaoxu commented 9 months ago

Hi, thanks for your interest. Have you removed the comments here: https://github.com/DLYuanGod/TinyGPT-V/blob/12b036a34090fb1f06d81f25c388b13db4c21fe3/README.md?plain=1#L122? Could you please provide more detail about the bug?

sunzhe09 commented 9 months ago

Yes, I have removed the comments. Log below:

```
module.llama_model.base_model.model.model.layers.31.input_layernorm.weight
module.llama_model.base_model.model.model.layers.31.post_layernorm.weight
module.llama_model.base_model.model.model.final_layernorm.weight
module.llama_proj.weight
module.llama_proj.bias
module.llama_proj2.weight
module.llama_proj2.bias
2024-01-08 08:08:56,852 [INFO] number of trainable parameters: 45266944
2024-01-08 08:08:56,854 [INFO] Start training epoch 0, 200 iters per inner epoch.
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "/home/notebook/code/personal/80239864/TinyGPT-V/train.py", line 104, in <module>
    main()
  File "/home/notebook/code/personal/80239864/TinyGPT-V/train.py", line 100, in main
    runner.train()
  File "/home/notebook/code/personal/80239864/TinyGPT-V/minigpt4/runners/runner_base.py", line 377, in train
    train_stats = self.train_epoch(cur_epoch)
  File "/home/notebook/code/personal/80239864/TinyGPT-V/minigpt4/runners/runner_base.py", line 437, in train_epoch
    return self.task.train_epoch(
  File "/home/notebook/code/personal/80239864/TinyGPT-V/minigpt4/tasks/base_task.py", line 116, in train_epoch
    return self._train_inner_loop(
  File "/home/notebook/code/personal/80239864/TinyGPT-V/minigpt4/tasks/base_task.py", line 232, in _train_inner_loop
    scaler.step(optimizer)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 370, in step
    self.unscale_(optimizer)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
```
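For context: `torch.cuda.amp.GradScaler` raises this error whenever the gradients it is asked to unscale are already FP16, which happens when the trainable parameters themselves were cast to half precision. A minimal standalone sketch (not TinyGPT-V code, requires a CUDA device) that triggers the same ValueError:

```python
# GradScaler cannot unscale gradients that are already FP16, which happens
# when trainable parameters themselves are cast to half precision.
import torch

model = torch.nn.Linear(4, 4).cuda().half()            # FP16 params -> FP16 grads
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(2, 4, device="cuda")).sum()

scaler.scale(loss).backward()
scaler.step(optimizer)   # ValueError: Attempting to unscale FP16 gradients.
```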

sunzhe09 commented 9 months ago

Can you reproduce the error? I tested it on 2 V100 GPUs.

llizhaoxu commented 9 months ago

Hi

I didn't encounter these errors. Did you use our environment.yml to build the env?

sunzhe09 commented 9 months ago

No. When I created the env I got a cmake error, so I just installed the packages with pip.

llizhaoxu commented 9 months ago

Hi

There is a "pip" section in environment.yml: you can create an env with python==3.9 and pip install the right versions from it. These errors may be caused by a dependency mismatch.
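Independent of rebuilding the environment, a common workaround for this specific error (a sketch of a general pattern, not the repo's official fix) is to make sure the parameters handed to the optimizer stay FP32, so GradScaler can unscale their gradients:

```python
# Sketch of a common workaround (an assumption, not TinyGPT-V's official fix):
# upcast any trainable FP16 parameters to FP32 before building the optimizer,
# and let torch.cuda.amp.autocast handle the half-precision compute instead.
import torch

def upcast_trainable_params(model: torch.nn.Module) -> None:
    for param in model.parameters():
        if param.requires_grad and param.dtype == torch.float16:
            param.data = param.data.float()  # FP32 master weights for GradScaler
```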

sunzhe09 commented 9 months ago

After recreating the environment, the error is gone. Thank you.