HKUDS / UrbanGPT

[KDD'2024] "UrbanGPT: Spatio-Temporal Large Language Models"
https://urban-gpt.github.io
Apache License 2.0

Why does training always error out at checkpoint-2400? #21

Open haomengt opened 1 week ago

haomengt commented 1 week ago

Hello, I'd like to ask: why does training, run exactly with your training command, always fail at checkpoint-2400? The error is shown below (two screenshots attached):

```
{'loss': 9.7984, 'grad_norm': 13.973094940185547, 'learning_rate': 0.000741541788969566, 'epoch': 0.03}
  1%|          | 2400/215760 [2:58:49<265:49:35, 4.49s/it]
output_dir /apps/data/models/urbangpt/UrbanGPT/checkpoints/UrbanGPT/checkpoint-2400
up /apps/data/models/urbangpt/UrbanGPT/checkpoints/UrbanGPT/st_projector checkpoint-2400
Traceback (most recent call last):
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/generation/configuration_utils.py", line 771, in save_pretrained
    raise ValueError(str([w.message for w in caught_warnings]))
ValueError: [UserWarning('do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.'), UserWarning('do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.')]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apps/data/models/urbangpt/UrbanGPT/urbangpt/train/train_mem.py", line 30, in <module>
    train()
  File "/apps/data/models/urbangpt/UrbanGPT/urbangpt/train/train_st.py", line 822, in train
    trainer.train()
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 2356, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 2807, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 2886, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 3454, in save_model
    self._save(output_dir)
  File "/apps/data/models/urbangpt/UrbanGPT/urbangpt/train/stchat_trainer.py", line 56, in _save
    super(STChatTrainer, self)._save(output_dir, state_dict)
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/trainer.py", line 3525, in _save
    self.model.save_pretrained(
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2593, in save_pretrained
    model_to_save.generation_config.save_pretrained(save_directory)
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/transformers/generation/configuration_utils.py", line 773, in save_pretrained
    raise ValueError(
ValueError: The generation config instance is invalid -- .validate() throws warnings and/or exceptions. Fix these issues to save the configuration.

Thrown during validation:
[UserWarning('do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.'), UserWarning('do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.')]
```

The rank 0 worker then reports the same pair of exceptions again with rank0: prefixes, after which the launcher tears the job down:

```
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /apps/data/models/urbangpt/UrbanGPT/wandb/offline-run-20240918_231037-16ynzesq
wandb: Find logs at: wandb/offline-run-20240918_231037-16ynzesq/logs
W0919 02:09:32.451000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492236 closing signal SIGTERM
W0919 02:09:32.452000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492237 closing signal SIGTERM
W0919 02:09:32.452000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492238 closing signal SIGTERM
W0919 02:09:32.452000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492239 closing signal SIGTERM
W0919 02:09:32.453000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 492240 closing signal SIGTERM
E0919 02:09:33.696000 140088041554048 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 492235) of binary: /apps/data/conda/envs/urbanGPT/bin/python
Traceback (most recent call last):
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/apps/data/conda/envs/urbanGPT/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
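For context, the failure can be reproduced outside of training: recent transformers releases re-run GenerationConfig.validate() inside save_pretrained() and promote its warnings to a ValueError, so any checkpoint save with these flags aborts. A minimal sketch (the flag values mirror the log above; the output path is a placeholder, and this is not UrbanGPT code):

```python
from transformers import GenerationConfig

# Same inconsistent flags as in the traceback: sampling parameters are set
# while do_sample is False.
cfg = GenerationConfig(do_sample=False, temperature=0.9, top_p=0.6)

# In recent transformers versions, save_pretrained() validates the config
# first and raises ValueError for the two UserWarnings shown above.
cfg.save_pretrained("/tmp/generation_config_demo")  # raises ValueError
```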

LZH-YS1998 commented 3 days ago

Hello, I'm sorry, but I haven't encountered this error myself. Maybe this link can help you: ValueError: The generation config instance is invalid.

"This error appears to be a problem that occurred while upgrading transformers version. I fixed this problem by manually adding do_sample: true in vicuna's generation_config.json file."