SAI990323 / TALLRec

Apache License 2.0

About ValueError: #59

Open LWZWTWLWZ opened 5 months ago

LWZWTWLWZ commented 5 months ago

Dear author, I ran into the following problem while reproducing your code. I downloaded llama2-7b-hf to a local directory and run the code by loading the model from that local path. Training is interrupted partway through with an error saying the input contains NaN. Do you have a good solution for this?

(Tallrec) ubuntu@ubuntu:~/0522231063/Githubfuxian/TALLRec/TALLRec-main$ bash ./shell/instruct_7B.sh 1 42
1, 42
lr: 1e-4, dropout: 0.05, seed: 42, sample: 64
Training Alpaca-LoRA model with params:
base_model: /home/ubuntu/llama2-7b-hf
train_data_path: /home/ubuntu/0522231063/Githubfuxian/TALLRec/TALLRec-main/data/book/train.json
val_data_path: /home/ubuntu/0522231063/Githubfuxian/TALLRec/TALLRec-main/data/book/valid.json
sample: 64
seed: 42
output_dir: /home/ubuntu/0522231063/Githubfuxian/TALLRec/TALLRec-main/model_42_64
batch_size: 128
micro_batch_size: 32
num_epochs: 200
learning_rate: 0.0001
cutoff_len: 512
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: True
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: /home/ubuntu/0522231063/Githubfuxian/TALLRec/TALLRec-main/alpaca-lora-7B/adapter_config.json

Loading checkpoint shards: 100%| 2/2 [00:04<00:00, 2.23s/it]
Checkpoint /home/ubuntu/0522231063/Githubfuxian/TALLRec/TALLRec-main/alpaca-lora-7B/adapter_config.json/adapter_model.bin not found
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Map: 100%| 64/64 [00:00<00:00, 1383.30 examples/s]
Map: 100%| 2427/2427 [00:01<00:00, 1593.21 examples/s]
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
0%| 0/200 [00:00<?, ?it/s]
/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 1.0435, 'learning_rate': 4e-05, 'epoch': 8.0}
{'eval_loss': 2.0803046226501465, 'eval_auc': 0.499365194424907, 'eval_runtime': 162.0148, 'eval_samples_per_second': 14.98, 'eval_steps_per_second': 1.876, 'epoch': 10.0}
5%| 10/200 [05:12<47:31, 15.01s/it]
/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 1.0116, 'learning_rate': 8e-05, 'epoch': 16.0}
{'eval_loss': 1.836759328842163, 'eval_auc': 0.5092683767063315, 'eval_runtime': 162.8768, 'eval_samples_per_second': 14.901, 'eval_steps_per_second': 1.866, 'epoch': 20.0}
10%| 20/200 [10:25<51:03, 17.02s/it]
/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 0.9067, 'learning_rate': 9.777777777777778e-05, 'epoch': 24.0}
{'eval_loss': 1.3795101642608643, 'eval_auc': 0.5749956938005082, 'eval_runtime': 162.5594, 'eval_samples_per_second': 14.93, 'eval_steps_per_second': 1.87, 'epoch': 30.0}
15%| 30/200 [15:38<48:25, 17.09s/it]
/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 0.7264, 'learning_rate': 9.333333333333334e-05, 'epoch': 32.0}
{'loss': 0.5474, 'learning_rate': 8.888888888888889e-05, 'epoch': 40.0}
20%| 40/200 [18:09<45:34, 17.09s/it]
304/304 [02:41<00:00, 2.19it/s]
Traceback (most recent call last):
  File "/home/ubuntu/0522231063/Githubfuxian/TALLRec/TALLRec-main/finetune_rec.py", line 325, in <module>
    fire.Fire(train)
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ubuntu/0522231063/Githubfuxian/TALLRec/TALLRec-main/finetune_rec.py", line 292, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/transformers/trainer.py", line 2287, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/transformers/trainer.py", line 2993, in evaluate
    output = eval_loop(
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/transformers/trainer.py", line 3281, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/home/ubuntu/0522231063/Githubfuxian/TALLRec/TALLRec-main/finetune_rec.py", line 222, in compute_metrics
    auc = roc_auc_score(pre[1], pre[0])
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/sklearn/metrics/_ranking.py", line 606, in roc_auc_score
    y_score = check_array(y_score, ensure_2d=False)
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1003, in check_array
    _assert_all_finite(
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/sklearn/utils/validation.py", line 126, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "/home/ubuntu/anaconda3/envs/Tallrec/lib/python3.10/site-packages/sklearn/utils/validation.py", line 175, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input contains NaN.
20%| 40/200 [20:51<1:23:27, 31.30s/it]
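The traceback shows that roc_auc_score inside compute_metrics received prediction scores containing NaN, i.e. the evaluation scores were already non-finite before the metric was computed. Below is a minimal diagnostic sketch, not code from this repository; it assumes the label/score pair matches the pre[1] / pre[0] arrays in the traceback, and it reports how many scores are non-finite before computing AUC over the finite entries only.

```python
# Hypothetical debugging helper; not part of finetune_rec.py.
import numpy as np
from sklearn.metrics import roc_auc_score

def debug_auc(labels, scores):
    """Report non-finite scores, then compute AUC over the finite entries only."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    finite = np.isfinite(scores)
    n_bad = int((~finite).sum())
    if n_bad:
        print(f"warning: {n_bad}/{scores.size} prediction scores are NaN/inf")
    # AUC is undefined if no finite scores remain or only one class is present.
    if finite.sum() == 0 or np.unique(labels[finite]).size < 2:
        return float("nan")
    return roc_auc_score(labels[finite], scores[finite])

# Example: in place of the failing call `auc = roc_auc_score(pre[1], pre[0])`:
# auc = debug_auc(pre[1], pre[0])
```

If most scores turn out to be NaN, the problem is upstream of the metric rather than in roc_auc_score itself: common causes of non-finite evaluation scores are mixed-precision (fp16) overflow or a learning rate that destabilizes training, so loading the base model in bf16/fp32 or lowering learning_rate may be worth trying. The sketch above only diagnoses and sidesteps the symptom; it does not fix the underlying instability.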