Closed: xinlanz closed this issue 6 months ago
The configuration is as follows:

```bash
python train.py \
    --model_name_or_path /home/zxl/llama \
    --data_path data/MetaMathQA-395K-new.json \
    --output_dir /data/wangbowen/PiSSA/output/model-metamath-lora \
    --init_lora_weights lora \
    --report_to none \
    --query "query" \
    --response "response" \
    --merge_and_save True \
    --data_length 10000 \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True
```
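For context, here is a hypothetical sketch of the kind of PEFT setup a flag like `--init_lora_weights lora` suggests, i.e. the standard LoRA initialization rather than PiSSA's SVD-based one. The rank, scaling, and target modules below are illustrative assumptions, not values taken from the PiSSA `train.py`:

```python
# Hypothetical sketch of the adapter setup the flags above roughly correspond to;
# the actual PiSSA train.py may build it differently.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("/home/zxl/llama")  # --model_name_or_path

config = LoraConfig(
    init_lora_weights=True,  # plain LoRA init; "pissa" selects the SVD-based init in recent PEFT
    r=16,                    # assumed rank, not taken from the command above
    lora_alpha=16,           # assumed scaling
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```

In recent PEFT versions, `init_lora_weights=True` gives the default LoRA initialization (Kaiming for A, zeros for B), while `init_lora_weights="pissa"` initializes the adapter from an SVD of the base weight.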
I reproduced your code locally with the MetaMathQA dataset, a Llama-2-7b base model, and 4× 40 GB A100 GPUs, but training produces the output below. Is this a dataset problem or a code problem? I tried all three datasets in your `data` directory and all of them show the same behavior.

```
/root/anaconda3/envs/pissa/lib/python3.9/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.9801980198019803e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.9603960396039606e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.940594059405941e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 7.920792079207921e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 9.900990099009902e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.1881188118811881e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.3861386138613863e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.5841584158415842e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.7821782178217822e-06, 'epoch': 0.0}
```
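Two things are worth checking here. First, the `UserWarning` about gathering scalars is characteristic of `torch.nn.DataParallel`, which the Hugging Face `Trainer` falls back to when `python train.py` is run directly on a multi-GPU machine; launching with `torchrun --nproc_per_node=4 train.py ...` uses DistributedDataParallel instead. Second, a loss of exactly 0.0 together with a NaN gradient norm often points at the labels: if every target token is masked to `-100` (for example because truncation cut off the whole response), the token-level loss has nothing to average over and the optimizer step degenerates. Below is a minimal check along those lines; it assumes, which the original report does not confirm, that the JSON file is a list of records keyed by the `query`/`response` names passed on the command line, and that the script masks the prompt the way most SFT scripts do:

```python
# A quick sanity check (not from the original issue) for one common cause of
# a 0.0 loss with a NaN grad norm: every label token masked out with -100.
import json
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # the label id Hugging Face loss functions ignore

tokenizer = AutoTokenizer.from_pretrained("/home/zxl/llama")  # path from the command above

# Assumption: the data file is a JSON list of {"query": ..., "response": ...} records,
# matching the --query/--response flags in the command.
with open("data/MetaMathQA-395K-new.json") as f:
    example = json.load(f)[0]

prompt_len = len(tokenizer(example["query"]).input_ids)
input_ids = tokenizer(example["query"] + example["response"]).input_ids
labels = list(input_ids)
labels[:prompt_len] = [IGNORE_INDEX] * prompt_len  # mask the prompt, as most SFT scripts do

# If this prints 0 for many samples (e.g. truncation removed the response),
# the batch contributes no supervised tokens and training degenerates.
print("supervised tokens:", sum(l != IGNORE_INDEX for l in labels))
```

If the printed count is 0 for a noticeable fraction of samples, the masking/truncation logic in the data pipeline is the first place to look; otherwise the multi-GPU launch method is the more likely culprit.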