GraphPKU / PiSSA

PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (NeurIPS 2024 Spotlight)
https://arxiv.org/abs/2404.02948

Is there a problem with the training dataset? #1

Closed xinlanz closed 6 months ago

xinlanz commented 6 months ago

I reproduced your code locally with the MetaMathQA dataset, the Llama-2-7b base model, and four 40GB A100s, but training reports the errors below. Is this a problem with the dataset or with the code? I tried all three datasets in your data directory, and all of them show the same behavior.

/root/anaconda3/envs/pissa/lib/python3.9/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.9801980198019803e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.9603960396039606e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5.940594059405941e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 7.920792079207921e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 9.900990099009902e-07, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.1881188118811881e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.3861386138613863e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.5841584158415842e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.7821782178217822e-06, 'epoch': 0.0}
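
For what it's worth, a loss pinned at exactly 0.0 together with grad_norm: nan often points at the supervision signal rather than the dataset files themselves, e.g. every label position being masked out, or a dtype problem under bf16. Below is a minimal debugging sketch, assuming a standard HuggingFace-style batch with a labels tensor and the usual -100 ignore-index convention; the helper names are hypothetical, not from this repository:

```python
import torch

def check_batch(batch):
    # Count label tokens that actually contribute to the loss.
    # With the HF convention, positions set to -100 are ignored by
    # CrossEntropyLoss; if none survive, the reported loss is 0.0.
    labels = batch["labels"]
    n_supervised = (labels != -100).sum().item()
    print(f"supervised tokens in batch: {n_supervised}")

def check_grads(model):
    # Report any parameter whose gradient contains NaN after backward().
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"NaN gradient in {name}")
```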

xinlanz commented 6 months ago

The configuration is as follows (the same command was pasted twice; the complete copy is kept here):

python train.py \
    --model_name_or_path /home/zxl/llama \
    --data_path data/MetaMathQA-395K-new.json \
    --output_dir /data/wangbowen/PiSSA/output/model-metamath-lora \
    --init_lora_weights lora \
    --report_to none \
    --query "query" \
    --response "response" \
    --merge_and_save True \
    --data_length 10000 \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True
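
For reference, a minimal sketch of how a flag like --init_lora_weights typically maps onto peft's LoraConfig. This is not the repository's train.py: it assumes peft >= 0.11, which accepts init_lora_weights="pissa", and the rank and target modules below are hypothetical, not taken from the issue:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Path taken from the command above.
model = AutoModelForCausalLM.from_pretrained("/home/zxl/llama")

lora_config = LoraConfig(
    r=16,                       # hypothetical rank, not stated in the issue
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    init_lora_weights="pissa",  # plain-LoRA init (the run above) corresponds to True in peft
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```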