QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] Surprise! A question about GPU memory #994

Closed xx-Jiangwen closed 7 months ago

xx-Jiangwen commented 8 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Hello, authors: I have now tried both LLaMA-Factory and Swift, two LLM fine-tuning frameworks, and I found that with the same parameters, Qwen's native fine-tuning script uses less GPU memory for instruction tuning. Take 14B-chat as an example, with a maximum length of 6144 and ZeRO-3. The launch script is as follows:

    torchrun $DISTRIBUTED_ARGS /home/ftpai/code/Qwen/finetune.py \
        --model_name_or_path $MODEL \
        --data_path $DATA \
        --bf16 True \
        --output_dir $output_qwen \
        --num_train_epochs 7 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 6 \
        --save_total_limit 10 \
        --learning_rate 1e-4 \
        --weight_decay 0.1 \
        --adam_beta2 0.95 \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --report_to "tensorboard" \
        --model_max_length 6144 \
        --lazy_preprocess True \
        --use_lora \
        --gradient_checkpointing \
        --deepspeed finetune/ds_config_zero3.json

The ZeRO-3 config is as follows:

    {
      "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
      },
      "bf16": {
        "enabled": "auto"
      },
      "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": "auto",
          "betas": "auto",
          "eps": "auto",
          "weight_decay": "auto"
        }
      },
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
          "device": "none",
          "pin_memory": true
        },
        "offload_param": {
          "device": "cpu",
          "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
      },
      "gradient_accumulation_steps": "auto",
      "gradient_clipping": "auto",
      "steps_per_print": 100,
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "wall_clock_breakdown": false
    }
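For reference, here is a minimal sketch (not the actual finetune.py; paths and names are placeholders) of how a ZeRO-3 JSON like the one above is typically handed to the Hugging Face Trainer. The "auto" fields are resolved from the TrainingArguments, which is why --bf16, the per-device batch size, gradient accumulation, and learning rate in the launch command end up in the DeepSpeed config:

    # Minimal sketch: wiring a ZeRO-3 config into the HF Trainer.
    # Assumptions: transformers and deepspeed installed; launched via torchrun/deepspeed.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    MODEL = "Qwen/Qwen-14B-Chat"  # placeholder for $MODEL

    # Build TrainingArguments before from_pretrained so ZeRO-3 can shard the model
    # at load time instead of materializing the full weights on every rank.
    args = TrainingArguments(
        output_dir="output_qwen",
        bf16=True,                                   # resolves the "auto" bf16 block
        per_device_train_batch_size=1,               # resolves train_micro_batch_size_per_gpu
        gradient_accumulation_steps=4,               # resolves gradient_accumulation_steps
        learning_rate=1e-4,                          # resolves the optimizer "auto" fields
        gradient_checkpointing=True,                 # recompute activations to save memory
        deepspeed="finetune/ds_config_zero3.json",   # the config shown above
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(MODEL, trust_remote_code=True)

    # trainer = Trainer(model=model, args=args, train_dataset=..., tokenizer=tokenizer)
    # trainer.train()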

Qwen's resource usage is as follows: Qwen-6k (screenshot). The other frameworks, however, run out of GPU memory with the same settings. Does the native script do any special memory handling? I went through the source code and did not see anything special. I also tried keeping the LoRA precision in LLaMA-Factory and Swift consistent with native Qwen (bf16), but that did not fix the memory blow-up. I hope you can explain this.
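Note that matching only the precision may not be enough: with LoRA, the rank, dropout, and especially the set of target modules all change how much memory is used, and frameworks pick different defaults. A hedged sketch (assuming peft; the values below are illustrative, not the defaults of any particular framework) of the LoRA knobs that would need to be identical for a fair comparison:

    # Sketch of LoRA settings to align across frameworks before comparing memory.
    # Assumption: peft is installed; module names follow Qwen's layer naming.
    from peft import LoraConfig, TaskType, get_peft_model

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=64,                  # rank drives trainable-parameter and optimizer-state size
        lora_alpha=16,
        lora_dropout=0.05,
        # Which modules get adapters matters as much as the dtype; some frameworks
        # default to attention projections only, others to attention + MLP.
        target_modules=["c_attn", "c_proj", "w1", "w2"],
    )

    # model = get_peft_model(model, lora_config)
    # model.print_trainable_parameters()  # quick sanity check that the setups match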

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers: 4.34.0
- PyTorch:
    torch          2.0.0+cu118
    torchaudio     2.0.1+cu118
    torchvision    0.15.1+cu118
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8

Anything else?

No response

mengban commented 8 months ago

Is this a bug? It's a feature [doge]

xx-Jiangwen commented 8 months ago

Is this a bug? It's a feature [doge]

I'd like to know what kind of magic it uses.

jklj077 commented 7 months ago

I suppose there is no magic 👀. It is possible that the settings across the frameworks are not exactly matched.
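One way to narrow this down empirically is to log peak CUDA memory under each framework with identical sequence length, batch size, ZeRO-3 config, and LoRA settings. A small sketch, assuming the framework builds on the Hugging Face Trainer (the callback name is hypothetical):

    # Log peak CUDA memory at each logging step so runs from different frameworks
    # can be compared directly. Assumes transformers and a CUDA build of torch.
    import torch
    from transformers import TrainerCallback

    class PeakMemoryCallback(TrainerCallback):
        """Print peak allocated/reserved CUDA memory whenever the Trainer logs."""

        def on_log(self, args, state, control, **kwargs):
            if torch.cuda.is_available():
                alloc = torch.cuda.max_memory_allocated() / 2**30
                reserved = torch.cuda.max_memory_reserved() / 2**30
                print(f"step {state.global_step}: "
                      f"peak allocated {alloc:.2f} GiB, peak reserved {reserved:.2f} GiB")

    # Usage: pass an instance when building the Trainer, e.g.
    #   trainer = Trainer(..., callbacks=[PeakMemoryCallback()])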