Open chenryn opened 4 months ago
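I switched to 4-bit QLoRA, then changed size in example.sh to 512 and r in acc_complete_example_trainer.py to 16. It runs now and uses 12.84 GB of GPU memory. However, it took about 20 minutes from program start to the end of the first step, so I suspect something else is still wrong.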
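For context, here is a rough back-of-the-envelope sketch of how much memory the model weights alone need at different precisions. The parameter count of about 6.74 billion for LLaMA-2-7B is an approximation, and the estimate ignores quantization constants, layers that stay unquantized, activations, the KV cache, and optimizer state, so real usage will be higher.

```python
# Rough weight-only memory estimate. N_PARAMS is an approximation for
# LLaMA-2-7B (assumption: about 6.74 billion parameters).
N_PARAMS = 6.74e9


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions the bf16 weights alone come to roughly 12.5 GiB, which is consistent with a 16 GB card being too tight for unquantized training, while 4-bit weights take about a quarter of that.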
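For GPU memory usage, you can refer to our paper, With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation.
Could you share your exact experimental configuration, for example the model, the input and output lengths, whether gradient_checkpointing is enabled, and whether fp16 or bf16 is used?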
Some models may set a long config.max_position_embeddings, which takes up a lot of unnecessary GPU memory at initialization.
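To get a feel for why max_position_embeddings can matter, here is a small sketch of how position-dependent buffers scale. Which buffers a given transformers version actually allocates is an assumption here: the sketch supposes a square boolean mask of size max_position_embeddings squared and a per-layer fp32 cos/sin rotary cache. The head dimension of 128 and the 32 layers match LLaMA-2-7B.

```python
# Hypothetical buffer sizes that depend on max_position_embeddings.
# Whether a specific transformers release allocates these buffers is an
# assumption; the point is only how they scale with the context length.

def square_mask_bytes(max_pos: int, bytes_per_elem: int = 1) -> int:
    """Size of a (max_pos x max_pos) mask, 1 byte per element for bool."""
    return max_pos * max_pos * bytes_per_elem


def rotary_cache_bytes(max_pos: int, head_dim: int, n_layers: int,
                       bytes_per_elem: int = 4) -> int:
    """Size of cos and sin tables of shape (max_pos, head_dim) per layer."""
    return 2 * max_pos * head_dim * n_layers * bytes_per_elem


MIB = 2**20
for max_pos in (4096, 32768):
    mask = square_mask_bytes(max_pos) / MIB
    rope = rotary_cache_bytes(max_pos, head_dim=128, n_layers=32) / MIB
    print(f"max_pos={max_pos}: mask {mask:.0f} MiB, rotary cache {rope:.0f} MiB")
```

The mask grows quadratically and the rotary cache linearly, so going from 4096 to 32768 positions multiplies them by 64 and 8 respectively under these assumptions.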
I used the example.sh that ships with the repo, i.e. the llama2-7b-32k one. Since 16 GB of GPU memory was reported to be insufficient, I made a few changes to acc_complete_example_trainer.py:
```
(myconda) root@Zry7ol:~/Temp-LoRA-main/trainer# diff acc_complete_example_trainer.py acc_complete_example_trainer.py_bak
11,12d10
< torch.backends.cuda.enable_mem_efficient_sdp(False)
< torch.backends.cuda.enable_flash_sdp(False)
15c13
< from peft import LoraConfig, TaskType, get_peft_model, LoraModel, prepare_model_for_kbit_training
---
> from peft import LoraConfig, TaskType, get_peft_model, LoraModel
18c16
< BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM, PreTrainedTokenizer, PreTrainedModel, GenerationConfig,
---
> AutoTokenizer, AutoModelForCausalLM, PreTrainedTokenizer, PreTrainedModel, GenerationConfig,
144,150d140
< bnb_config = BitsAndBytesConfig(
< load_in_4bit=True,
< bnb_4bit_quant_type="nf4",
< bnb_4bit_use_double_quant=True,
< bnb_4bit_compute_dtype=torch.bfloat16,
< )
<
155,156c145
< "device_map": "auto",
< "quantization_config": bnb_config
---
> "device_map": "cuda"
162,163d150
< model = prepare_model_for_kbit_training(model)
<
170,171c157,158
< r=16,
< lora_alpha=8,
---
> r=64,
> lora_alpha=64,
173c160
< target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
---
> target_modules=find_all_linear_names(model=model)
215c202
< use_cache=False, return_dict_in_generate=True
---
> use_cache=True, return_dict_in_generate=True
```
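To put the LoRA changes in the diff above into numbers, the following sketch counts trainable LoRA parameters for the two setups. It assumes the LLaMA-2-7B shapes (hidden size 4096, MLP intermediate size 11008, 32 layers) and assumes that the repo's find_all_linear_names helper selects all seven linear projections in each layer; that second assumption has not been checked against the repo.

```python
# Trainable LoRA parameters: each adapted Linear(in_f, out_f) gains two
# matrices, A (r x in_f) and B (out_f x r), i.e. r * (in_f + out_f) weights.
HIDDEN = 4096         # LLaMA-2-7B hidden size
INTERMEDIATE = 11008  # LLaMA-2-7B MLP intermediate size
N_LAYERS = 32

ATTENTION = {name: (HIDDEN, HIDDEN)
             for name in ("q_proj", "k_proj", "v_proj", "o_proj")}
MLP = {"gate_proj": (HIDDEN, INTERMEDIATE),
       "up_proj": (HIDDEN, INTERMEDIATE),
       "down_proj": (INTERMEDIATE, HIDDEN)}


def lora_param_count(r: int, modules: dict) -> int:
    """Total LoRA parameters over all layers for the given target modules."""
    per_layer = sum(r * (in_f + out_f) for in_f, out_f in modules.values())
    return per_layer * N_LAYERS


modified = lora_param_count(16, ATTENTION)               # r=16, attention only
original = lora_param_count(64, {**ATTENTION, **MLP})    # r=64, assumed all linear
print(f"modified: {modified:,} params, scaling alpha/r = {8 / 16}")
print(f"original: {original:,} params, scaling alpha/r = {64 / 64}")
```

Under these assumptions the modified setup trains roughly a tenth as many adapter parameters, and the LoRA update is also scaled by 0.5 instead of 1.0, so the two runs are not directly comparable in quality.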
The configuration printed at run time is as follows: [config.py:986:print_user_config] json = { "bf16": { "enabled": true }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2.000000e+08, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2.000000e+08, "contiguous_gradients": true, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": 1, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 1, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false, "fp16": { "enabled": false }, "zero_allow_untested_optimizer": true }
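As a quick sanity check on the printed configuration, DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu times gradient_accumulation_steps times the number of GPUs. The sketch below applies that relation to the three batch-related values from the log; the helper name implied_world_size is made up for illustration.

```python
# Batch-related values copied from the printed DeepSpeed configuration.
ds_config = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}


def implied_world_size(cfg: dict) -> int:
    """Number of GPUs implied by the batch-size settings (hypothetical helper)."""
    per_gpu = (cfg["train_micro_batch_size_per_gpu"]
               * cfg["gradient_accumulation_steps"])
    if cfg["train_batch_size"] % per_gpu != 0:
        raise ValueError("train_batch_size is not a multiple of "
                         "micro batch size * gradient accumulation steps")
    return cfg["train_batch_size"] // per_gpu


print("implied number of GPUs:", implied_world_size(ds_config))
```

The values are consistent with a single-GPU run, so all of the memory figures in this thread refer to one device.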
This is a GPU host rented by the hour on matpool, using the pytorch 2.1.1 + cuda11.8 image, so I made small changes to requirements.txt: the four related libraries were not upgraded to 2.2.1.
I just reran the tests. The results are as follows:

| script | gradient checkpointing enabled | GPU memory |
|---|---|---|
| scripts/llama2.sh | True | 22 GB |
| scripts/llama2.sh | False | 56 GB |
| scripts/example.sh | True | 24 GB (step = 2) |
| scripts/example.sh | False | 36 GB (step = 2) |
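Environment: A800, cuda11.8. The memory figures were read from the nvidia-smi command.
The experimental configuration is the official one provided by this project.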
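The effect of gradient checkpointing in the measurements above can be summarized with a few lines of arithmetic. The numbers are copied from the table; nothing else is assumed.

```python
# GPU memory in GB, copied from the table above.
measurements = {
    "scripts/llama2.sh": {"with_ckpt": 22, "without_ckpt": 56},
    "scripts/example.sh": {"with_ckpt": 24, "without_ckpt": 36},
}


def savings(with_ckpt: float, without_ckpt: float) -> tuple:
    """Return (GB saved, fraction saved) when checkpointing is enabled."""
    saved = without_ckpt - with_ckpt
    return saved, saved / without_ckpt


for script, m in measurements.items():
    saved, frac = savings(m["with_ckpt"], m["without_ckpt"])
    print(f"{script}: saves {saved} GB ({frac:.0%})")
```

Even with checkpointing enabled, both scripts sit at 22 GB or more, which is above what a 16 GB card offers and close to the limit of a 24 GB card.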
Note that during "model.generate", transformers automatically reclaims and frees some GPU memory, so memory usage fluctuates considerably.
To enable gradient_checkpointing, taking scripts/llama2.sh as an example, set the final "--gradient_checkpointing" argument to "true":
ACCELERATE_CONFIG=""
SAVE_DIR=""
MODEL_NAME="togethercomputer/LLaMA-2-7B-32K"
mkdir -p $SAVE_DIR
accelerate launch --config_file $ACCELERATE_CONFIG trainer/acc_pg19_trainer.py --model_name $MODEL_NAME \
……
--gradient_checkpointing "true"
Could you check the result with "gradient_checkpointing" enabled?
I tried running the project on an A30 with 24 GB of GPU memory and it also hit OOM. Doesn't LoRA on a 7B model usually need only 16 GB of GPU memory?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 15.19 MiB is free. Process 19214 has 23.67 GiB memory in use. Of the allocated memory 22.14 GiB is allocated by PyTorch, and 263.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF [2024-04-22 08:02:12,640] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1532) of binary: /root/miniconda3/envs/myconda/bin/python
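The error message suggests tuning max_split_size_mb when reserved-but-unallocated memory is large. As a rough check of whether that applies here, the sketch below parses the figures out of the message quoted above and computes what share of the card is reserved but unused. The regular expressions are written for this particular message format and are an assumption; other PyTorch versions may word it differently.

```python
import re

# Excerpt of the OOM message quoted above (including PyTorch's own spelling).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total "
    "capacty of 23.69 GiB of which 15.19 MiB is free. Process 19214 has "
    "23.67 GiB memory in use. Of the allocated memory 22.14 GiB is allocated "
    "by PyTorch, and 263.35 MiB is reserved by PyTorch but unallocated."
)

MIB_PER_UNIT = {"MiB": 1.0, "GiB": 1024.0}

PATTERNS = {
    "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated":
        r"([\d.]+) (MiB|GiB) is reserved by PyTorch but unallocated",
}


def parse_oom(message: str) -> dict:
    """Extract the memory figures from the message, converted to MiB."""
    stats = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{key}' in the message")
        value, unit = match.groups()
        stats[key] = float(value) * MIB_PER_UNIT[unit]
    return stats


stats = parse_oom(OOM_MESSAGE)
fragmented_share = stats["reserved_unallocated"] / stats["total"]
print(f"allocated: {stats['allocated']:.0f} MiB of {stats['total']:.0f} MiB")
print(f"reserved but unallocated: {fragmented_share:.1%} of the card")
```

Here the reserved-but-unallocated share is only about 1% of the card, so allocator tuning alone seems unlikely to resolve the OOM; the run appears to need more memory than the card has.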