[2024-09-07 13:00:42,421] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-09-07 13:00:43,534] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-07 13:00:43,534] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING:root:FSDP or ZeRO3 are not incompatible with QLoRA.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
low_cpu_mem_usage was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.22s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Currently using LoRA for fine-tuning the MiniCPM-V model.
Traceback (most recent call last):
File "/work/MiniCPM-V/finetune/finetune.py", line 299, in
train()
File "/work/MiniCPM-V/finetune/finetune.py", line 243, in train
model = get_peft_model(model, lora_config)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/peft/mapping.py", line 179, in get_peft_model
return PeftModel(model, peft_config, adapter_name=adapter_name, autocast_adapter_dtype=autocast_adapter_dtype)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/peft/peft_model.py", line 155, in init
self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 139, in init
super().init(model, config, adapter_name)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 175, in init
self.inject_adapter(self.model, adapter_name)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 417, in inject_adapter
new_module = ModulesToSaveWrapper(target, adapter_name)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/peft/utils/other.py", line 195, in init
self.update(adapter_name)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/peft/utils/other.py", line 245, in update
self.modules_to_save[adapter_name].requires_grad_(True)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2440, in requires_grad_
p.requires_grad_(requires_grad)
RuntimeError: only Tensors of floating point dtype can require gradients
[2024-09-07 13:00:50,735] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2187) of binary: /root/miniconda3/envs/minicpm/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/minicpm/bin/torchrun", line 8, in
sys.exit(main())
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/minicpm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-07_13:00:50
  host      : 555898d76c84
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2187)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
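For context: the final RuntimeError is raised by PyTorch whenever requires_grad_(True) is called on a tensor whose dtype is not floating point. Since this is the int4-quantized checkpoint, the weights that PEFT's ModulesToSaveWrapper tries to unfreeze are presumably stored in an integer dtype. A minimal sketch, independent of MiniCPM-V and PEFT, that reproduces the same message:

import torch

# Packed 4-bit / 8-bit quantized weights are stored in integer dtypes (e.g. uint8).
# Asking autograd to track such a tensor raises the same error as in the traceback above.
w = torch.zeros(8, dtype=torch.uint8)
w.requires_grad_(True)  # RuntimeError: only Tensors of floating point dtype can require gradients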
Environment:
torch==2.1.2
torchvision==0.16.0
GPU: RTX 4060 Ti (16 GB VRAM)
The finetune_lora.sh script is as follows:
GPUS_PER_NODE=1
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
MODEL="/work/MiniCPM-V/check_point/OpenBMB/MiniCPM-Llama3-V-2_5-int4" # or openbmb/MiniCPM-V-2, openbmb/MiniCPM-Llama3-V-2_5
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/work/MiniCPM-V/minicpm_data/data/train.json"
EVAL_DATA="/work/MiniCPM-V/minicpm_data/eval/eval.json"
LLM_TYPE="llama3"
# if using openbmb/MiniCPM-V-2, set LLM_TYPE=minicpm
# if using openbmb/MiniCPM-Llama3-V-2_5, set LLM_TYPE=llama3
MODEL_MAX_Length=2048 # for multi-image SFT, set MODEL_MAX_Length=4096
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
--model_name_or_path $MODEL \
--llm_type $LLM_TYPE \
--data_path $DATA \
--eval_data_path $EVAL_DATA \
--remove_unused_columns false \
--label_names "labels" \
--prediction_loss_only false \
--bf16 false \
--bf16_full_eval false \
--fp16 true \
--fp16_full_eval true \
--do_train \
--do_eval \
--tune_llm false \
--use_lora true \
--q_lora true \
--tune_vision true \
--lora_target_modules "llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj|o_proj)" \
--model_max_length $MODEL_MAX_Length \
--max_slice_nums 9 \
--max_steps 10000 \
--eval_steps 1000 \
--output_dir output/output__lora \
--logging_dir output/output_lora \
--logging_strategy "steps" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "steps" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10 \
--learning_rate 1e-6 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--gradient_checkpointing true \
--deepspeed ds_config_zero3.json \
--report_to "tensorboard" # wandb
The error shown above occurs when running a LoRA fine-tuning test with the MiniCPM-Llama3-V-2_5-int4 checkpoint.
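For what it's worth, a quick way to see which weights PEFT would fail to unfreeze is to list the parameters that are not floating point right before get_peft_model is called. The helper below is hypothetical (not part of finetune.py), just a sketch of the check:

import torch

def report_non_float_params(model):
    # Print every parameter whose dtype cannot carry gradients
    # (e.g. the uint8/int8 storage used by packed 4-bit weights).
    for name, p in model.named_parameters():
        if not torch.is_floating_point(p):
            print(f"{name}: dtype={p.dtype}, shape={tuple(p.shape)}")

# Example: call report_non_float_params(model) in finetune.py just before
# model = get_peft_model(model, lora_config)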