FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Error when fine-tuning the Reranker #787

Closed wang-ship-it closed 5 months ago

wang-ship-it commented 5 months ago

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 nohup python3 -m torch.distributed.run --nproc_per_node=4 \
  -m FlagEmbedding.llm_reranker.finetune_for_layerwise.run \
  --output_dir ./results_wudi \
  --model_name_or_path ./bge-reranker-v2-minicpm-layerwise \
  --train_data ./train.jsonl \
  --learning_rate 2e-4 \
  --num_train_epochs 50 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 16 \
  --dataloader_drop_last True \
  --query_max_len 1024 \
  --passage_max_len 512 \
  --train_group_size 2 \
  --logging_steps 100 \
  --save_steps 2000 \
  --save_total_limit 50 \
  --ddp_find_unused_parameters True \
  --gradient_checkpointing \
  --warmup_ratio 0.1 \
  --fp16 \
  --use_lora True \
  --lora_rank 32 \
  --lora_alpha 64 \
  --use_flash_attn False \
  --target_modules q_proj k_proj v_proj o_proj \
  --start_layer 8 \
  --head_multi True \
  --head_type simple \
  --lora_extra_parameters linear_head > wudi.log 2>&1 &
```

File "/dockerdata/anaconda3/envs/wkl_bge/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward

File "/dockerdata/anaconda3/envs/wkl_bge/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError : Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward passExpected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations. Parameter at index 319 with name model.base_model.model.model.layers.39.self_attn.o_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

The ddp_find_unused_parameters flag is hard to get right: setting it to either True or False still errors out. Is the only option to train without DDP?

wang-ship-it commented 5 months ago

Found the cause: removing --gradient_checkpointing fixes it.
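If gradient checkpointing is still needed for memory, an alternative (untested here) is to keep it but request the non-reentrant implementation, which does not replay the forward during backward and is generally DDP-friendly. This is only a sketch and assumes a transformers version new enough to accept gradient_checkpointing_kwargs (roughly 4.35+); the FlagEmbedding run script may not expose this option on its CLI, so it could require a small code change:

```python
from transformers import TrainingArguments

# Hypothetical arguments mirroring the command above; only the last two
# entries differ from the original run.
training_args = TrainingArguments(
    output_dir="./results_wudi",
    learning_rate=2e-4,
    num_train_epochs=50,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    fp16=True,
    gradient_checkpointing=True,
    # Non-reentrant checkpointing does not replay the forward in backward,
    # so DDP's hooks fire only once per parameter.
    gradient_checkpointing_kwargs={"use_reentrant": False},
    # Usually safe to disable once every trainable parameter (here the LoRA
    # weights) receives a gradient each step.
    ddp_find_unused_parameters=False,
)
```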