FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License
7.75k stars · 563 forks

New bug during fine-tune #1178

Open holopyolo opened 4 weeks ago

holopyolo commented 4 weeks ago

[screenshot of the error]

torchrun --nproc_per_node 1 \
  -m FlagEmbedding.finetune.reranker.encoder_only.base \
  --model_name_or_path BAAI/bge-reranker-v2-m3 \
  --cache_dir ./cache/model \
  --train_data ./train.json \
  --cache_path ./cache/data \
  --train_group_size 3 \
  --query_max_len 512 \
  --passage_max_len 512 \
  --pad_to_multiple_of 8 \
  --knowledge_distillation False \
  --output_dir ./model \
  --overwrite_output_dir \
  --learning_rate 6e-5 \
  --fp16 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 1 \
  --dataloader_drop_last True \
  --warmup_ratio 0.1 \
  --gradient_checkpointing \
  --weight_decay 0.01 \
  --deepspeed ../ds_stage0.json \
  --logging_steps 10 \
  --save_steps 500

Not reproducible on earlier versions.

545999961 commented 4 weeks ago

Hello, we didn't encounter this issue in our tests. Could you please provide the specific versions of torch, transformers, deepspeed, and flash-attn you are using?
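
(For anyone else reporting the same problem, the requested versions can be collected with something along these lines, assuming a pip-managed environment:

pip list 2>/dev/null | grep -Ei "^(torch|transformers|deepspeed|flash.attn|accelerate|flagembedding) "
python -c "import torch; print(torch.__version__, torch.version.cuda)"

The second command prints the CUDA version torch was built against.)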

yichuxue commented 3 weeks ago

I've run into this issue as well. Package versions:

torch                    2.3.1
sentence-transformers    3.0.1
transformers             4.44.2
deepspeed                0.14.4

I didn't use flash-attn or deepspeed during training. Training script:

source /opt/conda/bin/activate /opt/conda/envs/bge
CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node 2 \
  -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path BAAI/bge-m3 \
  --train_data /mnt/sda/app/embedding/train_data/embedding/20241024 \
  --output_dir /mnt/sda/app/embedding/train_out/embedding/20241024 \
  --train_group_size 8 \
  --query_max_len 256 \
  --passage_max_len 2048 \
  --pad_to_multiple_of 8 \
  --knowledge_distillation True \
  --same_dataset_within_batch True \
  --small_threshold 0 \
  --drop_threshold 0 \
  --overwrite_output_dir \
  --learning_rate 1e-5 \
  --fp16 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 20 \
  --dataloader_drop_last True \
  --warmup_ratio 0.1 \
  --gradient_checkpointing \
  --logging_steps 1 \
  --save_steps 1000 \
  --negatives_cross_device True \
  --temperature 0.02 \
  --sentence_pooling_method cls \
  --normalize_embeddings True \
  --kd_loss_type m3_kd_loss \
  --unified_finetuning True \
  --use_self_distill True \
  --fix_encoder False \
  --self_distill_start_step 0

Error message:

[2024-11-01 07:09:35,232] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
[2024-11-01 07:09:35,356] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
  0%|                                                                                                                                                    | 0/186 [00:00<?, ?it/s]
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2888: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2888: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
/opt/conda/envs/bge/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/opt/conda/envs/bge/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank1]:     return _run_code(code, main_globals, None,
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/runpy.py", line 86, in _run_code
[rank1]:     exec(code, run_globals)
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/FlagEmbedding/finetune/embedder/encoder_only/m3/__main__.py", line 22, in <module>
[rank1]:     runner.run()
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/FlagEmbedding/abc/finetune/embedder/AbsRunner.py", line 119, in run
[rank1]:     self.trainer.train(resume_from_checkpoint=self.training_args.resume_from_checkpoint)
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs)
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
[rank1]:     self.accelerator.backward(loss, **kwargs)
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/accelerate/accelerator.py", line 2147, in backward
[rank1]:     self.scaler.scale(loss).backward(**kwargs)
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
[rank1]:     return user_fn(self, *args)
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 320, in backward
[rank1]:     torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank1]: Parameter at index 387 with name model.encoder.layer.23.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.
545999961 commented 3 weeks ago

CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node 2 \
  -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
  --model_name_or_path BAAI/bge-m3 \
  --train_data /mnt/sda/app/embedding/train_data/embedding/20241024 \
  --output_dir /mnt/sda/app/embedding/train_out/embedding/20241024 \
  --train_group_size 8 \
  --query_max_len 256 \
  --passage_max_len 2048 \
  --pad_to_multiple_of 8 \
  --knowledge_distillation True \
  --same_dataset_within_batch True \
  --small_threshold 0 \
  --drop_threshold 0 \
  --overwrite_output_dir \
  --learning_rate 1e-5 \
  --fp16 \
  --num_train_epochs 2 \
  --per_device_train_batch_size 20 \
  --dataloader_drop_last True \
  --warmup_ratio 0.1 \
  --gradient_checkpointing \
  --logging_steps 1 \
  --save_steps 1000 \
  --negatives_cross_device True \
  --temperature 0.02 \
  --sentence_pooling_method cls \
  --normalize_embeddings True \
  --kd_loss_type m3_kd_loss \
  --unified_finetuning True \
  --use_self_distill True \
  --fix_encoder False \
  --self_distill_start_step 0

Using gradient_checkpointing needs to be paired with deepspeed, so you can download a DeepSpeed config and pass it via the deepspeed argument, or remove gradient_checkpointing.
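
(For reference, the --deepspeed argument expects a JSON config such as the ds_stage0.json referenced in the reranker command at the top of this thread. A minimal ZeRO stage-0 sketch is shown below; the repo's example config may differ, and "auto" lets the HF Trainer fill in the values from its own arguments:

{
  "zero_optimization": {
    "stage": 0
  },
  "fp16": {
    "enabled": "auto"
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}

Save it somewhere, e.g. ./ds_stage0.json, and add --deepspeed ./ds_stage0.json to the torchrun command, or alternatively drop --gradient_checkpointing as suggested.)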

yichuxue commented 3 weeks ago

Thanks! It's running successfully now.

xushan116 commented 1 week ago

@yichuxue Could I ask which versions you used in your environment for this bge-m3 fine-tuning? I'd like to run it too. I'm on a V100 card, CUDA 10.1 (that's what nvcc -V shows) or CUDA 12.0 (nvidia-smi shows 12.0), Python 3.10, torch 2.5.1, transformers 4.44.2, deepspeed 0.15.4. Could you help me figure out where the problem is, or list the versions you used? Thanks!

xushan116 commented 1 week ago

@yichuxue One more question: did you run it without deepspeed? Any pointers would be appreciated.

xushan116 commented 1 week ago

@yichuxue Removing deepspeed and gradient_checkpointing seems to have gotten it running... Is that how you got it working?

{'loss': 0.0008, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0023, 'grad_norm': 0.7831071615219116, 'learning_rate': 1.25e-06, 'epoch': 0.03}
{'loss': 1.709, 'grad_norm': nan, 'learning_rate': 1.25e-06, 'epoch': 0.04}
{'loss': 0.0433, 'grad_norm': 8.900976181030273, 'learning_rate': 2.5e-06, 'epoch': 0.05}
{'loss': 0.0006, 'grad_norm': 0.20847205817699432, 'learning_rate': 3.7500000000000005e-06, 'epoch': 0.06}
{'loss': 0.7183, 'grad_norm': nan, 'learning_rate': 3.7500000000000005e-06, 'epoch': 0.07}
{'loss': 5.3789, 'grad_norm': nan, 'learning_rate': 3.7500000000000005e-06, 'epoch': 0.09}
{'loss': 0.1038, 'grad_norm': 31.261396408081055, 'learning_rate': 5e-06, 'epoch': 0.1}
{'loss': 0.0, 'grad_norm': 0.000105711464129854, 'learning_rate': 6.25e-06, 'epoch': 0.11}
{'loss': 0.2096, 'grad_norm': 65.53326416015625, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.12}
{'loss': 0.0, 'grad_norm': 8.12459438748192e-06, 'learning_rate': 8.750000000000001e-06, 'epoch': 0.14}
{'loss': 0.0, 'grad_norm': 0.00014261712203733623, 'learning_rate': 1e-05, 'epoch': 0.15}
{'loss': 1.8809, 'grad_norm': nan, 'learning_rate': 1e-05, 'epoch': 0.16}
{'loss': 4.9062, 'grad_norm': nan, 'learning_rate': 1e-05, 'epoch': 0.17}
{'loss': 1.5381, 'grad_norm': 239.84178161621094, 'learning_rate': 9.861111111111112e-06, 'epoch': 0.19}
{'loss': 0.0, 'grad_norm': 1.3043287481195875e-06, 'learning_rate': 9.722222222222223e-06, 'epoch': 0.2}
{'loss': 0.0231, 'grad_norm': 4.734262466430664, 'learning_rate': 9.583333333333335e-06, 'epoch': 0.21}
{'loss': 0.0262, 'grad_norm': 8.099693298339844, 'learning_rate': 9.444444444444445e-06, 'epoch': 0.23}
{'loss': 0.0821, 'grad_norm': 23.373586654663086, 'learning_rate': 9.305555555555557e-06, 'epoch': 0.24}
{'loss': 0.001, 'grad_norm': 0.2686065137386322, 'learning_rate': 9.166666666666666e-06, 'epoch': 0.25}
{'loss': 2.0488, 'grad_norm': 289.2960205078125, 'learning_rate': 9.027777777777779e-06, 'epoch': 0.26}
{'loss': 1.5059, 'grad_norm': 153.73751831054688, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.28}
{'loss': 0.0917, 'grad_norm': 29.95059585571289, 'learning_rate': 8.750000000000001e-06, 'epoch': 0.29}
{'loss': 0.0, 'grad_norm': 0.006617188453674316, 'learning_rate': 8.611111111111112e-06, 'epoch': 0.3}
{'loss': 1.3379, 'grad_norm': 118.33588409423828, 'learning_rate': 8.472222222222223e-06, 'epoch': 0.31}
{'loss': 0.0053, 'grad_norm': 1.879530668258667, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.33}
{'loss': 0.0003, 'grad_norm': 0.11011672019958496, 'learning_rate': 8.194444444444445e-06, 'epoch': 0.34}
{'loss': 2.6172, 'grad_norm': 333.1439514160156, 'learning_rate': 8.055555555555557e-06, 'epoch': 0.35}
{'loss': 0.0062, 'grad_norm': 1.877087116241455, 'learning_rate': 7.916666666666667e-06, 'epoch': 0.36}
{'loss': 0.0, 'grad_norm': 0.009513245895504951, 'learning_rate': 7.77777777777778e-06, 'epoch': 0.38}
{'loss': 0.0861, 'grad_norm': 18.326318740844727, 'learning_rate': 7.638888888888888e-06, 'epoch': 0.39}
{'loss': 5.4883, 'grad_norm': 265.3228759765625, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.4}
{'loss': 8.4609, 'grad_norm': 417.7088623046875, 'learning_rate': 7.361111111111112e-06, 'epoch': 0.41}
{'loss': 0.0058, 'grad_norm': 2.3242433071136475, 'learning_rate': 7.222222222222223e-06, 'epoch': 0.42}
{'loss': 3.1973, 'grad_norm': 245.46603393554688, 'learning_rate': 7.083333333333335e-06, 'epoch': 0.44}
{'loss': 0.0, 'grad_norm': 0.0020497869700193405, 'learning_rate': 6.944444444444445e-06, 'epoch': 0.45}
{'loss': 0.0, 'grad_norm': 0.00045350968139246106, 'learning_rate': 6.8055555555555566e-06, 'epoch': 0.46}
{'loss': 0.2068, 'grad_norm': 56.775306701660156, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.47}
{'loss': 2.7051, 'grad_norm': 254.17144775390625, 'learning_rate': 6.5277777777777784e-06, 'epoch': 0.49}
{'loss': 2.4863, 'grad_norm': 281.6524658203125, 'learning_rate': 6.3888888888888885e-06, 'epoch': 0.5}
{'loss': 0.1627, 'grad_norm': 38.239200592041016, 'learning_rate': 6.25e-06, 'epoch': 0.51}
{'loss': 0.1071, 'grad_norm': 34.477603912353516, 'learning_rate': 6.111111111111112e-06, 'epoch': 0.53}
{'loss': 0.0009, 'grad_norm': 0.31116265058517456, 'learning_rate': 5.972222222222222e-06, 'epoch': 0.54}
{'loss': 0.585, 'grad_norm': 145.36105346679688, 'learning_rate': 5.833333333333334e-06, 'epoch': 0.55}
{'loss': 3.1914, 'grad_norm': 219.47848510742188, 'learning_rate': 5.694444444444445e-06, 'epoch': 0.56}
{'loss': 0.0308, 'grad_norm': 10.527801513671875, 'learning_rate': 5.555555555555557e-06, 'epoch': 0.57}
{'loss': 1.2842, 'grad_norm': 301.1593322753906, 'learning_rate': 5.416666666666667e-06, 'epoch': 0.59}
{'loss': 1.0391, 'grad_norm': 213.5859375, 'learning_rate': 5.2777777777777785e-06, 'epoch': 0.6}
{'loss': 1.3887, 'grad_norm': 224.40745544433594, 'learning_rate': 5.138888888888889e-06, 'epoch': 0.61}
{'loss': 0.0, 'grad_norm': 0.0001462243526475504, 'learning_rate': 5e-06, 'epoch': 0.62}
{'loss': 0.011, 'grad_norm': 3.8770759105682373, 'learning_rate': 4.861111111111111e-06, 'epoch': 0.64}
{'loss': 0.0864, 'grad_norm': 27.435758590698242, 'learning_rate': 4.722222222222222e-06, 'epoch': 0.65}
{'loss': 1.2031, 'grad_norm': 158.62355041503906, 'learning_rate': 4.583333333333333e-06, 'epoch': 0.66}
{'loss': 2.416, 'grad_norm': 382.7137451171875, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.68}
{'loss': 0.009, 'grad_norm': 3.7473411560058594, 'learning_rate': 4.305555555555556e-06, 'epoch': 0.69}
{'loss': 0.0, 'grad_norm': 3.2530779208173044e-06, 'learning_rate': 4.166666666666667e-06, 'epoch': 0.7}
{'loss': 0.0, 'grad_norm': 7.789469691488193e-08, 'learning_rate': 4.027777777777779e-06, 'epoch': 0.71}
{'loss': 5.7305, 'grad_norm': 282.11407470703125, 'learning_rate': 3.88888888888889e-06, 'epoch': 0.72}
{'loss': 0.2406, 'grad_norm': 77.51959228515625, 'learning_rate': 3.7500000000000005e-06, 'epoch': 0.74}
{'loss': 0.0003, 'grad_norm': 0.11655321717262268, 'learning_rate': 3.6111111111111115e-06, 'epoch': 0.75}
{'loss': 0.0007, 'grad_norm': 0.349434494972229, 'learning_rate': 3.4722222222222224e-06, 'epoch': 0.76}
{'loss': 0.0, 'grad_norm': 4.3914107550335757e-07, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.78}
{'loss': 0.2457, 'grad_norm': 74.47624206542969, 'learning_rate': 3.1944444444444443e-06, 'epoch': 0.79}
{'loss': 0.0, 'grad_norm': 0.00029406551038846374, 'learning_rate': 3.055555555555556e-06, 'epoch': 0.8}
{'loss': 0.2075, 'grad_norm': 69.97547149658203, 'learning_rate': 2.916666666666667e-06, 'epoch': 0.81}
{'loss': 5.582, 'grad_norm': 294.60845947265625, 'learning_rate': 2.7777777777777783e-06, 'epoch': 0.82}
{'loss': 1.3311, 'grad_norm': 208.5145721435547, 'learning_rate': 2.6388888888888893e-06, 'epoch': 0.84}
{'loss': 1.0459, 'grad_norm': 224.3618621826172, 'learning_rate': 2.5e-06, 'epoch': 0.85}
{'loss': 0.0302, 'grad_norm': 9.747218132019043, 'learning_rate': 2.361111111111111e-06, 'epoch': 0.86}
{'loss': 7.0195, 'grad_norm': 426.7852478027344, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.88}
{'loss': 0.0005, 'grad_norm': 0.2010657787322998, 'learning_rate': 2.0833333333333334e-06, 'epoch': 0.89}
{'loss': 0.7915, 'grad_norm': 210.749755859375, 'learning_rate': 1.944444444444445e-06, 'epoch': 0.9}
{'loss': 0.1798, 'grad_norm': 51.51899337768555, 'learning_rate': 1.8055555555555557e-06, 'epoch': 0.91}
{'loss': 8.4141, 'grad_norm': 461.1377258300781, 'learning_rate': 1.6666666666666667e-06, 'epoch': 0.93}
{'loss': 0.174, 'grad_norm': 43.665714263916016, 'learning_rate': 1.527777777777778e-06, 'epoch': 0.94}
{'loss': 0.0, 'grad_norm': 0.0017599387792870402, 'learning_rate': 1.3888888888888892e-06, 'epoch': 0.95}
{'loss': 0.9868, 'grad_norm': 191.83932495117188, 'learning_rate': 1.25e-06, 'epoch': 0.96}
{'loss': 7.8672, 'grad_norm': 587.8441772460938, 'learning_rate': 1.111111111111111e-06, 'epoch': 0.97}
{'loss': 0.0, 'grad_norm': 0.0064466665498912334, 'learning_rate': 9.722222222222224e-07, 'epoch': 0.99}
{'loss': 1.6436, 'grad_norm': 314.2796325683594, 'learning_rate': 8.333333333333333e-07, 'epoch': 1.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████| 80/80 [01:15<00:00, 1.11it/s]
11/18/2024 20:33:16 - INFO - FlagEmbedding.finetune.embedder.encoder_only.m3.trainer - Saving model checkpoint to /
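
(Side note on the log above: with --fp16, the recurring 'grad_norm': nan entries usually mean the loss scaler detected an overflow on that step and the optimizer update was skipped. If the training output is captured to a file, say train.log, a hypothetical name, they can be counted with:

grep -c "'grad_norm': nan" train.log
)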

yichuxue commented 1 week ago

Yes, that's how I got it running.

xushan116 commented 1 week ago

@yichuxue One more question: do you know what the train_group_size parameter means and how it should be set? Also, do I need to use the provided script to split the articles by length? If so, do the training parameters need to be set differently, or can the split data be used directly?
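
(For context: in FlagEmbedding, train_group_size is the number of passages grouped with each query per training example, i.e. one positive plus train_group_size - 1 negatives drawn from the "neg" list; if a query has fewer negatives than that, they are reused. The fine-tuning data is one JSON object per line, roughly like the sketch below; field names follow the FlagEmbedding README, the concrete strings are invented, and with --knowledge_distillation True teacher scores are additionally expected in "pos_scores"/"neg_scores":

{"query": "how do I renew a passport", "pos": ["Passports can be renewed online or by mail within ..."], "neg": ["Driver's licenses must be renewed in person ...", "Visa applications require a separate form ...", "..."]}

So with --train_group_size 8, each query would ideally have at least 7 entries in "neg".)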