holopyolo opened this issue 4 weeks ago
Hello, we didn't encounter this issue in our tests. Could you please provide the specific versions of torch, transformers, deepspeed, and flash-attn you are using?
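For anyone gathering these, a quick way to list the relevant versions (a sketch; adjust the package names to your setup):

```bash
# Sketch: show the installed versions of the packages asked about above.
pip list | grep -Ei '^(torch|transformers|deepspeed|flash|sentence-transformers|accelerate)'
```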
I've run into this problem as well. Package versions:
torch 2.3.1
sentence-transformers 3.0.1
transformers 4.44.2
deepspeed 0.14.4
I did not use flash-attn or deepspeed during training. Training script:
source /opt/conda/bin/activate /opt/conda/envs/bge
CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node 2 \
-m FlagEmbedding.finetune.embedder.encoder_only.m3 \
--model_name_or_path BAAI/bge-m3 \
--train_data /mnt/sda/app/embedding/train_data/embedding/20241024 \
--output_dir /mnt/sda/app/embedding/train_out/embedding/20241024 \
--train_group_size 8 \
--query_max_len 256 \
--passage_max_len 2048 \
--pad_to_multiple_of 8 \
--knowledge_distillation True \
--same_dataset_within_batch True \
--small_threshold 0 \
--drop_threshold 0 \
--overwrite_output_dir \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 2 \
--per_device_train_batch_size 20 \
--dataloader_drop_last True \
--warmup_ratio 0.1 \
--gradient_checkpointing \
--logging_steps 1 \
--save_steps 1000 \
--negatives_cross_device True \
--temperature 0.02 \
--sentence_pooling_method cls \
--normalize_embeddings True \
--kd_loss_type m3_kd_loss \
--unified_finetuning True \
--use_self_distill True \
--fix_encoder False \
--self_distill_start_step 0
Error message:
[2024-11-01 07:09:35,232] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[2024-11-01 07:09:35,356] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
0%| | 0/186 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2888: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
warnings.warn(
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2888: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
warnings.warn(
/opt/conda/envs/bge/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/opt/conda/envs/bge/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank1]: return _run_code(code, main_globals, None,
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/runpy.py", line 86, in _run_code
[rank1]: exec(code, run_globals)
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/FlagEmbedding/finetune/embedder/encoder_only/m3/__main__.py", line 22, in <module>
[rank1]: runner.run()
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/FlagEmbedding/abc/finetune/embedder/AbsRunner.py", line 119, in run
[rank1]: self.trainer.train(resume_from_checkpoint=self.training_args.resume_from_checkpoint)
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
[rank1]: return inner_training_loop(
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/transformers/trainer.py", line 3349, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/accelerate/accelerator.py", line 2147, in backward
[rank1]: self.scaler.scale(loss).backward(**kwargs)
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
[rank1]: return user_fn(self, *args)
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 320, in backward
[rank1]: torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/opt/conda/envs/bge/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank1]: Parameter at index 387 with name model.encoder.layer.23.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
Using gradient_checkpointing here needs to be paired with deepspeed, so you can either download a DeepSpeed config and pass it via the deepspeed argument, or remove gradient_checkpointing.
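A minimal sketch of the first option, assuming a DeepSpeed config file is created next to the training script; the file name and contents below are illustrative and not necessarily identical to the ds_stage0.json shipped with the FlagEmbedding examples:

```bash
# Sketch: a minimal DeepSpeed config with ZeRO disabled (stage 0), leaving
# batch-size and fp16 settings for the HF Trainer to fill in via "auto".
cat > ds_stage0.json << 'EOF'
{
  "zero_optimization": { "stage": 0 },
  "fp16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
EOF

# Then add a single flag to the torchrun command above, keeping every other
# argument (including --gradient_checkpointing) unchanged:
#   --deepspeed ./ds_stage0.json
```

The second option is simply to delete the --gradient_checkpointing line from the command and keep training under plain DDP.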
Thanks! It's running successfully now.
@yichuxue could I ask what versions you used in your environment for this bge-m3 fine-tuning? I'd like to run it as well. I'm currently on V100 GPUs with CUDA 10.1 (as reported by nvcc -V) or CUDA 12.0 (as reported by nvidia-smi), Python 3.10, torch 2.5.1, transformers 4.44.2, and deepspeed 0.15.4. Could you help me figure out where the problem is, or list the versions you are using? Thanks!
@yichuxue one more question: are you running without deepspeed? Any advice is appreciated.
@yichuxue after removing deepspeed and gradient_checkpointing it seems to run through... is that how you got it working?
{'loss': 0.0008, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0023, 'grad_norm': 0.7831071615219116, 'learning_rate': 1.25e-06, 'epoch': 0.03}
{'loss': 1.709, 'grad_norm': nan, 'learning_rate': 1.25e-06, 'epoch': 0.04}
{'loss': 0.0433, 'grad_norm': 8.900976181030273, 'learning_rate': 2.5e-06, 'epoch': 0.05}
{'loss': 0.0006, 'grad_norm': 0.20847205817699432, 'learning_rate': 3.7500000000000005e-06, 'epoch': 0.06}
{'loss': 0.7183, 'grad_norm': nan, 'learning_rate': 3.7500000000000005e-06, 'epoch': 0.07}
{'loss': 5.3789, 'grad_norm': nan, 'learning_rate': 3.7500000000000005e-06, 'epoch': 0.09}
{'loss': 0.1038, 'grad_norm': 31.261396408081055, 'learning_rate': 5e-06, 'epoch': 0.1}
{'loss': 0.0, 'grad_norm': 0.000105711464129854, 'learning_rate': 6.25e-06, 'epoch': 0.11}
{'loss': 0.2096, 'grad_norm': 65.53326416015625, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.12}
{'loss': 0.0, 'grad_norm': 8.12459438748192e-06, 'learning_rate': 8.750000000000001e-06, 'epoch': 0.14}
{'loss': 0.0, 'grad_norm': 0.00014261712203733623, 'learning_rate': 1e-05, 'epoch': 0.15}
{'loss': 1.8809, 'grad_norm': nan, 'learning_rate': 1e-05, 'epoch': 0.16}
{'loss': 4.9062, 'grad_norm': nan, 'learning_rate': 1e-05, 'epoch': 0.17}
{'loss': 1.5381, 'grad_norm': 239.84178161621094, 'learning_rate': 9.861111111111112e-06, 'epoch': 0.19}
{'loss': 0.0, 'grad_norm': 1.3043287481195875e-06, 'learning_rate': 9.722222222222223e-06, 'epoch': 0.2}
{'loss': 0.0231, 'grad_norm': 4.734262466430664, 'learning_rate': 9.583333333333335e-06, 'epoch': 0.21}
{'loss': 0.0262, 'grad_norm': 8.099693298339844, 'learning_rate': 9.444444444444445e-06, 'epoch': 0.23}
{'loss': 0.0821, 'grad_norm': 23.373586654663086, 'learning_rate': 9.305555555555557e-06, 'epoch': 0.24}
{'loss': 0.001, 'grad_norm': 0.2686065137386322, 'learning_rate': 9.166666666666666e-06, 'epoch': 0.25}
{'loss': 2.0488, 'grad_norm': 289.2960205078125, 'learning_rate': 9.027777777777779e-06, 'epoch': 0.26}
{'loss': 1.5059, 'grad_norm': 153.73751831054688, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.28}
{'loss': 0.0917, 'grad_norm': 29.95059585571289, 'learning_rate': 8.750000000000001e-06, 'epoch': 0.29}
{'loss': 0.0, 'grad_norm': 0.006617188453674316, 'learning_rate': 8.611111111111112e-06, 'epoch': 0.3}
{'loss': 1.3379, 'grad_norm': 118.33588409423828, 'learning_rate': 8.472222222222223e-06, 'epoch': 0.31}
{'loss': 0.0053, 'grad_norm': 1.879530668258667, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.33}
{'loss': 0.0003, 'grad_norm': 0.11011672019958496, 'learning_rate': 8.194444444444445e-06, 'epoch': 0.34}
{'loss': 2.6172, 'grad_norm': 333.1439514160156, 'learning_rate': 8.055555555555557e-06, 'epoch': 0.35}
{'loss': 0.0062, 'grad_norm': 1.877087116241455, 'learning_rate': 7.916666666666667e-06, 'epoch': 0.36}
{'loss': 0.0, 'grad_norm': 0.009513245895504951, 'learning_rate': 7.77777777777778e-06, 'epoch': 0.38}
{'loss': 0.0861, 'grad_norm': 18.326318740844727, 'learning_rate': 7.638888888888888e-06, 'epoch': 0.39}
{'loss': 5.4883, 'grad_norm': 265.3228759765625, 'learning_rate': 7.500000000000001e-06, 'epoch': 0.4}
{'loss': 8.4609, 'grad_norm': 417.7088623046875, 'learning_rate': 7.361111111111112e-06, 'epoch': 0.41}
{'loss': 0.0058, 'grad_norm': 2.3242433071136475, 'learning_rate': 7.222222222222223e-06, 'epoch': 0.42}
{'loss': 3.1973, 'grad_norm': 245.46603393554688, 'learning_rate': 7.083333333333335e-06, 'epoch': 0.44}
{'loss': 0.0, 'grad_norm': 0.0020497869700193405, 'learning_rate': 6.944444444444445e-06, 'epoch': 0.45}
{'loss': 0.0, 'grad_norm': 0.00045350968139246106, 'learning_rate': 6.8055555555555566e-06, 'epoch': 0.46}
{'loss': 0.2068, 'grad_norm': 56.775306701660156, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.47}
{'loss': 2.7051, 'grad_norm': 254.17144775390625, 'learning_rate': 6.5277777777777784e-06, 'epoch': 0.49}
{'loss': 2.4863, 'grad_norm': 281.6524658203125, 'learning_rate': 6.3888888888888885e-06, 'epoch': 0.5}
{'loss': 0.1627, 'grad_norm': 38.239200592041016, 'learning_rate': 6.25e-06, 'epoch': 0.51}
{'loss': 0.1071, 'grad_norm': 34.477603912353516, 'learning_rate': 6.111111111111112e-06, 'epoch': 0.53}
{'loss': 0.0009, 'grad_norm': 0.31116265058517456, 'learning_rate': 5.972222222222222e-06, 'epoch': 0.54}
{'loss': 0.585, 'grad_norm': 145.36105346679688, 'learning_rate': 5.833333333333334e-06, 'epoch': 0.55}
{'loss': 3.1914, 'grad_norm': 219.47848510742188, 'learning_rate': 5.694444444444445e-06, 'epoch': 0.56}
{'loss': 0.0308, 'grad_norm': 10.527801513671875, 'learning_rate': 5.555555555555557e-06, 'epoch': 0.57}
{'loss': 1.2842, 'grad_norm': 301.1593322753906, 'learning_rate': 5.416666666666667e-06, 'epoch': 0.59}
{'loss': 1.0391, 'grad_norm': 213.5859375, 'learning_rate': 5.2777777777777785e-06, 'epoch': 0.6}
{'loss': 1.3887, 'grad_norm': 224.40745544433594, 'learning_rate': 5.138888888888889e-06, 'epoch': 0.61}
{'loss': 0.0, 'grad_norm': 0.0001462243526475504, 'learning_rate': 5e-06, 'epoch': 0.62}
{'loss': 0.011, 'grad_norm': 3.8770759105682373, 'learning_rate': 4.861111111111111e-06, 'epoch': 0.64}
{'loss': 0.0864, 'grad_norm': 27.435758590698242, 'learning_rate': 4.722222222222222e-06, 'epoch': 0.65}
{'loss': 1.2031, 'grad_norm': 158.62355041503906, 'learning_rate': 4.583333333333333e-06, 'epoch': 0.66}
{'loss': 2.416, 'grad_norm': 382.7137451171875, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.68}
{'loss': 0.009, 'grad_norm': 3.7473411560058594, 'learning_rate': 4.305555555555556e-06, 'epoch': 0.69}
{'loss': 0.0, 'grad_norm': 3.2530779208173044e-06, 'learning_rate': 4.166666666666667e-06, 'epoch': 0.7}
{'loss': 0.0, 'grad_norm': 7.789469691488193e-08, 'learning_rate': 4.027777777777779e-06, 'epoch': 0.71}
{'loss': 5.7305, 'grad_norm': 282.11407470703125, 'learning_rate': 3.88888888888889e-06, 'epoch': 0.72}
{'loss': 0.2406, 'grad_norm': 77.51959228515625, 'learning_rate': 3.7500000000000005e-06, 'epoch': 0.74}
{'loss': 0.0003, 'grad_norm': 0.11655321717262268, 'learning_rate': 3.6111111111111115e-06, 'epoch': 0.75}
{'loss': 0.0007, 'grad_norm': 0.349434494972229, 'learning_rate': 3.4722222222222224e-06, 'epoch': 0.76}
{'loss': 0.0, 'grad_norm': 4.3914107550335757e-07, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.78}
{'loss': 0.2457, 'grad_norm': 74.47624206542969, 'learning_rate': 3.1944444444444443e-06, 'epoch': 0.79}
{'loss': 0.0, 'grad_norm': 0.00029406551038846374, 'learning_rate': 3.055555555555556e-06, 'epoch': 0.8}
{'loss': 0.2075, 'grad_norm': 69.97547149658203, 'learning_rate': 2.916666666666667e-06, 'epoch': 0.81}
{'loss': 5.582, 'grad_norm': 294.60845947265625, 'learning_rate': 2.7777777777777783e-06, 'epoch': 0.82}
{'loss': 1.3311, 'grad_norm': 208.5145721435547, 'learning_rate': 2.6388888888888893e-06, 'epoch': 0.84}
{'loss': 1.0459, 'grad_norm': 224.3618621826172, 'learning_rate': 2.5e-06, 'epoch': 0.85}
{'loss': 0.0302, 'grad_norm': 9.747218132019043, 'learning_rate': 2.361111111111111e-06, 'epoch': 0.86}
{'loss': 7.0195, 'grad_norm': 426.7852478027344, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.88}
{'loss': 0.0005, 'grad_norm': 0.2010657787322998, 'learning_rate': 2.0833333333333334e-06, 'epoch': 0.89}
{'loss': 0.7915, 'grad_norm': 210.749755859375, 'learning_rate': 1.944444444444445e-06, 'epoch': 0.9}
{'loss': 0.1798, 'grad_norm': 51.51899337768555, 'learning_rate': 1.8055555555555557e-06, 'epoch': 0.91}
{'loss': 8.4141, 'grad_norm': 461.1377258300781, 'learning_rate': 1.6666666666666667e-06, 'epoch': 0.93}
{'loss': 0.174, 'grad_norm': 43.665714263916016, 'learning_rate': 1.527777777777778e-06, 'epoch': 0.94}
{'loss': 0.0, 'grad_norm': 0.0017599387792870402, 'learning_rate': 1.3888888888888892e-06, 'epoch': 0.95}
{'loss': 0.9868, 'grad_norm': 191.83932495117188, 'learning_rate': 1.25e-06, 'epoch': 0.96}
{'loss': 7.8672, 'grad_norm': 587.8441772460938, 'learning_rate': 1.111111111111111e-06, 'epoch': 0.97}
{'loss': 0.0, 'grad_norm': 0.0064466665498912334, 'learning_rate': 9.722222222222224e-07, 'epoch': 0.99}
{'loss': 1.6436, 'grad_norm': 314.2796325683594, 'learning_rate': 8.333333333333333e-07, 'epoch': 1.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████| 80/80 [01:15<00:00, 1.11it/s]11/18/2024 20:33:16 - INFO - FlagEmbedding.finetune.embedder.encoder_only.m3.trainer - Saving model checkpoint to /
Yes, that's how I got it running.
@yichuxue another question: do you know what the train_group_size parameter means and how to set it? Also, do I need to use the provided script to split documents by length? If I split them, do any training parameters need to change, or can the split data be used directly?
torchrun --nproc_per_node 1 \
-m FlagEmbedding.finetune.reranker.encoder_only.base \
--model_name_or_path BAAI/bge-reranker-v2-m3 \
--cache_dir ./cache/model \
--train_data ./train.json \
--cache_path ./cache/data \
--train_group_size 3 \
--query_max_len 512 \
--passage_max_len 512 \
--pad_to_multiple_of 8 \
--knowledge_distillation False \
--output_dir ./model \
--overwrite_output_dir \
--learning_rate 6e-5 \
--fp16 \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--dataloader_drop_last True \
--warmup_ratio 0.1 \
--gradient_checkpointing \
--weight_decay 0.01 \
--deepspeed ../ds_stage0.json \
--logging_steps 10 \
--save_steps 500
Not reproducible on earlier versions.