NLPJCL / RAG-Retrieval

Unify Efficient Fine-tuning of RAG Retrieval, including Embedding, ColBERT, ReRanker.
MIT License

Why does fine-tuning an embedding model involve the passage_max_len parameter? #28

Closed · WSC741606 closed this issue 4 months ago

WSC741606 commented 4 months ago

Hi, a question as per the title: https://github.com/NLPJCL/RAG-Retrieval/blob/master/rag_retrieval/train/embedding/train_embedding.py has a passage_max_len parameter. Taking BAAI/bge-base-zh-v1.5 as an example, setting it to anything above 512 raises an error:

File "/data/home/user/Test/Env/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 1072, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (512) at non-singleton dimension 1.  Target sizes: [128, 1024].  Tensor sizes: [1, 512]

However, the passages produced during data generation are 1024 tokens long, so how is that handled here? Are they simply truncated to the first 512 tokens? Training with sentence-transformers, as in https://github.com/percent4/embedding_model_exp/blob/main/src/finetune/ft_embedding.py, does not run into this problem. Does the sentence-transformers library already preprocess the inputs? And can this 512 limit be extended manually?
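
For reference, a minimal sketch (not from the repo) of where the 512 comes from: bge-base-zh-v1.5 is a BERT-style model whose position-embedding table is limited to max_position_embeddings = 512, and sentence-transformers truncates inputs to its model.max_seq_length, which would explain why the other script never hits the error:

from transformers import AutoConfig, AutoTokenizer

# The position-embedding table caps the usable sequence length for BERT-style models.
config = AutoConfig.from_pretrained("BAAI/bge-base-zh-v1.5")
print(config.max_position_embeddings)  # 512

# Truncating to that limit (as sentence-transformers does internally) avoids the
# size-mismatch error; tokenizing this model's inputs with max_length=1024 does not.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-zh-v1.5")
batch = tokenizer(["a long passage ..."], truncation=True,
                  max_length=config.max_position_embeddings, return_tensors="pt")
print(batch["input_ids"].shape)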

WSC741606 commented 4 months ago

Also, the QR code for the WeChat group has expired; could you post a new one?

WSC741606 commented 4 months ago

Found it: 512 should be the model's own maximum sequence length, so the other library must be truncating automatically. But when I try switching to bge-m3 (which supports a sequence length of 8192), I get an error:

Batch size: 128
Start with seed: 666
Output dir: ./output/Test_mrl1792
Model_name_or_path: BAAI/bge-m3
Dataset: ../../../Data/Test.train.jsonl
mixed_precision: fp16
gradient_accumulation_steps: 1
temperature: 0.02
log_with: wandb
neg_nums: 15
query_max_len: 128
passage_max_len: 1024
use_mrl: True
mrl_dims: [128, 256, 512, 768, 1024, 1280, 1536, 1792]
/data/home/user/Test/user-Env/lib/python3.9/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
sentence_transformers model is not mrl model, init scaling_layer weight.
Traceback (most recent call last):
  File "/data/home/user/Test/GitLibrary/RAG-Retrieval/rag_retrieval/train/embedding/train_embedding.py", line 190, in <module>
    main()
  File "/data/home/user/Test/GitLibrary/RAG-Retrieval/rag_retrieval/train/embedding/train_embedding.py", line 117, in main
    model = accelerator.prepare(model)
  File "/data/home/user/Test/user-Env/lib/python3.9/site-packages/accelerate/accelerator.py", line 1304, in prepare
    result = tuple(
  File "/data/home/user/Test/user-Env/lib/python3.9/site-packages/accelerate/accelerator.py", line 1305, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/data/home/user/Test/user-Env/lib/python3.9/site-packages/accelerate/accelerator.py", line 1181, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/data/home/user/Test/user-Env/lib/python3.9/site-packages/accelerate/accelerator.py", line 1461, in prepare_model
    self.state.fsdp_plugin.set_auto_wrap_policy(model)
  File "/data/home/user/Test/user-Env/lib/python3.9/site-packages/accelerate/utils/dataclasses.py", line 1367, in set_auto_wrap_policy
    raise Exception("Could not find the transformer layer class to wrap in the model.")
Exception: Could not find the transformer layer class to wrap in the model.

NLPJCL commented 4 months ago

Please refer to https://github.com/NLPJCL/RAG-Retrieval/issues/5 and modify the configuration file accordingly.
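
For context, a rough sketch of what that change likely involves (issue #5 is not quoted here, so treat the details as an assumption): the exception is raised by accelerate's FSDP auto-wrap policy when the layer class named in the config (e.g. BertLayer) does not exist in the model, and bge-m3 is XLM-RoBERTa based. The class name the config needs can be read off the loaded model:

from transformers import AutoModel

# List the encoder layer classes of bge-m3; for this XLM-RoBERTa based model the name
# is XLMRobertaLayer, whereas BERT-style models such as bge-base-zh-v1.5 use BertLayer.
model = AutoModel.from_pretrained("BAAI/bge-m3")
print({type(layer).__name__ for layer in model.encoder.layer})

# The printed class name is what the FSDP wrap policy expects, typically supplied via the
# accelerate config file (fsdp_transformer_layer_cls_to_wrap) or the
# FSDP_TRANSFORMER_CLS_TO_WRAP environment variable.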

WSC741606 commented 4 months ago

Got it, thanks for the reply! I'll give it a try.

WSC741606 commented 4 months ago

Training runs normally now, many thanks.

NLPJCL commented 4 months ago

The WeChat group QR code has been updated.