NLPJCL / RAG-Retrieval

Unify Efficient Fine-tuning of RAG Retrieval, including Embedding, ColBERT, ReRanker.
MIT License
522 stars 48 forks

FSDP and DeepSpeed training modes #49

Open liu-yx17 opened 22 hours ago

liu-yx17 commented 22 hours ago

1. For reranker training, LLM-style models default to DeepSpeed while BERT-style models default to FSDP. How can a BERT-style model be trained with DeepSpeed? Could you add a usage example?
2. Likewise, how can embedding training be run with DeepSpeed? Could you add a usage example for that as well? Thanks.

NLPJCL commented 22 hours ago

You only need to swap the config file: replace `../../../config/default_fsdp.yaml` with `../../../config/deepspeed/deepspeed_zero2.yaml`. The same applies to both the reranker and the embedding trainers.
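For reference, an Accelerate config for DeepSpeed ZeRO-2 generally looks like the sketch below. This is an illustrative fragment only, not the repo's actual `deepspeed_zero2.yaml`; field values (process counts, precision) are assumptions you would adapt to your cluster:

```yaml
# Illustrative Accelerate + DeepSpeed ZeRO-2 config (values are examples)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 2
```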

liu-yx17 commented 22 hours ago

```
[ERROR] 2024-11-27-19:44:37 (PID:1509814, Device:2, RankID:2) ERR02002 DIST invalid type
    result = self._prepare_deepspeed(*args)
  File "/work/home/tel/tel/accelerate/src/accelerate/accelerator.py", line 1851, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
    self._configure_distributed_model(model)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1201, in _configure_distributed_model
    self._broadcast_model()
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1120, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/work/.env/tel/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 200, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/work/.env/tel/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/work/.env/tel/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Tensors must be contiguous
```

After swapping `default_fsdp.yaml` for `deepspeed_zero2.yaml` in the "hfl/chinese-roberta-wwm-ext" training run, the error above occurs. Making the same yaml swap in the embedding training instead raises:

```
ValueError: When using DeepSpeed, accelerate.prepare() requires you to pass at least one of training or
evaluation dataloaders with batch_size attribute returning an integer value, or alternatively set an
integer value in train_micro_batch_size_per_gpu in the deepspeed config file, or assign an integer value
to AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu'].
```
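A common workaround for the "Tensors must be contiguous" broadcast error, seen in this or similar DeepSpeed setups, is to force every parameter tensor to be contiguous before the model is handed to DeepSpeed. The helper below is a sketch (not part of this repo); `make_parameters_contiguous` is a hypothetical name:

```python
import torch

def make_parameters_contiguous(model: torch.nn.Module) -> None:
    """Force all parameter storage contiguous.

    DeepSpeed's _broadcast_model() calls dist.broadcast(p.data, ...),
    which requires contiguous tensors; parameters that were transposed
    or tied can end up non-contiguous and trigger the RuntimeError.
    """
    for p in model.parameters():
        if not p.data.is_contiguous():
            p.data = p.data.contiguous()
```

You would call this on the model right before `accelerator.prepare(...)`.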

liu-yx17 commented 21 hours ago

Sorry, the first problem turned out to be a DeepSpeed issue and is now resolved. The embedding-training bug is still unresolved.

NLPJCL commented 21 hours ago

https://github.com/NLPJCL/RAG-Retrieval/blob/a60906d1ee6f3ec7242a516b9989e36a793873ab/rag_retrieval/train/reranker/train_reranker.py#L137

You can adapt the embedding code by following the reranker's approach. It is just what the error message says: pass the dataloaders to accelerator.prepare() as well.

liu-yx17 commented 20 hours ago

Resolved by merging the prepare() wrapping at lines 117 and 158 of train_embedding.py.