Open liu-yx17 opened 22 hours ago
Actually, you only need to swap the config file: replace ../../../config/default_fsdp.yaml with ../../../config/deepspeed/deepspeed_zero2.yaml. The same applies to both the reranker and embedding scripts.
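For context, swapping that yaml roughly corresponds to configuring accelerate with a DeepSpeed ZeRO-2 plugin instead of FSDP. A minimal programmatic sketch (the argument values here are illustrative, not the repo's actual config values):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Roughly what deepspeed_zero2.yaml selects: DeepSpeed with ZeRO stage 2
# (optimizer states and gradients sharded) instead of the default FSDP setup.
ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                   # ZeRO-2
    gradient_accumulation_steps=1,  # illustrative value
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
```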
[ERROR] 2024-11-27-19:44:37 (PID:1509814, Device:2, RankID:2) ERR02002 DIST invalid type
    result = self._prepare_deepspeed(*args)
  File "/work/home/tel/tel/accelerate/src/accelerate/accelerator.py", line 1851, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 269, in __init__
    self._configure_distributed_model(model)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1201, in _configure_distributed_model
    self._broadcast_model()
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1120, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/work/.env/tel/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/work/.env/tel/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 200, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/work/.env/tel/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/work/.env/tel/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Tensors must be contiguous
我将"hf1/chinese-roberta-wxm-ext"训练中的default_fsdp.yaml换成了deepspeed_zero2.yaml,会出现上述bug
Replacing the yaml in the embedding training instead produces:
ValueError: When using DeepSpeed, accelerate.prepare() requires you to pass at least one of training or evaluation dataloaders with batch_size attribute returning an integer value, or alternatively set an integer value in train_micro_batch_size_per_gpu in the deepspeed config file, or assign integer value to AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu'].
Sorry, the first problem turned out to be a DeepSpeed issue and is now resolved. The embedding-training bug is still open.
You can adapt the embedding code by referring to the reranker. It is exactly what the error message says: pass the dataloaders to accelerate.prepare() as well.
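A minimal sketch of the two options the error message offers (the toy model/optimizer/dataloader below are placeholders standing in for the repo's real objects):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.state import AcceleratorState

accelerator = Accelerator()

# Placeholder objects standing in for the real embedding model and data.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_dataloader = DataLoader(TensorDataset(torch.randn(16, 4)), batch_size=8)

# Option 1: pass the dataloader through prepare() too, so accelerate can
# read its batch_size and fill in train_micro_batch_size_per_gpu itself.
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

# Option 2: set the micro batch size in the DeepSpeed config directly
# (only meaningful when launched with a DeepSpeed accelerate config).
state = AcceleratorState()
if state.deepspeed_plugin is not None:
    state.deepspeed_plugin.deepspeed_config["train_micro_batch_size_per_gpu"] = 8
```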
Merging the prepare() wrapping at lines 117 and 158 of train_embedding.py solved it.
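A hedged reconstruction of what that merge looks like (the actual code at those lines is not quoted in this thread):

```python
def prepare_all(accelerator, model, optimizer, train_dataloader):
    # Before: line 117 wrapped the model/optimizer and line 158 wrapped the
    # dataloader in two separate accelerator.prepare() calls, so the
    # DeepSpeed engine was initialized before it could see any dataloader
    # with a batch_size attribute.
    #
    # After: one combined call, which avoids the ValueError above.
    return accelerator.prepare(model, optimizer, train_dataloader)
```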
1. For reranker training, LLM-type models are trained with deepspeed by default, while BERT-type models default to fsdp. How can a BERT-type model be trained with deepspeed? Could a usage example be added?
2. Alternatively, for embedding training, how can deepspeed be used? Could a usage example be added for that as well? Thanks!