OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0

Hands-on LOMO training of 65B LLaMA: Lomo is incompatible with pipeline parallelism #152

Open zlh1992 opened 8 months ago

zlh1992 commented 8 months ago

Configuration:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 train.py

config.tp_size = 1
config.dp_size = 1  # or 8, makes no difference
config.pp_size = 1
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1
config.train_micro_batch_size = 1
config.eval_batch_size = 1
config.ds_config = {
    "fp16": {
        "enabled": True
    },
    "zero_allow_untested_optimizer": True,
    "zero_force_ds_cpu_optimizer": False,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False
        }
    }
}

8× A100, each using around 30 GB of GPU memory; host RAM usage is about 130 GB, with a peak of roughly 400 GB while loading the model with offload.
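As a rough sanity check on those numbers (back-of-the-envelope arithmetic, not an exact accounting), the fp16 weights of a 65B-parameter model alone are about 130 GB, which lines up with the reported host RAM usage, and the ZeRO-3 shard of those weights per GPU is about 16 GB, leaving activations and DeepSpeed buffers to explain the rest of the ~30 GB per card:

```python
# Back-of-the-envelope memory arithmetic for the reported numbers
# (rough estimates under assumed sizes, not an exact accounting).
n_params = 65e9            # LLaMA-65B parameter count
fp16_bytes = 2
n_gpus = 8

weights_total_gb = n_params * fp16_bytes / 1e9     # ~130 GB of fp16 weights
weights_per_gpu_gb = weights_total_gb / n_gpus     # ~16 GB per GPU under ZeRO-3 sharding

print(f"fp16 weights, total:   {weights_total_gb:.0f} GB")
print(f"fp16 weights, per GPU: {weights_per_gpu_gb:.1f} GB")
# The gap up to the observed ~30 GB per GPU would be activations, temporarily
# gathered parameters, and DeepSpeed communication buffers.
```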

I wanted to try pipeline parallelism, so I changed the configuration as follows (a quick sanity check follows below):

config.tp_size = 4
config.dp_size = 1
config.pp_size = 2
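One quick check on such a change (a sketch; the exact constraint CoLLiE enforces is an assumption here) is that the product of the parallel degrees should cover the launched world size, which holds for tp=4, dp=1, pp=2 on 8 GPUs:

```python
import os

# Sketch of a sanity check: tp_size * dp_size * pp_size should match the number
# of launched ranks (nproc_per_node=8 above). The exact rule CoLLiE enforces is
# an assumption; adjust to the version you run.
tp_size, dp_size, pp_size = 4, 1, 2
world_size = int(os.environ.get("WORLD_SIZE", "8"))
assert tp_size * dp_size * pp_size == world_size, "parallel degrees must multiply to the world size"
```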

In collie/module.py, the following change is needed (see the illustration after the patch):

self.parts = [int(i) for i in self.parts]
os.environ["COLLIE_PP_PARTS"] = json.dumps(self.parts)
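The cast matters presumably because the partition boundaries come back as NumPy integers (an assumption about COLLIE_PP_PARTS; the values below are made up), which the standard json module refuses to serialize:

```python
import json
import numpy as np

# Hypothetical partition boundaries; real values come from CoLLiE's pipeline split.
parts = [np.int64(0), np.int64(40), np.int64(80)]

try:
    json.dumps(parts)
except TypeError as e:
    print(e)  # Object of type int64 is not JSON serializable

# Casting to built-in ints, as in the patch above, makes the dump succeed.
print(json.dumps([int(i) for i in parts]))  # [0, 40, 80]
```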

For now I find this is not yet supported: Lomo is incompatible with pipeline parallelism.

KaiLv69 commented 8 months ago

Hi, because of the fused_backward pass, LOMO does not support pipeline parallelism with 1F1B scheduling. Around 30 GB on each of the 8 A100s is indeed a bit high; DeepSpeed's communication buffers may be taking up a lot of the GPU memory. Compatibility between LOMO and DeepSpeed's offload has not been tested yet, so we don't know how it performs, nor whether anything is actually being offloaded to the CPU.
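For intuition, here is a minimal, hedged sketch in plain PyTorch (not CoLLiE's actual fused_backward code) of a LOMO-style fused update: each parameter is updated inside the backward pass via a gradient hook, so there is no separate optimizer.step() and no stored full gradient. Presumably this is what clashes with 1F1B scheduling, where backward passes of different micro-batches interleave and such in-backward updates would change the weights mid-step.

```python
import torch

# A minimal sketch (assumed, not CoLLiE's actual implementation) of a
# LOMO-style fused update: each parameter is updated as soon as its gradient
# is computed during backward(), so no gradient buffer or optimizer state
# needs to be kept around.
lr = 1e-3

def fused_update_hook(p):
    def hook(grad):
        with torch.no_grad():
            p.add_(grad, alpha=-lr)        # SGD-style update inside backward
        return torch.zeros_like(grad)      # drop the gradient to save memory
    return hook

model = torch.nn.Linear(4, 2)
for p in model.parameters():
    p.register_hook(fused_update_hook(p))

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()   # weights are already updated by the time this returns
```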