awzhgw opened this issue 9 months ago
Is this stage 2 or stage 3? Stage 3 with MoE cannot use zero3; you can use zero2_offload instead of zero3 to support a larger batch size. The hang also seems to be caused by a DeepSpeed problem, please refer to:
@LinB203 Running Mixtral 8x7B with zero2 offload leads to OOM.
@LinB203 The DeepSpeed ZeRO-3 problem in https://github.com/PKU-YuanGroup/Video-LLaVA/issues/48 has already been fixed and merged into the master branch; I tested with the master branch.
Can you run DeepSpeed's MoE with zero3?
> Can you run DeepSpeed's MoE with zero3?
Do you mean running finetune_moe.sh directly?
When I switched to zero2_offload I hit the same problem: it hangs after about 270 steps. The strange thing is that once I remove the video data, it runs with no problem at all. Why would that be?
> Can you run DeepSpeed's MoE with zero3?
However, I can run the Mixtral 8x7B model directly with DeepSpeed zero3. I have already verified this; the code below runs without any problem.
```python
import argparse
import deepspeed
import torch
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, get_linear_schedule_with_warmup, set_seed
from accelerate import Accelerator, DistributedType
from torch.utils.data import Dataset
from accelerate.utils import DummyOptim, DummyScheduler, set_seed
import math
from accelerate.utils import DeepSpeedPlugin, FullyShardedDataParallelPlugin
from transformers import get_scheduler
from deepspeed.utils import set_z3_leaf_modules, get_z3_leaf_modules  # Mixtral ZeRO-3 leaf modules
from deepspeed.accelerator import get_accelerator
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock, MixtralForCausalLM
from transformers.integrations import is_deepspeed_zero3_enabled

MAX_GPU_BATCH_SIZE = 4


class RandomDataset(Dataset):
    def __init__(self, num_samples: int = 1000, max_length: int = 2048, vocab_size: int = 100, tokenizer=None):
        self.num_samples = num_samples
        self.max_length = max_length
        self.input_ids = torch.randint(2, vocab_size, (num_samples, max_length))
        self.attention_mask = torch.ones_like(self.input_ids)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.input_ids[idx],
        }


def training_function(args):
    get_accelerator().set_device(args.local_rank)
    # Initialize accelerator
    deepPlugin = DeepSpeedPlugin(hf_ds_config=args.conf, zero3_init_flag=True)
    accelerator = Accelerator(mixed_precision='bf16', deepspeed_plugin=deepPlugin, gradient_accumulation_steps=1)

    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
    lr = 2e-5
    num_epochs = 2000000
    seed = 42
    batch_size = 16
    warmup_ratio = 0.03

    model_id = args.model_path
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    dataset = RandomDataset(num_samples=10000, tokenizer=tokenizer)
    train_dataloader = DataLoader(
        dataset, shuffle=True, collate_fn=None, batch_size=batch_size, drop_last=True
    )
    if accelerator.is_main_process:
        print(f'before prepare dataloader len: {len(train_dataloader)}')

    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / accelerator.gradient_accumulation_steps)
    max_train_steps = num_epochs * num_update_steps_per_epoch

    config = AutoConfig.from_pretrained(model_id)
    config.num_hidden_layers = 1
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        config=config,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=(not is_deepspeed_zero3_enabled())
    )
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
    model.enable_input_require_grads()
    model.config.use_cache = False  # turn off when gradient checkpointing is enabled
    print("Gradient checkpointing enabled.")

    set_z3_leaf_modules(model, [MixtralSparseMoeBlock])  # z3_leaf
    print('get z3_leaf_module is ', get_z3_leaf_modules(model))
    model.train()

    optimizer_cls = (
        torch.optim.AdamW
        if accelerator.state.deepspeed_plugin is None
        or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
        else DummyOptim
    )
    optimizer = optimizer_cls(params=model.parameters(), lr=lr)

    if (
        accelerator.state.deepspeed_plugin is None
        or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
    ):
        lr_scheduler = get_scheduler(
            name='linear',
            optimizer=optimizer,
            num_warmup_steps=math.ceil(max_train_steps * warmup_ratio),
            num_training_steps=max_train_steps,
        )
    else:
        lr_scheduler = DummyScheduler(
            optimizer, total_num_steps=max_train_steps, warmup_num_steps=math.ceil(max_train_steps * warmup_ratio)
        )

    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )

    # Now we train the model
    for epoch in range(num_epochs):
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(model):
                model.train()
                outputs = model(**batch)
                loss = outputs.loss
                accelerator.backward(loss)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
                if accelerator.is_main_process and step % 10 == 0:
                    print(f'epoch: {epoch}, step: {step}, loss: {loss.item()}')


def main():
    parser = argparse.ArgumentParser(description="Simple example of training script.")
    parser.add_argument(
        "--model_path",
        type=str,
        default="/export/App/training_platform/PinoModel/Mixtral-8x7B-Instruct-v0.1",
    )
    parser.add_argument(
        "--mixed_precision",
        type=str,
        default="bf16",
        choices=["no", "fp16", "bf16", "fp8"],
        help="Whether to use mixed precision. Choose "
        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10 "
        "and an Nvidia Ampere GPU.",
    )
    parser.add_argument(
        "--conf",
        type=str,
        default="./scripts/ds_conf.json",
    )
    parser.add_argument(
        "--local_rank",
        type=int,
        default=-1,
    )
    args = parser.parse_args()
    training_function(args)


if __name__ == "__main__":
    main()
```
Our MoE implementation is different from HF's Mixtral implementation. The MoE implemented by DeepSpeed can only run with zero2, not zero3.
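For readers who have not seen it, here is a rough sketch of the kind of layer that statement refers to: DeepSpeed's own MoE wrapper, as opposed to HF's `MixtralSparseMoeBlock` used in the standalone test above. The expert module and sizes below are made-up placeholders, not MoE-LLaVA's actual code.

```python
# Illustrative sketch only: a transformer FFN slot replaced by DeepSpeed's MoE wrapper.
# Expert module and sizes are placeholders, not MoE-LLaVA's real modules.
import torch.nn as nn
from deepspeed.moe.layer import MoE


class ExpertFFN(nn.Module):
    """One expert: a plain feed-forward block (placeholder sizes)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, 4 * hidden_size)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


class MoEFFNBlock(nn.Module):
    """FFN slot of a transformer layer, swapped for DeepSpeed's MoE layer."""
    def __init__(self, hidden_size: int = 1024, num_experts: int = 4):
        super().__init__()
        # DeepSpeed wraps the expert with a top-k gate and expert-parallel
        # communication; this wrapper is what "DeepSpeed's MoE" means above.
        self.moe = MoE(
            hidden_size=hidden_size,
            expert=ExpertFFN(hidden_size),
            num_experts=num_experts,  # placeholder
            k=2,                      # top-k routing
        )

    def forward(self, hidden_states):
        # DeepSpeed's MoE forward returns (output, auxiliary load-balancing loss, expert counts).
        output, l_aux, _ = self.moe(hidden_states)
        return output, l_aux
```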
@LinB203 But when I replace the backend model with Mixtral 8x7B, why does deleting the video data make it run normally? If the data is all images, there is no problem. My repository is here: [git@github.com:awzhgw/MoE-LLaVA.git](https://github.com/awzhgw/MoE-LLaVA.git)
Also, after switching pretrain.sh to zero2.json or zero2_offload.json, training hangs after 270 steps in both cases and then reports an NCCL timeout (with both video and image data present).
But once I remove the video data, it runs normally.
When I integrated the LanguageBind_Video_merge model, I observed hangs during training.
After 30 minutes it reports an NCCL timeout error. If I remove the video-related data, training works fine.
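In case it helps narrow down the hang, here is a small sketch of generic debugging knobs (standard PyTorch/NCCL environment variables and an accelerate kwargs handler, nothing MoE-LLaVA-specific) that could make the stuck collective visible instead of ending in a silent 30-minute timeout:

```python
# Sketch only: generic NCCL/accelerate debugging aids, assumed (not taken from this repo).
# Set the env vars before the process group is created, e.g. at the very top of the
# training script or in the launch environment.
import os
from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

os.environ.setdefault("NCCL_DEBUG", "INFO")              # per-rank NCCL logs to see which collective is stuck
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")  # tear the job down with an error instead of hanging
                                                         # (newer PyTorch uses TORCH_NCCL_ASYNC_ERROR_HANDLING)

# Give collectives more headroom than the roughly 30-minute watchdog seen above,
# so a slow video batch can be told apart from a genuine deadlock. In the real
# script this handler would be merged into the existing Accelerator(...) call.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(mixed_precision="bf16", kwargs_handlers=[pg_kwargs])
```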