PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

LanguageBind_Video model may hang? #21

Open awzhgw opened 9 months ago

awzhgw commented 9 months ago

After I integrated the LanguageBind_Video_merge model, training hangs.

After about 30 minutes it fails with an NCCL timeout. If I remove the video data, training runs normally.

root@A03-R40-I16-12-8000045:/export/App/training_platform/PinoModel# py-spy dump -p 3261644
Process 3261644: /usr/bin/python -u moellava/train/train_mem.py --local_rank=5 --deepspeed ./scripts/zero3_offload.json --model_name_or_path /export/App/training_platform/PinoModel/mixtral/Mixtral-8x7B-Instruct-v0.1 --version mixtral --data_path /mnt/moe/moe/dataset/data_root/train_json/pretrain/valley_llavaimage.json --image_folder /mnt/moe/moe/dataset/data_root --image_tower /export/App/training_platform/PinoModel/openai/clip-vit-large-patch14-336 --image_projector_type mlp2x_gelu --video_tower /export/App/training_platform/PinoModel/LanguageBind/LanguageBind_Video_merge --video_folder /mnt/moe/moe/dataset/data_root --tune_mm_mlp_adapter True --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir ./checkpoints/llavamixtral-7b-pretrain --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 2400 --save_total_limit 1 --learning_rate 1e-3 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 8 --lazy_preprocess True --report_to tensorboard --cache_dir ./cache_dir
Python v3.10.12 (/usr/bin/python3.10)

Thread 3261644 (active): "MainThread"
    <listcomp> (deepspeed/runtime/zero/partition_parameters.py:1138)
    _all_gather_dtype (deepspeed/runtime/zero/partition_parameters.py:1138)
    all_gather_coalesced (deepspeed/runtime/zero/partition_parameters.py:1252)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    __all_gather_params_ (deepspeed/runtime/zero/partitioned_param_coordinator.py:458)
    __all_gather_params (deepspeed/runtime/zero/partitioned_param_coordinator.py:429)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    fetch_sub_module (deepspeed/runtime/zero/partitioned_param_coordinator.py:380)
    decorate_context (torch/utils/_contextlib.py:115)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    pre_sub_module_forward_function (deepspeed/runtime/zero/parameter_offload.py:452)
    decorate_context (torch/utils/_contextlib.py:115)
    _pre_forward_module_hook (deepspeed/runtime/zero/parameter_offload.py:340)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    _call_impl (torch/nn/modules/module.py:1557)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (transformers/models/clip/modeling_clip.py:263)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (transformers/models/clip/modeling_clip.py:372)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (torch/utils/checkpoint.py:230)
    apply (torch/autograd/function.py:539)
    checkpoint (torch/utils/checkpoint.py:450)
    inner (torch/_dynamo/external_utils.py:17)
    _fn (torch/_dynamo/eval_frame.py:333)
    inner (torch/_compile.py:24)
    forward (transformers/models/clip/modeling_clip.py:622)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (transformers/models/clip/modeling_clip.py:844)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (transformers/models/clip/modeling_clip.py:917)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (clip_encoder.py:50)
    decorate_context (torch/utils/_contextlib.py:115)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    encode_images (moellava/model/llava_arch.py:152)
    prepare_inputs_labels_for_multimodal (moellava/model/llava_arch.py:198)
    forward (llava_mixtral.py:83)
    _call_impl (torch/nn/modules/module.py:1568)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    forward (deepspeed/runtime/engine.py:1842)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    _call_impl (torch/nn/modules/module.py:1527)
    _wrapped_call_impl (torch/nn/modules/module.py:1518)
    compute_loss (transformers/trainer.py:2795)
    training_step (transformers/trainer.py:2772)
    _inner_training_loop (transformers/trainer.py:1868)
    train (transformers/trainer.py:1539)
    train (train.py:1475)
    <module> (train_mem.py:13)
Thread 3262753 (idle): "Thread-1"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    wait_result_broken_or_wakeup (concurrent/futures/process.py:385)
    run (concurrent/futures/process.py:320)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3264158 (idle): "Thread-2"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3267395 (idle): "Thread-3 (_pin_memory_loop)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    do_one_step (torch/utils/data/_utils/pin_memory.py:31)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:54)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268088 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268152 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268153 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268154 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268155 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268156 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268157 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3268158 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 3303923 (idle)
Thread 3303931 (idle)
Thread 3303916 (idle)
Thread 3303934 (idle)
Thread 3303942 (idle)
Thread 3303945 (idle)
Thread 3303952 (idle)
Thread 3303949 (idle)
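
(Reading the dump: the main thread is blocked in a ZeRO-3 parameter all-gather, _all_gather_dtype in partition_parameters.py, inside the CLIP vision tower's forward pass, i.e. this rank is waiting on a collective that some peer apparently never entered.)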
LinB203 commented 9 months ago

Is this stage 2 or stage 3? Stage 3 with MoE cannot use ZeRO-3. You can use zero2_offload instead of zero3 to support a larger batch size. The hang also seems to be a DeepSpeed problem; please refer to:

https://github.com/PKU-YuanGroup/Video-LLaVA/issues/48
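
For reference, a zero2_offload-style DeepSpeed config usually looks like the following (a minimal sketch in Python dict form; the repo's actual scripts/zero2_offload.json may differ):

ds_config = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,  # ZeRO-2: shard optimizer state and gradients, but not parameters
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}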

awzhgw commented 9 months ago

@LinB203 Running Mixtral 7Bx8 with zero2_offload leads to OOM.

@LinB203 The DeepSpeed ZeRO-3 problem in https://github.com/PKU-YuanGroup/Video-LLaVA/issues/48 has already been fixed and merged into the master branch; I tested with the master branch.

LinB203 commented 9 months ago

@LinB203 Running Mixtral 7Bx8 with zero2_offload leads to OOM.

@LinB203 The DeepSpeed ZeRO-3 problem in PKU-YuanGroup/Video-LLaVA#48 has already been fixed and merged into the master branch; I tested with the master branch.

Can you run DeepSpeed's MoE with ZeRO-3?

awzhgw commented 9 months ago

@LinB203 Running Mixtral 7Bx8 with zero2_offload leads to OOM. @LinB203 The DeepSpeed ZeRO-3 problem in PKU-YuanGroup/Video-LLaVA#48 has already been fixed and merged into the master branch; I tested with the master branch.

Can you run DeepSpeed's MoE with ZeRO-3?

Do you mean running finetune_moe.sh directly?

When I switched to zero2_offload, the same problem occurred: training gets stuck after 270 steps. Strangely, once I remove the video data everything runs fine. Why is that?

awzhgw commented 9 months ago

@LinB203 Running Mixtral 7Bx8 with zero2_offload leads to OOM. @LinB203 The DeepSpeed ZeRO-3 problem in PKU-YuanGroup/Video-LLaVA#48 has already been fixed and merged into the master branch; I tested with the master branch.

Can you run DeepSpeed's MoE with ZeRO-3?

But I can run the Mixtral 7Bx8 model directly with DeepSpeed ZeRO-3. I have already verified this; the code below runs without problems.

import argparse
import math

import torch
from torch.utils.data import DataLoader, Dataset

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin, DummyOptim, DummyScheduler, set_seed

from deepspeed.accelerator import get_accelerator
from deepspeed.utils import set_z3_leaf_modules, get_z3_leaf_modules  # ZeRO-3 leaf-module API for MoE blocks

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, get_scheduler
from transformers.integrations import is_deepspeed_zero3_enabled
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

MAX_GPU_BATCH_SIZE = 4

class RandomDataset(Dataset):
    """Synthetic dataset of random token ids, just enough to exercise the training loop."""

    def __init__(self, num_samples: int = 1000, max_length: int = 2048, vocab_size: int = 100, tokenizer=None):
        self.num_samples = num_samples
        self.max_length = max_length
        self.input_ids = torch.randint(2, vocab_size, (num_samples, max_length))
        self.attention_mask = torch.ones_like(self.input_ids)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.input_ids[idx],
        }

def training_function(args):
    get_accelerator().set_device(args.local_rank)

    # Initialize accelerator
    deepPlugin = DeepSpeedPlugin(hf_ds_config=args.conf, zero3_init_flag=True)
    accelerator = Accelerator(mixed_precision='bf16', deepspeed_plugin=deepPlugin, gradient_accumulation_steps=1)

    # Hyper-parameters for the smoke test
    lr = 2e-5
    num_epochs = 2000000
    seed = 42
    batch_size = 16
    warmup_ratio = 0.03
    set_seed(seed)

    model_id = args.model_path

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    dataset = RandomDataset(num_samples=10000, tokenizer=tokenizer)
    train_dataloader = DataLoader(
        dataset, shuffle=True, collate_fn=None, batch_size=batch_size, drop_last=True
    )

    if accelerator.is_main_process:
        print(f'before prepare dataloader len: {len(train_dataloader)}')

    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / accelerator.gradient_accumulation_steps)
    max_train_steps = num_epochs * num_update_steps_per_epoch

    config = AutoConfig.from_pretrained(model_id)
    config.num_hidden_layers = 1  # shrink to a single layer for a fast smoke test
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        config=config,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=(not is_deepspeed_zero3_enabled())
    )

    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
    model.enable_input_require_grads()
    model.config.use_cache = False  # turn off when gradient checkpointing is enabled
    print("Gradient checkpointing enabled.")

    set_z3_leaf_modules(model, [MixtralSparseMoeBlock])  # gather each sparse MoE block's params as one ZeRO-3 unit
    print('z3 leaf modules:', get_z3_leaf_modules(model))
    model.train()

    optimizer_cls = (
        torch.optim.AdamW
        if accelerator.state.deepspeed_plugin is None
           or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
        else DummyOptim
    )

    optimizer = optimizer_cls(params=model.parameters(), lr=lr)

    if (
            accelerator.state.deepspeed_plugin is None
            or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
    ):
        lr_scheduler = get_scheduler(
            name='linear',
            optimizer=optimizer,
            num_warmup_steps=math.ceil(max_train_steps * warmup_ratio),
            num_training_steps=max_train_steps,
        )
    else:
        lr_scheduler = DummyScheduler(
            optimizer, total_num_steps=max_train_steps, warmup_num_steps=math.ceil(max_train_steps * warmup_ratio)
        )

    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )

    # Now we train the model
    for epoch in range(num_epochs):
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(model):
                model.train()
                outputs = model(**batch)
                loss = outputs.loss
                accelerator.backward(loss)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
            if accelerator.is_main_process and step % 10 == 0:
                print(f'epoch: {epoch}, step: {step}, loss: {loss.item()}')

def main():
    parser = argparse.ArgumentParser(description="Simple example of training script.")
    parser.add_argument(
        "--model_path",
        type=str,
        default="/export/App/training_platform/PinoModel/Mixtral-8x7B-Instruct-v0.1", )
    parser.add_argument(
        "--mixed_precision",
        type=str,
        default="bf16",
        choices=["no", "fp16", "bf16", "fp8"],
        help="Whether to use mixed precision. Choose"
             "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
             "and an Nvidia Ampere GPU.",
    )
    parser.add_argument(
        "--conf",
        type=str,
        default="./scripts/ds_conf.json",
    )
    parser.add_argument(
        "--local_rank",
        type=int,
        default=-1,
    )
    args = parser.parse_args()
    training_function(args)

if __name__ == "__main__":
    main()
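
(For reference, a standalone script like this is typically launched with the DeepSpeed launcher, e.g. deepspeed test_mixtral.py --conf ./scripts/ds_conf.json, which supplies the --local_rank argument the script parses; the script file name here is hypothetical.)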
LinB203 commented 9 months ago

Our MoE implementation is different from HF's Mixtral implementation. The MoE implemented by DeepSpeed cannot run with ZeRO-3, only ZeRO-2.
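
For context, DeepSpeed's MoE wraps a template expert module in deepspeed.moe.layer.MoE rather than implementing routing inside the transformer block the way HF's MixtralSparseMoeBlock does. A minimal sketch (the hyper-parameters are illustrative, not MoE-LLaVA's actual settings, and a distributed DeepSpeed run is assumed):

import torch.nn as nn
from deepspeed.moe.layer import MoE

hidden_size = 1024

# Template FFN; DeepSpeed replicates it once per expert.
expert = nn.Sequential(
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

# Top-2 routing over 4 experts; the forward pass returns (output, aux_loss, exp_counts).
moe_block = MoE(hidden_size=hidden_size, expert=expert, num_experts=4, k=2)

Because experts are executed conditionally per token, ZeRO-3's per-forward parameter all-gathers can end up mismatched across ranks, which is presumably why only ZeRO-2 is supported here.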

awzhgw commented 9 months ago

@LinB203 But when I swap the backbone model to Mixtral 7Bx8, why does training run normally once the video data is deleted? Why is there no problem when the data is all images? My repository is here: https://github.com/awzhgw/MoE-LLaVA.git

awzhgw commented 9 months ago

Also, after switching pretrain.sh to zero2.json or zero2_offload.json, training gets stuck after 270 steps in both cases and then fails with an NCCL timeout (with both video and image data present).

But once I remove the video data, it runs normally.
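
A plausible explanation, for what it's worth: in mixed image/video batches, a rank whose micro-batch happens to contain no videos never runs the video tower, while other ranks do; the ranks then wait on mismatched NCCL collectives (parameter all-gathers under ZeRO-3, gradient reductions under ZeRO-2) until the timeout fires, which would also explain why image-only data is fine. A hedged sketch of the usual mitigation; encode_videos_sync_safe is a hypothetical helper and the dummy clip shape is assumed:

import torch

def encode_videos_sync_safe(video_tower, videos, device, dtype):
    """Run the video tower on every rank, substituting a dummy clip when the
    local batch has no videos, so collective ops stay aligned across ranks."""
    if videos is None or len(videos) == 0:
        # Dummy 8-frame clip with an assumed input shape of (B, T, C, H, W).
        dummy = torch.zeros(1, 8, 3, 224, 224, device=device, dtype=dtype)
        return video_tower(dummy) * 0.0  # participate in collectives, contribute nothing
    return video_tower(videos)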