PKU-YuanGroup / Video-LLaVA

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

I want to train a Mixtral 8x7B model with the Video-LLaVA framework #67

Open awzhgw opened 8 months ago

awzhgw commented 8 months ago

I want to train a Mixtral 8x7B model with the Video-LLaVA framework.

After adapting the code, I ran into the following problem:

  1. Out of GPU memory. Running Video-LLaVA on Mixtral 8x7B with H800 cards fails with an out-of-memory error.

That is because Mixtral 8x7B has roughly 46B parameters, while Vicuna 7B has only 7B. How can I work around this?

Can I pretrain in 4-bit precision to solve this?
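
(For reference, a minimal sketch of 4-bit loading via bitsandbytes through transformers, assuming the Mixtral backbone is what gets quantized; the checkpoint name and the idea of keeping the vision towers and mm_projector in bf16 are assumptions, not something Video-LLaVA supports out of the box.)

```python
# Hypothetical sketch: load the Mixtral backbone in 4-bit (NF4) to cut weight memory.
# Assumes a QLoRA-style setup; the vision towers and mm_projector would still be
# handled separately by the Video-LLaVA training code.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",   # assumed checkpoint name
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,       # non-quantized modules stay in bf16
)
```

Note that 4-bit weights are only practical together with LoRA/QLoRA adapters on top; full-parameter pretraining of frozen 4-bit weights is not supported by bitsandbytes.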

awzhgw commented 8 months ago

@LinB203 As far as I remember, Video-LLaVA pretrains in 16-bit precision by default. Can I switch to 4-bit training to reduce the GPU memory footprint?

LinB203 commented 8 months ago

Many people are currently reporting OOM, but I cannot reproduce it after re-pulling the code. I suspect the difference lies in the system environment. I am looking into this issue.

LinB203 commented 8 months ago

I uploaded zero2_offload.json; you can try --deepspeed ./scripts/zero2_offload.json. Feel free to let me know of any updates.
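
(For readers following along: a minimal sketch of what a ZeRO-2 config with optimizer CPU offload typically looks like under the HF Trainer/DeepSpeed integration. The repository's actual scripts/zero2_offload.json is the authoritative version and may differ.)

```python
# Hypothetical reconstruction of a ZeRO-2 + optimizer-offload config; the real
# scripts/zero2_offload.json shipped with the repo should be preferred.
import json

ds_config = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # moves Adam state to host RAM
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("scripts/zero2_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```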

awzhgw commented 8 months ago

@LinB203 I am using the Mixtral 8x7B model. I modified the Video-LLaVA code to adapt it to Mixtral 8x7B, and that is what triggers the OOM; with the Vicuna 7B model there is no OOM at all. Would it be convenient to connect on WeChat?

awzhgw commented 8 months ago

@LinB203 With zero2_offload.json it still crashes:

Traceback (most recent call last):
  File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train_mem.py", line 21, in <module>
    train()
  File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train.py", line 1193, in train
    trainer.train()
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/accelerate/accelerator.py", line 1198, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 267, in __init__
    self._configure_distributed_model(model)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1048, in _configure_distributed_model
    self.module.to(self.device)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 4 has a total capacity of 79.11 GiB of which 18.69 MiB is free. Process 158998 has 79.08 GiB memory in use. Of the allocated memory 78.33 GiB is allocated by PyTorch, and 244.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-08 13:34:24,546] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1421135
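
(Side note: the error text above points at PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of applying that hint follows, but it only mitigates allocator fragmentation; it cannot make ~46B parameters fit on a single 80 GB card.)

```python
# Allocator hint from the error message above; reduces fragmentation only and
# does not help when the model weights themselves exceed the GPU's capacity.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # must be set before the first CUDA allocation

import torch  # imported after the env var so the caching allocator picks it up
```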

LinB203 commented 8 months ago

How many GPUs are you using? It looks like the crash happens during DeepSpeed initialization (while moving the model to the device). In that case the failure is generally independent of the training batch, so batch size = 1 will not change the result.

awzhgw commented 8 months ago

My machine has 8 H800 GPUs. The launch script looks like this:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2_offload.json \
    --model_name_or_path ${INPUT_MODEL_PATH} \
    --version mixtral \
    --data_path ${DATA_ROOT}/train_json/pretrain/valley_llavaimage.json \
    --video_folder ${DATA_ROOT} \
    --image_folder ${DATA_ROOT} \
    --X "Video" "Image" \
    --video_tower ${VIDEO_TOWER_PATH} \
    --image_tower ${IMAGE_TOWER_PATH} \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_x_start_end False \
    --mm_use_x_patch_token False \
    --bf16 True \
    --output_dir ${ChubaoFS_ROOT}/omni/checkpoint/omni-LLaVA-Pretrain-7B \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard

awzhgw commented 8 months ago

@LinB203 The Mixtral 8x7B model needs about 100 GB of GPU memory, but my cards are H800s. Does the Video-LLaVA framework support model parallelism? For example: GPUs 0 and 1 hold one Mixtral 8x7B replica, GPUs 2 and 3 a second, GPUs 4 and 5 a third, and GPUs 6 and 7 a fourth.
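
(For context, the usual way to split one checkpoint across a pair of GPUs outside of DeepSpeed is the accelerate-style device_map in transformers; the sketch below is about loading, not something the Video-LLaVA trainer does for you, and the checkpoint name and memory caps are assumptions.)

```python
# Hypothetical sketch: shard one Mixtral 8x7B checkpoint across GPUs 0 and 1 by
# capping per-device memory. Combining this layout with DeepSpeed training is a
# separate problem that ZeRO data parallelism does not solve by itself.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",        # assumed checkpoint name
    torch_dtype=torch.bfloat16,
    device_map="auto",                    # let accelerate place layers across devices
    max_memory={0: "70GiB", 1: "70GiB"},  # restrict placement to two 80 GB cards
)
```

Training four such two-GPU replicas in data parallel would additionally require pipeline or tensor parallelism support in the trainer, which plain ZeRO does not provide.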

awzhgw commented 8 months ago

@LinB203 The Mixtral 8x7B model needs about 100 GB of GPU memory. I want to adapt LLaVA to Mixtral 8x7B, but an H800 GPU only has 80 GB.

So deepspeed with ../scripts/zero2.json goes OOM.

deepspeed with ../scripts/zero3.json can run, but it is very, very slow.

How can I resolve this?

LinB203 commented 8 months ago

How about zero2_offload?

awzhgw commented 8 months ago

@LinB203

Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.9910638332366943 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8254106044769287 seconds
Traceback (most recent call last):
  File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train_mem.py", line 21, in <module>
    train()
  File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train.py", line 1191, in train
    trainer.train()
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/accelerate/accelerator.py", line 1198, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 267, in __init__
    self._configure_distributed_model(model)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1048, in _configure_distributed_model
    self.module.to(self.device)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 2 has a total capacity of 79.11 GiB of which 18.69 MiB is free. Process 3425768 has 79.08 GiB memory in use. Of the allocated memory 78.33 GiB is allocated by PyTorch, and 244.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-08 16:17:55,409] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648081
[2024-01-08 16:18:04,889] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648082
[2024-01-08 16:18:08,382] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648083
[2024-01-08 16:18:08,383] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648084
[2024-01-08 16:18:13,057] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648085

This is with zero2_offload.json; it still crashes.

LinB203 commented 8 months ago

However, this is not a problem with Video-LLaVA. Mixtral-MoE may simply need ZeRO-3 to run. Compressing the video tokens may also speed things up.
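
(A minimal ZeRO-3 sketch under the same HF "auto" conventions as the earlier config; stage 3 shards the parameters themselves across the 8 GPUs, which is what lets a ~46B-parameter model fit, at the cost of the gather traffic that makes it slower.)

```python
# Hypothetical ZeRO-3 config sketch; any zero3.json actually shipped with the
# repository is the authoritative version.
zero3_config = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,                 # shard params, grads and optimizer state across ranks
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
# This dict can be passed to TrainingArguments(deepspeed=zero3_config) or dumped to JSON
# and referenced via --deepspeed on the command line.
```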

awzhgw commented 8 months ago

@LinB203 It is a DeepSpeed-on-Mixtral bug; it may be this one: https://github.com/hiyouga/LLaMA-Factory/issues/1998

LinB203 commented 7 months ago

Hi, we have reorganized the code and now support LoRA fine-tuning; see finetune_lora.sh. Unfortunately we still can't use ZeRO-3, and we suspect that DeepSpeed doesn't handle the load imbalance between GPUs very well.
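
(For anyone adapting this to Mixtral by hand, a minimal PEFT sketch of the kind of LoRA wrapping that finetune_lora.sh relies on; the checkpoint name, rank, and target module names are assumptions, not the repository's exact settings.)

```python
# Hypothetical LoRA setup sketch; finetune_lora.sh configures this through the
# training script rather than by hand, and the exact target modules may differ.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")  # assumed checkpoint

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only; an assumption
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```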

manushree635 commented 5 months ago

> Hi, we have reorganized the code and now support LoRA fine-tuning; see finetune_lora.sh. Unfortunately we still can't use ZeRO-3, and we suspect that DeepSpeed doesn't handle the load imbalance between GPUs very well.

Have you been able to fix ZeRO-3? I'm hitting an error around get_peft_model().