awzhgw opened 8 months ago
@LinB203 As I recall, Video-LLaVA pretrains in 16-bit precision by default. Can I switch to 4-bit training to reduce GPU memory usage?
Many people have reported OOM errors, but I don't hit the problem after re-pulling the code. I suspect the difference is in the system environment, and I'm investigating.
I uploaded zero2_offload.json; you can try --deepspeed ./scripts/zero2_offload.json. Feel free to let me know of any updates.
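For readers who don't have the file, a ZeRO-2 config with optimizer offload typically looks like the sketch below. This is an assumption about its contents, not a copy of the repository's zero2_offload.json; the keys are standard DeepSpeed config fields, and the "auto" values rely on the Hugging Face Trainer integration filling them in from the command-line flags.

```python
import json

# Hypothetical ZeRO-2 + CPU optimizer offload config; the repository's
# actual zero2_offload.json may differ in its values.
zero2_offload = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer states in pinned host memory (uses the cpu_adam op).
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8,
    },
}

# Serialize; save this text as a .json file and pass it via --deepspeed.
config_text = json.dumps(zero2_offload, indent=2)
print(config_text)
```

Note that ZeRO-2 offload moves optimizer states and the optimizer step to the CPU, but every GPU still holds a full copy of the bf16 weights, so it cannot help when the weights alone exceed one GPU's memory.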
@LinB203 I'm using the Mixtral 8x7B model. I adapted the Video-LLaVA code to work with Mixtral 8x7B, and that is what causes the OOM; with Vicuna-7B there is no OOM. Mind adding me on WeChat?
@LinB203 With zero2_offload.json it still crashes:
Traceback (most recent call last):
File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train_mem.py", line 21, in
How many GPUs are you using? It looks like it crashes during DeepSpeed initialization, which is generally model-independent, so setting the batch size to 1 won't change the result.
I'm on H800s, 8 GPUs. The launch script is:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2_offload.json \
    --model_name_or_path ${INPUT_MODEL_PATH} \
    --version mixtral \
    --data_path ${DATA_ROOT}/train_json/pretrain/valley_llavaimage.json \
    --video_folder ${DATA_ROOT} \
    --image_folder ${DATA_ROOT} \
    --X "Video" "Image" \
    --video_tower ${VIDEO_TOWER_PATH} \
    --image_tower ${IMAGE_TOWER_PATH} \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_x_start_end False \
    --mm_use_x_patch_token False \
    --bf16 True \
    --output_dir ${ChubaoFS_ROOT}/omni/checkpoint/omni-LLaVA-Pretrain-7B \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
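With these flags, the number of samples per optimizer step is per-device batch × gradient accumulation steps × number of GPUs. The arithmetic below uses the values from the script; the per-device batch of 32 is what drives activation memory on each GPU.

```python
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step under data parallelism."""
    return per_device * grad_accum * num_gpus

# Values from the launch script above (8 visible GPUs).
print(global_batch_size(per_device=32, grad_accum=2, num_gpus=8))  # 512
# Same global batch with a quarter of the per-GPU activations:
print(global_batch_size(per_device=8, grad_accum=8, num_gpus=8))   # 512
```

Trading per-device batch for gradient accumulation keeps the optimization setup roughly the same while lowering peak activation memory, but it does not reduce the memory taken by the weights themselves.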
@LinB203 The Mixtral 8x7B model needs about 100 GB of GPU memory, but I'm on H800 cards. Does the Video-LLaVA framework support model parallelism? For example: GPUs 0,1 load one Mixtral 8x7B replica, GPUs 2,3 another, GPUs 4,5 another, and GPUs 6,7 another.
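The layout being asked about is data parallelism across 4 replicas, with each replica split layer-wise over 2 GPUs. A minimal sketch of that assignment, assuming Mixtral 8x7B's 32 decoder layers are split evenly; `replica_groups` and `split_layers` are hypothetical helpers, not part of Video-LLaVA or DeepSpeed.

```python
def replica_groups(num_gpus: int, gpus_per_replica: int) -> list[list[int]]:
    """Split GPU ids into consecutive groups, one group per model replica."""
    if num_gpus % gpus_per_replica:
        raise ValueError("num_gpus must be divisible by gpus_per_replica")
    return [list(range(i, i + gpus_per_replica))
            for i in range(0, num_gpus, gpus_per_replica)]

def split_layers(num_layers: int, devices: list[int]) -> dict[int, int]:
    """Map each layer index to a device, in contiguous, near-equal blocks."""
    per_dev, extra = divmod(num_layers, len(devices))
    mapping, layer = {}, 0
    for i, dev in enumerate(devices):
        count = per_dev + (1 if i < extra else 0)
        for _ in range(count):
            mapping[layer] = dev
            layer += 1
    return mapping

groups = replica_groups(num_gpus=8, gpus_per_replica=2)
print(groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
layout = split_layers(num_layers=32, devices=groups[1])
print(layout[0], layout[15], layout[16], layout[31])  # 2 2 3 3
```

ZeRO-3 already shards parameters across all 8 GPUs without a manual split like this, at the cost of extra communication to gather weights during each forward and backward pass.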
@LinB203 Mixtral 8x7B needs about 100 GB of GPU memory. I want to adapt LLaVA to Mixtral 8x7B, but an H800 has only 80 GB.
So deepspeed ../scripts/zero2.json runs out of memory, while deepspeed ../scripts/zero3.json runs but is very, very slow.
How can I resolve this?
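A back-of-envelope estimate shows why ZeRO-2 runs out of memory here while ZeRO-3 fits. It follows the ZeRO paper's accounting for mixed-precision Adam (2 bytes per parameter for bf16 weights, 2 for gradients, 12 for the fp32 master copy plus two Adam moments), except that only trainable parameters get gradients and optimizer states. It ignores activations, buffers and fragmentation; 46.7B is Mixtral 8x7B's total parameter count, and the ~20M trainable parameters for projector-only pretraining is an assumed placeholder.

```python
def zero_model_state_gb(total_params: float, trainable_params: float,
                        stage: int, num_gpus: int) -> float:
    """Per-GPU memory (GB) for weights, gradients and Adam states under ZeRO.

    All params are stored as bf16 (2 B each); trainable params additionally
    need bf16 grads (2 B) and fp32 master weights + Adam moments (12 B).
    """
    weights = 2 * total_params
    grads = 2 * trainable_params
    optim = 12 * trainable_params
    if stage >= 1:
        optim /= num_gpus    # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus    # stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus  # stage 3 also shards the weights
    return (weights + grads + optim) / 1e9

TOTAL = 46.7e9     # Mixtral 8x7B total parameters
TRAINABLE = 20e6   # assumed: only the mm projector is trained in pretraining
for stage in (2, 3):
    print(stage, round(zero_model_state_gb(TOTAL, TRAINABLE, stage, 8), 1))
# stage 2 -> ~93.4 GB per GPU, stage 3 -> ~11.7 GB per GPU
```

Under ZeRO-2 every GPU keeps the full ~93 GB of bf16 weights, which exceeds 80 GB regardless of batch size, and offloading optimizer states does not change that term. ZeRO-3 shards the weights but has to all-gather them layer by layer on every step, which is a likely source of the slowdown.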
How about zero2_offload?
@LinB203
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.9910638332366943 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8254106044769287 seconds
Traceback (most recent call last):
File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train_mem.py", line 21, in
This is with zero2_offload.json; it still crashes.
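The `cpu_adam cuda is missing or is incompatible with installed torch` warning in the log means DeepSpeed could only build the CPU side of that op. Likely causes (assumptions, not confirmed in this thread) are that no CUDA toolkit is visible to DeepSpeed, or that its version differs from the one PyTorch was built with; DeepSpeed's `ds_report` command prints op compatibility directly. Since cpu_adam still loaded, the warning may also be unrelated to the crash. The helper below, whose name is hypothetical, only compares major.minor versions.

```python
import re
from typing import Optional

def cuda_versions_match(torch_cuda: Optional[str], nvcc_output: str) -> bool:
    """Return True if torch's CUDA build and the nvcc toolkit share major.minor.

    torch_cuda: the value of torch.version.cuda (None for CPU-only builds).
    nvcc_output: stdout of `nvcc --version` (empty if nvcc is unavailable).
    """
    found = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if torch_cuda is None or found is None:
        return False  # no CUDA-enabled torch, or no usable toolkit found
    major, minor = torch_cuda.split(".")[:2]
    return (major, minor) == found.groups()

# Usage on the training machine (needs torch installed and nvcc on PATH):
#   import subprocess, torch
#   out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
#   print(cuda_versions_match(torch.version.cuda, out))
```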
However, this is not a problem with Video-LLaVA itself. Mixtral-MoE may simply need ZeRO-3 to run. Compressing the video tokens might speed things up.
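One way to compress video tokens is to average-pool each frame's patch grid before it reaches the projector. The sketch assumes 8 frames, each a 16×16 grid of 1024-dim features, and a 2×2 pooling window; these shapes are illustrative assumptions, and this is not Video-LLaVA's code.

```python
import numpy as np

def pool_video_tokens(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Average-pool each frame's patch grid and flatten to a token sequence.

    feats: (frames, height, width, channels) patch features.
    Returns (frames * (height // window) * (width // window), channels).
    """
    t, h, w, c = feats.shape
    if h % window or w % window:
        raise ValueError("grid size must be divisible by the window")
    # Split each spatial axis into (blocks, window) and average inside blocks.
    pooled = feats.reshape(t, h // window, window, w // window, window, c)
    pooled = pooled.mean(axis=(2, 4))
    return pooled.reshape(-1, c)

video = np.random.rand(8, 16, 16, 1024).astype(np.float32)
print(video.shape[0] * 16 * 16, pool_video_tokens(video).shape)  # 2048 (512, 1024)
```

Self-attention cost grows quadratically with sequence length, so 4× fewer video tokens cuts the video-to-video attention cost by roughly 16×; the trade-off is coarser spatial detail per frame.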
@LinB203 It's a DeepSpeed bug with Mixtral; it may be this one: https://github.com/hiyouga/LLaMA-Factory/issues/1998
Hi, we reorganized the code and now support LoRA fine-tuning; see finetune_lora.sh. Unfortunately we still can't use ZeRO-3, and we suspect that DeepSpeed doesn't handle load imbalance between GPUs very well.
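The suspected load imbalance comes from Mixtral's sparse routing: each token goes to its top-2 of 8 experts, so the number of tokens per expert varies from batch to batch, and each GPU may activate a different set of experts. The toy count below uses random router logits (an assumption; real logits come from the trained gate) only to show how uneven and rank-dependent the counts can be.

```python
import numpy as np

def expert_token_counts(router_logits: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Count how many tokens select each expert under top-k routing.

    router_logits: (num_tokens, num_experts) gate scores.
    """
    num_experts = router_logits.shape[1]
    # Indices of the top_k highest-scoring experts for every token.
    chosen = np.argsort(router_logits, axis=1)[:, -top_k:]
    return np.bincount(chosen.ravel(), minlength=num_experts)

rng = np.random.default_rng(0)
for rank in range(2):  # pretend two GPUs each route their own micro-batch
    logits = rng.normal(size=(16, 8))
    print(f"rank {rank}:", expert_token_counts(logits))
```

Under ZeRO-3, parameters are gathered per module on demand, so ranks that execute different experts can end up issuing different collective calls; that is one plausible way this imbalance turns into hangs or slowdowns.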
Have you been able to fix ZeRO-3? I'm getting an error in "get_peft_model()".
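I want to use the Video-LLaVA framework to train with the Mixtral 8x7B large model.
After the adaptation, I ran into the following problem:
That's because Mixtral 8x7B has about 46B parameters, while Vicuna-7B has only 7B. How should I solve this?
Can I pretrain in 4-bit during the pretraining stage to solve this?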
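Weight memory scales with bytes per parameter, which makes the 7B vs 46B gap concrete: roughly 93 GB of bf16 weights for Mixtral 8x7B versus about 23 GB in 4-bit. The sketch counts weights only (no activations, gradients or optimizer states) and uses 46.7B for Mixtral 8x7B. In practice, 4-bit loading (for example QLoRA-style via bitsandbytes) keeps the quantized LLM frozen while the projector or LoRA adapters train in bf16; whether this codebase's pretraining script supports that is not established in this thread.

```python
# Storage cost of one parameter in each format (4-bit packs two per byte).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "4bit": 0.5}

def weight_gb(num_params: float, dtype: str) -> float:
    """Memory for the model weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in (("Vicuna-7B", 7e9), ("Mixtral 8x7B", 46.7e9)):
    sizes = {d: round(weight_gb(params, d), 1) for d in BYTES_PER_PARAM}
    print(name, sizes)
```

Quantization shrinks the frozen weights, but activation memory (driven by batch size, sequence length and the number of video tokens) is unaffected and still has to fit alongside them.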