PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

Is llava-llama MoE supported? #9

Open DietDietDiet opened 7 months ago

DietDietDiet commented 7 months ago

Hi, have you tested the results for the llava_llama version? Would an extra MoE stage improve the original LLaVA results?

LinB203 commented 7 months ago

Great choice. Work in progress!

DietDietDiet commented 7 months ago

Can I test with the latest code just by changing the version to 'v1' and using a trained LLaVA model?

LinB203 commented 7 months ago

I think this should work.

DietDietDiet commented 7 months ago

Is there any way to insert MoE layers into only some of the LLM layers? I found that converting all layers of the 13B model cannot fit into a 40G A100.

LinB203 commented 7 months ago

> Is there any way to insert MoE layers into only some of the LLM layers? I found that converting all layers of the 13B model cannot fit into a 40G A100.

For example, if you want to insert MoE layers in the first and third layers, you can pass `--moe_layers_idx 0 2` in your command.
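As a rough illustration (the flag names and training entry point are taken from the commands discussed later in this thread; the expert counts, paths, and remaining flags are placeholders, not a verified recipe), converting only layers 0 and 2 might look like:

```bash
# Hedged sketch: turn only LLM layers 0 and 2 into MoE layers.
# Values below are placeholders; adapt them to your own setup.
deepspeed moellava/train/train_mem.py \
    --moe_enable True \
    --num_experts 4 --top_k_experts 2 \
    --moe_layers_idx 0 2 \
    --model_name_or_path /path/to/pretrained-llava \
    --version v1 \
    --deepspeed ./scripts/zero2.json
```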

DietDietDiet commented 7 months ago

I used a pretrained LLaVA to initialize MoE-LLaVA and passed in the `moe_layers_idx` params, but encountered the following error:

`AssertionError: The model has moe layers, but None of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating optimizer`

Are any additional modifications needed to solve this?

LinB203 commented 7 months ago

> I used a pretrained LLaVA to initialize MoE-LLaVA and passed in the `moe_layers_idx` params, but encountered the following error: `AssertionError: The model has moe layers, but None of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating optimizer`. Are any additional modifications needed to solve this?

Here is the solution. https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/17
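For context, the assertion is DeepSpeed's optimizer check: any parameter group that contains expert weights has to carry a `'moe': True` key before the optimizer is built. Below is a minimal sketch of the usual pattern, assuming your DeepSpeed version provides `split_params_into_different_moe_groups_for_optimizer` (see the linked issue for the fix actually used in this repo):

```python
# Hedged sketch: tag expert parameters so DeepSpeed's MoE optimizer check passes.
# Verify the helper's name and signature against your installed DeepSpeed version.
from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer


def build_param_groups(model, weight_decay=0.0):
    # Start from a single ordinary group containing all trainable parameters...
    param_groups = [{
        "params": [p for p in model.parameters() if p.requires_grad],
        "weight_decay": weight_decay,
    }]
    # ...then let DeepSpeed split the expert parameters into separate groups
    # marked with 'moe': True, which is exactly what the assertion checks for.
    return split_params_into_different_moe_groups_for_optimizer(param_groups)


# optimizer = torch.optim.AdamW(build_param_groups(model), lr=2e-5)
```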

DietDietDiet commented 7 months ago

I found it really weird that even when I set a minimal number of experts & MoE layers, MoE-LLaMA still cannot fit into a 40G A100. Here are the trainable modules I modified according to LLaMA: `--train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg`. Could you provide a sample script for the final MoE stage for LLaVA-1.5?

LinB203 commented 7 months ago

> I found it really weird that even when I set a minimal number of experts & MoE layers, MoE-LLaMA still cannot fit into a 40G A100. Here are the trainable modules I modified according to LLaMA: `--train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg`. Could you provide a sample script for the final MoE stage for LLaVA-1.5?

You can enable flash_attn2 and try again. Refer to this issue: https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/25#issuecomment-1926419338

Btw, how many GPUs are you using?
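For reference, here is a sketch of what enabling FlashAttention-2 at load time can look like with Hugging Face `from_pretrained` (the import path and checkpoint path are assumptions for illustration; the next reply applies the same change inside builder.py):

```python
# Hedged sketch: load the LLM with FlashAttention-2 to reduce attention memory.
# Requires flash-attn to be installed and a transformers version that accepts
# attn_implementation in from_pretrained.
import torch
from moellava.model.language_model.llava_llama import LlavaLlamaForCausalLM  # import path assumed

model = LlavaLlamaForCausalLM.from_pretrained(
    "/path/to/pretrained-llava",              # placeholder checkpoint path
    torch_dtype=torch.bfloat16,               # half precision; fp32 would negate the savings
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
)
```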

DietDietDiet commented 7 months ago

I modified `model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation="flash_attention_2", **kwargs)` in builder.py, but it still OOMs. I'm using 8×40G A100s.

LinB203 commented 7 months ago

> I modified `model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation="flash_attention_2", **kwargs)` in builder.py, but it still OOMs. I'm using 8×40G A100s.

Could you post your command?

DietDietDiet commented 7 months ago

`moe_mode="sparse" num_experts=1 top_k_experts=1 use_residual=False router_aux_loss_coef=0.01 JSON_FOLDER="ft_json" IMAGE_FOLDER="train_image_video"

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \ --moe_enable False --num_experts ${num_experts} --top_k_experts ${top_k_experts} --capacity_factor 1.5 \ --moe_layers_idx 0 5 10 \ --moe_mode ${moe_mode} --use_residual ${use_residual} --router_aux_loss_coef ${router_aux_loss_coef} \ --train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg \ --deepspeed ./scripts/zero2.json \ --model_name_or_path $(pretrained llava weight) \ --version v1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 16 \ --gradient_accumulation_steps 16 `

The rest remains consistent with llava

LinB203 commented 7 months ago

`moe_mode="sparse" num_experts=1 top_k_experts=1 use_residual=False router_aux_loss_coef=0.01 JSON_FOLDER="ft_json" IMAGE_FOLDER="train_image_video"

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py --moe_enable False --num_experts ${num_experts} --top_k_experts ${top_k_experts} --capacity_factor 1.5 --moe_layers_idx 0 5 10 --moe_mode ${moe_mode} --use_residual ${use_residual} --router_aux_loss_coef ${router_aux_loss_coef} --train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg --deepspeed ./scripts/zero2.json --model_name_or_path $(pretrained llava weight) --version v1 --per_device_train_batch_size 1 --per_device_eval_batch_size 16 --gradient_accumulation_steps 16 `

The rest remains consistent with llava

We will check it later. Could you try another model, such as Phi or StableLM?

DietDietDiet commented 7 months ago

OK, but the point for me is to test the effect of an extra MoE stage on an already-trained model, so I am currently working with my trained LLaVA.
