DietDietDiet opened this issue 9 months ago
Great choice. Work in progress!
Can I use the latest code to test a trained LLaVA model just by setting the version to 'v1'?
I think this should work.
Is there any way to insert MoE layers into only some of the LLM layers? I found that converting all layers of the 13B model cannot fit into a 40G A100.
For example, if you want to insert MoE layers into the first and third layers, you can pass `--moe_layers_idx 0 2` in your command.
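For intuition about what that flag does, here is a hypothetical sketch (not the repository's actual code) of how a `moe_layers_idx` list can be used to convert only the selected decoder layers to MoE. It assumes a LLaMA-style model (`model.model.layers`, `layer.mlp`) and DeepSpeed's MoE layer, with distributed/DeepSpeed already initialized as it is inside the training script:

```python
# Hypothetical sketch: wrap only the decoder layers listed in moe_layers_idx
# with a DeepSpeed MoE block, leaving every other layer's dense MLP untouched.
import copy
import torch.nn as nn
from deepspeed.moe.layer import MoE


class MoEMLPWrapper(nn.Module):
    """DeepSpeed's MoE forward returns (output, aux_loss, exp_counts); the
    decoder layer expects a plain tensor, so keep the aux loss on the module."""

    def __init__(self, moe_block):
        super().__init__()
        self.moe_block = moe_block
        self.aux_loss = None

    def forward(self, hidden_states):
        output, aux_loss, _ = self.moe_block(hidden_states)
        self.aux_loss = aux_loss  # collected later and added to the LM loss
        return output


def insert_moe_layers(model, moe_layers_idx, num_experts=4, top_k=2):
    hidden_size = model.config.hidden_size
    for idx, layer in enumerate(model.model.layers):
        if idx not in moe_layers_idx:
            continue  # e.g. --moe_layers_idx 0 2 converts only layers 0 and 2
        expert = copy.deepcopy(layer.mlp)  # reuse the dense MLP as the expert template
        layer.mlp = MoEMLPWrapper(
            MoE(
                hidden_size=hidden_size,
                expert=expert,
                num_experts=num_experts,
                k=top_k,
                capacity_factor=1.5,
                use_residual=False,
            )
        )
    return model
```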
I used a pretrained LLaVA to initialize MoE-LLaVA and passed the moe_layers_idx param, but I ran into the following error:

AssertionError: The model has moe layers, but None of the param groups are marked as MoE. Create a param group with 'moe' key set to True before creating optimizer

Are any additional modifications needed to solve this?
Here is the solution. https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/17
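For readers hitting the same assertion: DeepSpeed requires at least one optimizer param group tagged with `'moe': True` when the model contains MoE layers. Below is a minimal illustration of that kind of fix using DeepSpeed's helper; it is a sketch, not the exact patch from issue #17, and the function name `build_moe_optimizer` and the group name are placeholders:

```python
import torch
from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer


def build_moe_optimizer(model, lr=2e-5, weight_decay=0.0):
    # Start from an ordinary param group (weight-decay splitting etc. omitted here).
    param_groups = [{
        "name": "trainable_params",
        "params": [p for p in model.parameters() if p.requires_grad],
        "weight_decay": weight_decay,
    }]

    # DeepSpeed helper: pulls MoE expert parameters out into their own groups
    # and tags them with 'moe': True, which is what the failing assertion checks for.
    param_groups = split_params_into_different_moe_groups_for_optimizer(param_groups)

    # The resulting optimizer is then handed to deepspeed.initialize(..., optimizer=...).
    return torch.optim.AdamW(param_groups, lr=lr)
```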
I find it really weird that even when I set a minimal number of experts and MoE layers, MoE-LLaMA still cannot fit into a 40G A100. Here are the trainable modules I modified, following LLaMA:

`--train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg`

Could you provide a sample script for the final MoE stage for LLaVA-1.5?
You can enable flash_attn2 and try again. Refer to this issue: https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/25#issuecomment-1926419338

Btw, how many GPUs are you using?
I modified this line in builder.py to enable flash attention 2:

`model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation="flash_attention_2", **kwargs)`

but it still goes OOM. I'm using 8x 40G A100s.
Could you post your command?
`moe_mode="sparse" num_experts=1 top_k_experts=1 use_residual=False router_aux_loss_coef=0.01 JSON_FOLDER="ft_json" IMAGE_FOLDER="train_image_video"
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py \ --moe_enable False --num_experts ${num_experts} --top_k_experts ${top_k_experts} --capacity_factor 1.5 \ --moe_layers_idx 0 5 10 \ --moe_mode ${moe_mode} --use_residual ${use_residual} --router_aux_loss_coef ${router_aux_loss_coef} \ --train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg \ --deepspeed ./scripts/zero2.json \ --model_name_or_path $(pretrained llava weight) \ --version v1 \ --per_device_train_batch_size 1 \ --per_device_eval_batch_size 16 \ --gradient_accumulation_steps 16 `
The rest remains consistent with llava
`moe_mode="sparse" num_experts=1 top_k_experts=1 use_residual=False router_aux_loss_coef=0.01 JSON_FOLDER="ft_json" IMAGE_FOLDER="train_image_video"
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed moellava/train/train_mem.py --moe_enable False --num_experts ${num_experts} --top_k_experts ${top_k_experts} --capacity_factor 1.5 --moe_layers_idx 0 5 10 --moe_mode ${moe_mode} --use_residual ${use_residual} --router_aux_loss_coef ${router_aux_loss_coef} --train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg --deepspeed ./scripts/zero2.json --model_name_or_path $(pretrained llava weight) --version v1 --per_device_train_batch_size 1 --per_device_eval_batch_size 16 --gradient_accumulation_steps 16 `
The rest remains consistent with llava
We will check it later. Could you try another model, such as phi or stablelm?
OK, but the point for me is to test the effect of an extra MoE stage on an already-trained model, so I am currently working with my trained LLaVA.
Hi, have you tested the results for the llava_llama version? Would an extra MoE stage improve the original LLaVA results?