2U1 / Molmo-Finetune

An open-source implementation for fine-tuning Molmo-7B-D and Molmo-7B-O by allenai.
Apache License 2.0

About merge_lora error #4

Open Algabeno opened 1 month ago

Algabeno commented 1 month ago

The following error occurred when I used the merge script. Is this the correct way to run the merge LoRA script?

    python src/merge_lora_weights.py \
        --model-path ./output/lora_vision_test \
        --model-base $MODEL_NAME \
        --save-model-path ./Molmo-7B-D-1009

    Traceback (most recent call last):
      File "/root/autodl-tmp/Molmo-Finetune-master/src/merge_lora_weights.py", line 27, in <module>
        merge_lora(args)
      File "/root/autodl-tmp/Molmo-Finetune-master/src/merge_lora_weights.py", line 6, in merge_lora
        processor, model = load_pretrained_model(model_path=args.model_path, model_base=args.model_base,
      File "/root/autodl-tmp/Molmo-Finetune-master/src/utils.py", line 45, in load_pretrained_model
        model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, trust_remote_code=True, **kwargs)
      File "/root/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 557, in from_pretrained
        cls.register(config.__class__, model_class, exist_ok=True)
      File "/root/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 584, in register
        raise ValueError(
    ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.Molmo-7B-D-0924.config_molmo.MolmoConfig'> and you passed <class 'transformers_modules.lora_vision_test.config_molmo.MolmoConfig'>. Fix one of those so they match!

2022DRXdeft commented 1 month ago

Hello, may I ask which dataset you used for fine-tuning? If possible, could you provide me with a link? Thank you very much!

Algabeno commented 1 month ago

> Hello, may I ask which dataset you used for fine-tuning? If possible, could you provide me with a link? Thank you very much!

I used this dataset, but I applied some pre-processing before training. You need to follow the LLaVA specification.

Algabeno commented 1 month ago

I tried forcing the LoRA fine-tuned model to load with the configuration of model_base, so that the same configuration class is used during loading and merging:

    lora_cfg_pretrained = AutoConfig.from_pretrained(model_base)

but I got a new error:

    Traceback (most recent call last):
      File "/root/autodl-tmp/Molmo-Finetune-master/src/merge_lora_weights.py", line 27, in <module>
        merge_lora(args)
      File "/root/autodl-tmp/Molmo-Finetune-master/src/merge_lora_weights.py", line 6, in merge_lora
        processor, model = load_pretrained_model(model_path=args.model_path, model_base=args.model_base,
      File "/root/autodl-tmp/Molmo-Finetune-master/src/utils.py", line 48, in load_pretrained_model
        token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features
      File "/root/miniconda3/envs/molmo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
    AttributeError: 'MolmoForCausalLM' object has no attribute 'lm_head'

2022DRXdeft commented 1 month ago

Thank you very much for your help. I hope the author can reply to you as soon as possible. I am a beginner and may not be able to provide assistance with your question. Sorry.

2U1 commented 1 month ago

@Algabeno Sorry, I have a resource issue right now, so I can't properly debug the problem. I'll look into it.

Okay, I found another problem: Molmo uses a different name for lm_head than other models; it's ff_out. I'll change a few other things and check that the proper config is saved.
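A minimal sketch of looking the output head up by name instead of assuming lm_head exists; the helper function is made up here, and the exact module path inside MolmoForCausalLM is not shown in this thread:

    # Molmo names its output projection "ff_out" rather than "lm_head", so search
    # the module tree by name suffix instead of accessing model.lm_head directly.
    def find_output_head(model):
        for name, module in model.named_modules():
            if name.endswith("ff_out"):
                return name, module
        raise AttributeError("no ff_out module found in the model")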

2U1 commented 1 month ago

@Algabeno Also, could you please share the config file that you generated with the code?

Algabeno commented 1 month ago

> @Algabeno Also, could you please share the config file that you generated with the code?

Thank you very much for your reply. Do you mean the config.json or adapter_config.json I generated after fine-tuning? adapter_config.json config.json

2022DRXdeft commented 1 month ago

> @Algabeno Also, could you please share the config file that you generated with the code?

> Thank you very much for your reply. Do you mean the config.json or adapter_config.json I generated after fine-tuning? adapter_config.json config.json

Sorry, I am a beginner. Do I need to add these configuration files to the original code myself? If so, where should they be added?

Algabeno commented 1 month ago

> @Algabeno Also, could you please share the config file that you generated with the code?

> Thank you very much for your reply. Do you mean the config.json or adapter_config.json I generated after fine-tuning? adapter_config.json config.json

> Sorry, I am a beginner. Do I need to add these configuration files to the original code myself? If so, where should they be added?

Both files are generated automatically after fine-tuning finishes.
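A quick way to confirm both files were written, as a sketch (the output path comes from --output_dir in the training script later in this thread):

    # Check that the auto-generated LoRA config files exist after training.
    import os

    out_dir = "output/lora_vision_test"
    for fname in ("adapter_config.json", "config.json"):
        path = os.path.join(out_dir, fname)
        print(fname, "found" if os.path.exists(path) else "missing")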

2022DRXdeft commented 1 month ago

> @Algabeno Also, could you please share the config file that you generated with the code?

> Thank you very much for your reply. Do you mean the config.json or adapter_config.json I generated after fine-tuning? adapter_config.json config.json

> Sorry, I am a beginner. Do I need to add these configuration files to the original code myself? If so, where should they be added?

> Both files are generated automatically after fine-tuning finishes.

Thank you very much for your patience!

Algabeno commented 1 month ago

> @Algabeno Sorry, I have a resource issue right now, so I can't properly debug the problem. I'll look into it.

> Okay, I found another problem: Molmo uses a different name for lm_head than other models; it's ff_out. I'll change a few other things and check that the proper config is saved.

Thank you for reminding me. I have made the following changes to the code, and it now runs correctly:

    token_num, token_dim = ff_out_layer.out_features, ff_out_layer.in_features

    if ff_out_layer.weight.shape[0] != token_num:
        print(f"Resizing ff_out.weight from {ff_out_layer.weight.shape} to ({token_num}, {token_dim})")
        ff_out_layer.weight = torch.nn.Parameter(
            torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype)
        )
        print(f"Resizing wte.weight from {wte_layer.weight.shape} to ({token_num}, {token_dim})")
        wte_layer.weight = torch.nn.Parameter(
            torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype)
        )
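One caveat with the snippet above (and likely what 2U1 means later by re-initializing the weight when merging): torch.empty allocates uninitialized memory, so the trained values are discarded. A hedged variant that keeps the existing rows, reusing the same names as the snippet above, might look like:

    # Copy the rows that already exist instead of allocating uninitialized memory,
    # so the trained ff_out weights are not wiped during merging.
    old_weight = ff_out_layer.weight.data
    new_weight = torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype)
    rows = min(old_weight.shape[0], token_num)
    new_weight[:rows] = old_weight[:rows]
    ff_out_layer.weight = torch.nn.Parameter(new_weight)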
2U1 commented 1 month ago

@Algabeno It seems Molmo can't load the config from a local path. I'll try to fix this. I'm not sure, but I think that could break the model, so I'll try to write better code for the LoRA path. Since the model is a preview version, these kinds of issues really help.

Thank you!

Algabeno commented 1 month ago

Thank you for your patient responses. After fine-tuning the model with my dataset, I found that the accuracy is relatively low. Could you give me some suggestions?

The dataset is an object detection dataset. The script I used is finetune_lora_vision.sh, with the following parameters:

    deepspeed src/training/train.py \
    --lora_enable True \
    --vision_lora True \
    --lora_rank 64 \
    --lora_alpha 128 \
    --lora_dropout 0.05 \
    --num_lora_modules -1 \
    --deepspeed scripts/zero3.json \
    --model_id $MODEL_NAME \
    --data_path ./llava_annotations.json \
    --image_folder ./arxiv_paper_images \
    --freeze_vision_tower False \
    --freeze_llm False \
    --tune_projector True \
    --bf16 True \
    --fp16 False \
    --disable_flash_attn2 False \
    --output_dir output/lora_vision_test \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --gradient_checkpointing False \
    --report_to tensorboard \
    --lazy_preprocess True \
    --save_strategy "steps" \
    --save_steps 1400 \
    --save_total_limit 10 \
    --dataloader_num_workers 4

The dataset structure is as follows:

    {
      "id": "1",
      "image": "arxiv_2305_02412_0.png",
      "conversations": [
        { "from": "human", "value": "<image> What are the main elements visible in this image?" },
        { "from": "gpt", "value": "The image contains the following elements: - 1 ignored region(s) located at: [[857.65, 564.3, 567.8, 413.6]]" },
        { "from": "human", "value": "What is the structure of the table in the image?" },
        { "from": "gpt", "value": "There is no table present in the image." },
        { "from": "human", "value": "What information can be extracted from the figure in the image?" },
        { "from": "gpt", "value": "There is no figure present in the image." }
      ]
    }


2U1 commented 1 month ago

@Algabeno I think it could be the re-initialization of the weights when merging. I'll try to find the proper way. Another thing is that, when using fp16 or bf16, the results change compared with fp32.

I think the model should be a bit more stable, and my code too. I'll keep developing the code for practical use.

2U1 commented 1 month ago

@Algabeno Can you try loading the model when merging without passing a config?

    model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs)

I think it's gonna work.

Algabeno commented 1 month ago

> @Algabeno Can you try loading the model when merging without passing a config? model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs) I think it's gonna work.

@2U1 "Yes, the code runs and makes predictions normally, but it still doesn't achieve the results I am aiming for. Have you ever tried adding some kind of vision dataset for fine-tuning?"

2U1 commented 1 month ago

@Algabeno I tried to, but at that time it only supported fp32, so I couldn't test it with my full dataset. And for now I have a resource issue, so I'll need to try a week later.

Is the result worse than the original model?

Also, when you converted to the llava-type data, did you erase <image> or add \n after the <image> token?

Algabeno commented 1 month ago

@2U1 "Today I tried different precisions like FP16, FP32, as well as different prompts and epochs to fine-tune the model, but the prediction results were very poor. I kept the in my prompt. Do you think erasing or adding \n after the token would improve the model accuracy? I will try your suggestions later."

@Algabeno I tried to but at that time, it only supported fp32. So I couldn't test it with my full dataset. And for now, I have a issue with my resource so need to try a week later.

Is result worse than the original model?

Also when you converted to the llava_type of data, did you erase <image> or added \n after the <image> token?

2U1 commented 1 month ago

@Algabeno Molmo's input does not need the <image> token, so I wrote code to remove it (I made the dataset format compatible with the LLaVA dataset for ease of use). However, I only filter the exact pattern <image>\n. So if you did not add \n or did not erase the <image> token, it will be passed through as input, and I think that could disrupt the model's performance. Also, as soon as my resource issue is solved, I'll test the model with my dataset too.
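A minimal illustration of the filtering behavior described above, using the prompt from the dataset sample earlier in this thread (variable names are made up):

    # Only the exact "<image>\n" pattern is stripped, so "<image> " followed by a
    # space survives and is passed to the model as literal text.
    prompt = "<image> What are the main elements visible in this image?"
    cleaned = prompt.replace("<image>\n", "")
    print(cleaned)  # the <image> token is still present because the pattern did not match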

Algabeno commented 1 month ago

> @Algabeno Molmo's input does not need the <image> token, so I wrote code to remove it (I made the dataset format compatible with the LLaVA dataset for ease of use). However, I only filter the exact pattern <image>\n. So if you did not add \n or did not erase the <image> token, it will be passed through as input, and I think that could disrupt the model's performance. Also, as soon as my resource issue is solved, I'll test the model with my dataset too.

@2U1
Thank you for your patient response. After removing \n and retraining, the model's accuracy did not improve. I will try other methods to better fine-tune this model.

2U1 commented 1 month ago

@Algabeno Actually, I was saying you should append '\n' to the <image> token when converting to LLaVA style. If that doesn't improve the performance of the model, trying full fine-tuning with offload might help.
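For reference, a minimal DeepSpeed ZeRO-3 config with CPU offload, as a sketch; the repo's scripts/zero3.json is the starting point, and whether it ships a separate offload variant is not shown in this thread:

    {
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true }
      },
      "bf16": { "enabled": "auto" },
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "gradient_clipping": "auto"
    }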

Algabeno commented 1 month ago

> @Algabeno Actually, I was saying you should append '\n' to the <image> token when converting to LLaVA style. If that doesn't improve the performance of the model, trying full fine-tuning with offload might help.

@2U1 I misunderstood your meaning, but I have already tried following the LLaVA specification by adding '\n' to the <image> token. The model's performance did not improve. I will follow your advice and try full fine-tuning.

2U1 commented 1 month ago

@Algabeno Thanks. I'll try to figure out the reason.

2U1 commented 1 month ago

@Algabeno I've removed the code that skipped saving the wte weight. This might change the result.

Algabeno commented 1 month ago

> @Algabeno I've removed the code that skipped saving the wte weight. This might change the result.

Thank you for your patience. After multiple adjustments to the model yesterday, I found that the results still did not meet my expectations. I achieved ideal results using another model, microsoft/Florence-2-large, and it seems that molmo is not suitable for fine-tuning on specific vision tasks.

2U1 commented 1 month ago

@Algabeno Thanks for letting me know. I'll keep checking whether there's something I missed.

Algabeno commented 1 month ago

> @Algabeno Thanks for letting me know. I'll keep checking whether there's something I missed.

@2U1 Thank you for sharing. It seems that the poor performance of the model might be due to the limited size of the dataset I constructed. I will try using larger datasets like ImageNet-22k, Object365, and OpenImages for full fine-tuning.