BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

Encountering problems when using the MMBench dataset to reproduce the evaluation task #58

Closed: HuBocheng closed this issue 1 month ago

HuBocheng commented 2 months ago

I am using the following bash script and command with the MMBench dataset for replication purposes. The model weights at ./checkpoints-lora-total/merged_Bunny-phi-siglip were downloaded from BAAI/Bunny-v1_0-3B · Hugging Face.

#!/bin/bash

SPLIT="MMBench_DEV_EN_legacy"
LANG=en
MODEL_TYPE=phi-2
MODEL_BASE=/root/weight/phi-2
TARGET_DIR=bunny-lora-phi-2

python -m bunny.eval.model_vqa_mmbench \
    --model-path ./checkpoints-lora-total/merged_Bunny-phi-siglip \
    --model-base $MODEL_BASE \
    --model-type $MODEL_TYPE \
    --question-file ./datasets/mmbench/$SPLIT.tsv \
    --answers-file ./eval/mmbench/answers/$SPLIT/$TARGET_DIR.jsonl \
    --lang $LANG \
    --single-pred-prompt \
    --temperature 0 \
    --conv-mode bunny

# Create the upload directory first; both levels may not exist yet.
mkdir -p eval/mmbench/answers_upload/$SPLIT

python eval/mmbench/convert_mmbench_for_submission.py \
    --annotation-file ./datasets/mmbench/$SPLIT.tsv \
    --result-dir ./eval/mmbench/answers/$SPLIT \
    --upload-dir ./eval/mmbench/answers_upload/$SPLIT \
    --experiment $TARGET_DIR

The command used was: CUDA_VISIBLE_DEVICES=0 sh script/eval/lora/mmbench.sh

The following error occurred:

Loading Bunny from base model...
Loading checkpoint shards:   0%|                                                   | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/bunny/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/bunny/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/Bunny/bunny/eval/model_vqa_mmbench.py", line 167, in <module>
    eval_model(args)
  File "/root/Bunny/bunny/eval/model_vqa_mmbench.py", line 59, in eval_model
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "/root/Bunny/bunny/model/builder.py", line 109, in load_pretrained_model
    model = BunnyPhiForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True,
  File "/opt/conda/envs/bunny/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3677, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/envs/bunny/lib.python3.10/site-packages/transformers/modeling_utils.py", line 4104, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/envs/bunny/lib.python3.10/site-packages/transformers/modeling_utils.py", line 886, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/envs/bunny/lib.python3.10/site-packages/accelerate/utils/modeling.py", line 358, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([51200, 2560]) in "weight" (which has shape torch.Size([50295, 2560])), this look incorrect.
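
A quick way to see that the mismatch is the vocabulary dimension of the word embeddings: the base phi-2 config pads its vocab to 51200, while the merged checkpoint's config records a smaller vocab. A hedged diagnostic I ran (paths as in the script above, not part of the repo):

# Expect 51200 for the phi-2 base and 50295 for the merged Bunny checkpoint,
# matching the two shapes in the ValueError above.
grep '"vocab_size"' /root/weight/phi-2/config.json
grep '"vocab_size"' ./checkpoints-lora-total/merged_Bunny-phi-siglip/config.json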

Subsequently, I attempted to merge the LoRA weights manually. I downloaded the weights from BAAI/bunny-phi-2-siglip-lora · Hugging Face, saved them as ./checkpoints-lora-total/unmerged_Bunny-phi-siglip, and ran the following command:

python script/merge_lora_weights.py \
        --model-path ./checkpoints-lora-total/unmerged_Bunny-phi-siglip \
        --model-base /root/weight/phi-2 \
        --model-type phi-2 \
        --save-model-path ./checkpoints-lora-total/hand_merged_Bunny-phi-siglip

However, I still encountered an error:

Traceback (most recent call last):
  File "/root/Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "/root/Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "/root/Bunny/bunny/model/builder.py", line 128, in load_pretrained_model
    mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
  File "/opt/conda/envs/bunny/lib.python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/bunny/lib.python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/bunny/lib.python3.10/site-packages/torch/serialization.py, line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints-lora-total/unmerged_Bunny-phi-siglip/mm_projector.bin'
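
For context, listing the downloaded folder shows only adapter-style files (the names below are what a LLaVA-style LoRA checkpoint typically contains; an assumption, not a verified listing), so mm_projector.bin is genuinely absent rather than misplaced:

ls ./checkpoints-lora-total/unmerged_Bunny-phi-siglip
# adapter_config.json  adapter_model.safetensors  config.json  non_lora_trainables.bin
# No mm_projector.bin: the loader only looks for that file on its non-LoRA code path.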

I would greatly appreciate your assistance in understanding why these errors are occurring.

Isaachhh commented 2 months ago

For evaluating our merged weights: they are already full (merged) models, so evaluate them as described in [evaluation_full.md](https://github.com/BAAI-DCAI/Bunny/blob/main/script/eval/full/evaluation_full.md), i.e. run

CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmbench.sh

For evaluating LoRA tuning models: you can use script/merge_lora_weights.py to merge the LoRA weights and the base LLM, and then evaluate the merged model with the same full-model scripts.

For merging weights: you may need to mkdir the save path first, because both levels of the folder don't exist.
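
Concretely, a minimal sketch of the intended flow (local paths and folder names here are illustrative assumptions, not the repo's prescribed layout; the full-model mmbench.sh also expects its model variables to be set accordingly):

# Sketch: merge the downloaded LoRA weights into the base LLM first.
# Create the save directory up front, since both levels may not exist yet.
mkdir -p ./checkpoints-merged
python script/merge_lora_weights.py \
    --model-path ./bunny-phi-2-siglip-lora \
    --model-base /path/to/phi-2 \
    --model-type phi-2 \
    --save-model-path ./checkpoints-merged/bunny-phi-2-siglip
# Then evaluate the merged checkpoint as a full model:
CUDA_VISIBLE_DEVICES=0 sh script/eval/full/mmbench.sh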

HuBocheng commented 2 months ago

I have identified the source of the issue. After downloading the unmerged LoRA weights, I renamed the folder to unmerged_Bunny-phi-siglip. However, load_pretrained_model (in bunny/model/builder.py, which merge_lora_weights.py calls) decides whether to load LoRA weights based on whether the model name contains the word "lora". Since my folder name did not include "lora", the script took the non-LoRA path, mistakenly attempted to load pretrained weights, and searched for the mm_projector.bin file.

# bunny/model/builder.py, inside load_pretrained_model:
if 'lora' in model_name.lower() and model_base is None:
    warnings.warn(
        'There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument.')
if 'lora' in model_name.lower() and model_base is not None:
    lora_cfg_pretrained = AutoConfig.from_pretrained(model_path)

By including "lora" in the folder name, the script ran successfully as below. I appreciate your understanding and thank you for your patience as I worked through this issue. :thumbsup::smile:

(screenshot: the merge script completes successfully)

Isaachhh commented 2 months ago

Great!

Isaachhh commented 1 month ago

Closing the issue for now as there's no further discussion. Feel free to reopen it if there are any other questions.