Model Evaluation - Githubissues

BeachWang commented 9 months ago

Hi,

I have trained imp with lora. However, it does not process the reference when I run the evaluation scripts. Following is the output when I eval pope.

bruceisme commented 9 months ago

Hi, thanks for your interest. You can add --model-base microsoft/phi-2 \ to the following code in pope.sh. Then it might work and start to run the scripts. https://github.com/MILVLG/imp/blob/48bdc607089385e017f7b5c1f16762625a39eaf1/scripts/eval/pope.sh#L22-L29 And we will fix this problem as soon as possible. Let me know if there's any other questions.

BeachWang commented 9 months ago

Hi, I have added the --model-base. But there is another problem.

ParadoxZW commented 9 months ago

Hi @BeachWang ! This bug is also caused by the update of phi-2 repo itself. More specifically, the phi-2 team has changed the parameter names in model weights, i.e., the keys in .safetensors files. Thus there is a severe mismatch between current model weights of phi-2 and our codebase, which leading this loading error.

Our code cannot load "new version" of phi-2 unless we change our code to the new model definition of phi-2. But in that case, old checkpoints will also be unavailable. So we decide to maintain our code and recommend users to download old version of phi-2 repo.

Specifically, you can run following python script to download phi-2 to a local folder:

import os
# os.environ["https_proxy"] = "http://xxx.xxx.xxx.xxx:xx"  # in case you need proxy to access Huggingface Hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="microsoft/phi-2", 
    revision="d3186761bf5c4409f7679359284066c25ab668ee",
    local_dir='checkpoints/base/phi-2',
    local_dir_use_symlinks=False
)

And use checkpoints/base/phi-2 as base model argument for future experiments.

BeachWang commented 9 months ago

Hi @ParadoxZW . Do we need to roll back the code for #7 if we use the old version of phi-2, since the codes for training have been made compatible to the new version of phi-2.

ParadoxZW commented 9 months ago

No need to roll back. The code should always work for old version of phi-2 for both training and evaluation.

MIL-VLG commented 9 months ago

See our latest update :) @BeachWang

BeachWang commented 9 months ago

Hi, I have succeed in reproducing your work. But I find there are some problems in the eval of pope and sqa. Specifically, it should be $EVAL_CKPT but not $CKPT in line 27 and 47 in pope.sh. As for sqa, I got the following error when running sqa.sh.

The reason seems be that imp use model_vqa_loader to eval sqa, but LLaVA use model_vqa_science actually.

bruceisme commented 9 months ago

Hi, thanks for your reminder. We will fix these bugs in next update.

In scienceQA, you should use model_vqa_science for llava_test_CQM-A.json, and we rewriting a question file scienceqa_multi.jsonl which follows multiple-choice's prompt in LLaVA's Evaluation.md and fits to model_vqa_loader. You can see the detail in next update too.

MILVLG / imp

Model Evaluation #8