FudanDISC / ReForm-Eval

A benchmark for evaluating the capabilities of large vision-language models (LVLMs)
Apache License 2.0

The test results of lynx on the MSCOCO ITM task are questionable #1

Open OPilgrim opened 10 months ago

OPilgrim commented 10 months ago

First of all, thank you for the great work! I ran into a few issues while following the tutorial to reproduce the results:

I first followed the tutorial to measure lynx's accuracy (ACC) on the MSCOCO_ITM task, i.e., Table 18 in the paper. I used the following command:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2  run_eval.py \
    --model lynx  --model_name models/interfaces/lynx/configs/LYNX.yaml \
    --dataset_name MSCOCO --output_dir output/lynx/MSCOCO/test_generation/ \
    --per_gpu_eval_batch_size 4 --formulation SingleChoice \
    --infer_method generation --do_eval --half_evaluation  --dataset_duplication 1 \
    --in_context_sample --option_mark upper \
    --dataset_config build/configs/ImageTextMatching_val.yaml \
    --offline_hf

I used generation as the inference method, but the results I got were rather strange:

2023-11-01 16:00:35,236 ReForm-Eval Evaluation INFO: the evalueted SingleChoice result: 0.0
2023-11-01 16:00:35,236 ReForm-Eval Evaluation INFO: the format hit rate is 0.0

If I use likelihood as the inference method, the results are still different from those in the paper:

2023-11-01 15:39:14,806 ReForm-Eval Evaluation INFO: the evalueted SingleChoice result: 0.5183333333333333
2023-11-01 15:39:14,806 ReForm-Eval Evaluation INFO: the format hit rate is 1.0

I'm at a loss to understand, and I hope you can help to point out where the problem may be.

Aweminus commented 10 months ago

Hello, thanks for trying our benchmark!

I ran the command mentioned above and got a reasonable result with generation:

2023-11-01 19:52:21,574 ReForm-Eval Evaluation INFO: the evalueted SingleChoice result: 0.49833333333333335
2023-11-01 19:52:21,575 ReForm-Eval Evaluation INFO: the format hit rate is 0.95

Can you share some samples of the output JSON files?

When you use likelihood as the inference mode, you do not need to add --in_context_sample, and you need to change the value of --dataset_duplication.
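
For example, a likelihood-mode run would look roughly like this (a sketch based on the command above; the --dataset_duplication value and the output directory are placeholders to adjust to the paper's setting):

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2  run_eval.py \
    --model lynx  --model_name models/interfaces/lynx/configs/LYNX.yaml \
    --dataset_name MSCOCO --output_dir output/lynx/MSCOCO/test_likelihood/ \
    --per_gpu_eval_batch_size 4 --formulation SingleChoice \
    --infer_method likelihood --do_eval --half_evaluation --dataset_duplication 5 \
    --option_mark upper \
    --dataset_config build/configs/ImageTextMatching_val.yaml \
    --offline_hf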

OPilgrim commented 10 months ago

Sorry for taking so long to reply; my machine recently broke down and I have not been able to run the experiments. I checked the output, and the model output is quite confusing. The log.txt:

2023-11-13 16:08:57,188 ReForm-Eval Evaluation INFO: Evaluating with -1 GPUs
2023-11-13 16:08:57,189 ReForm-Eval Evaluation INFO: Loading model: lynx with configure: {"device": "cuda", "half": true, "inference_method": "generation", "model_name": "models/interfaces/lynx/configs/LYNX.yaml"}
2023-11-13 16:10:12,067 ReForm-Eval Evaluation INFO: Each GPU consumes memory of 17025
2023-11-13 16:10:12,067 ReForm-Eval Evaluation INFO: Using upper option mark for the single-choice questions
2023-11-13 16:10:12,081 ReForm-Eval Evaluation INFO: Evaluating model: lynx with configure: {"device": "cuda", "half": true, "inference_method": "generation", "model_name": "models/interfaces/lynx/configs/LYNX.yaml"}
2023-11-13 16:10:13,420 ReForm-Eval Evaluation INFO: ***** Runing Evaluation *****
2023-11-13 16:10:13,421 ReForm-Eval Evaluation INFO:   Num examples = 600
2023-11-13 16:10:13,421 ReForm-Eval Evaluation INFO:   Batch size = -4

The MSCOCO_SingleChoice_generation_lynx_LYNX_rank-1.json:

...... {"sample_id": 599, "anno": "Two women standing next to each other with one holding video game controllers.", "answer": "1", "answer_options": ["no", "yes"], "question": "Are the image and caption '{}' representing the same scene? Kindly respond with one following option.", "history": [{"from": "human", "value": "What is the shape of this image? Options: (A) rectangle; (B) circle."}, {"from": "assistant", "value": "The answer is (A) rectangle;"}], "text": "User: What is the shape of this image? Options: (A) rectangle; (B) circle.\nBot: The answer is (A) rectangle;\nUser: Are the image and caption '{}' representing the same scene? Kindly respond with one following option. Options: (A) no; (B) yes.\nBot: The answer is", "question_with_option": "Are the image and caption '{}' representing the same scene? Kindly respond with one following option. Options: (A) no; (B) yes.", "prediction": "js\u0447\u0430\u00e3aftableever\u0446\u0438\u043djs \u0447\u0435 -js\u5206ils Arcjsarcathcienttresjs\u00e3ostildarcangular \u0447\u0435~SelectorFIX compared"}]

Maybe the pre-trained lynx model is not loaded correctly. Are your LYNX settings the same as mine? My LYNX.yaml:

## Data
image_rdir: "./images/"
# put your test file in jsonl format
test_files: [ "./data/Open_VQA_images.jsonl" ]
# change this prompt for different task
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, need to change to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"
data: {
  num_frames: 5,
}
## Model
vision_encoder: 'eva_vit_1b'
video_encoding: 'concate'
add_frame_pos: True
LLM: 'vicuna-7b'
use_flash_attn: False
use_adapter: True
adapter_freq: 2
bridge: 'resampler'
bridge_depth: 3
num_bridge_tokens: 32
## General
use_left_pad: True
lower_text: True
freeze_vit: True
freeze_llm: True
image_res: 420
image_mean: [ 0.48145466, 0.4578275, 0.40821073 ]
image_std: [ 0.26862954, 0.26130258, 0.27577711 ]
## Testing
checkpoint: "./data/finetune_lynx.pt"
## infer params
max_input_tokens: 40
batch_size_test: 16
max_new_tokens: 64
min_length: 2
num_beams: 5
length_penalty: -2.0
top_p: 0.9
top_k: 3
no_repeat_ngram_size: 2
apply_lemmatizer: False
use_nucleus_sampling: True

Also, when I run it, the log shows that the adapter parameters have been reinitialized. Is this normal?

### Building LLM (Freeze: True)
### LLM label_smoothing:  0.0
### Use Flash Attn False
### Add adapters to:  [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.43s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at models/interfaces/lynx/data/vicuna-7b and are newly initialized: ['model.layers.28.output_adapter.adapter_up.bias', 'model.layers.0.output_adapter.adapter_up.weight', 'model.layers.4.output_adapter.adapter_norm_before.bias', 'model.layers.30.output_adapter.adapter_up.bias', 'model.layers.14.output_adapter.adapter_down.bias', 'model.layers.28.output_adapter.adapter_norm_before.weight', ......
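
In case it helps, here is a minimal check (my own sketch, assuming finetune_lynx.pt stores a plain state_dict, possibly nested under a "model" or "module" key) to confirm that the checkpoint actually contains the adapter weights reported as newly initialized:

import torch

# Load the lynx checkpoint on CPU and unwrap common nesting keys.
ckpt = torch.load("./data/finetune_lynx.pt", map_location="cpu")
if isinstance(ckpt, dict):
    for key in ("model", "module", "state_dict"):
        if key in ckpt:
            ckpt = ckpt[key]
            break

# The adapter tensors that HuggingFace reports as "newly initialized" should
# appear here; the warning is only harmless if this checkpoint is loaded on
# top of the base vicuna-7b weights afterwards.
adapter_keys = [k for k in ckpt.keys() if "output_adapter" in k]
print(f"adapter tensors in checkpoint: {len(adapter_keys)}")
print(adapter_keys[:5])

If these keys are missing, or if the checkpoint path is wrong, the adapters would stay randomly initialized, which could explain the garbled generations.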

Aweminus commented 10 months ago

Our LYNX.yaml is shown below:

## Data
image_rdir: "./images/"

# put your test file in jsonl format
test_files: [ "./data/Open_VQA_images.jsonl" ]

# change this prompt for different task
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, need to change to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"

data: {
  num_frames: 5,
}

## Model
vision_encoder: 'eva_vit_1b'
video_encoding: 'concate'
add_frame_pos: True

LLM: 'vicuna-7b'
LLM_base: '/remote-home/share/LLM_CKPT/vicuna-7B-v1.1/'
use_flash_attn: False
use_adapter: True
adapter_freq: 2

bridge: 'resampler'
bridge_depth: 3
num_bridge_tokens: 32

## General
use_left_pad: True
lower_text: True
freeze_vit: True
freeze_llm: True
image_res: 224
image_mean: [ 0.48145466, 0.4578275, 0.40821073 ]
image_std: [ 0.26862954, 0.26130258, 0.27577711 ]

## Testing
checkpoint: "/remote-home/share/multimodal-models/lynx/finetune_lynx.pt"

## infer params
max_input_tokens: 40
batch_size_test: 16
max_new_tokens: 64
min_length: 2
num_beams: 5
length_penalty: -2.0
top_p: 0.9
top_k: 3
no_repeat_ngram_size: 2
apply_lemmatizer: False
use_nucleus_sampling: True
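
As a quick sanity check, the two configs can be diffed programmatically (a sketch; lynx_yours.yaml and lynx_ours.yaml are hypothetical local copies of the two LYNX.yaml files pasted in this thread):

import yaml  # PyYAML

# Hypothetical local copies of the two configs quoted above.
with open("lynx_yours.yaml") as f:
    yours = yaml.safe_load(f)
with open("lynx_ours.yaml") as f:
    ours = yaml.safe_load(f)

# Print every top-level key whose value differs or exists in only one file.
for key in sorted(set(yours) | set(ours)):
    if yours.get(key) != ours.get(key):
        print(f"{key}: yours={yours.get(key)!r} vs ours={ours.get(key)!r}")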

Have you put the lynx repository in /path/to/ReForm-Eval/models/interfaces/lynx?

OPilgrim commented 10 months ago

Yes, I cloned it from https://github.com/bytedance/lynx-llm.git