OPilgrim opened this issue 1 year ago
Hello, thanks for trying our benchmark!
I ran the command mentioned above and got a reasonable result with `generation`:

```
2023-11-01 19:52:21,574 ReForm-Eval Evaluation INFO: the evalueted SingleChoice result: 0.49833333333333335
2023-11-01 19:52:21,575 ReForm-Eval Evaluation INFO: the format hit rate is 0.95
```
Can you share some samples of the output json files?
When you use `likelihood` as the inference mode, you do not need to add `--in_context_sample`, and you need to change the value of `--dataset_duplication`.
Sorry for taking so long to reply. The machine broke down recently and I have not been able to run the experiment.
I checked the output, but the model output is quite confusing.

The `log.txt`:

```
2023-11-13 16:08:57,188 ReForm-Eval Evaluation INFO: Evaluating with -1 GPUs
2023-11-13 16:08:57,189 ReForm-Eval Evaluation INFO: Loading model: lynx with configure: {"device": "cuda", "half": true, "inference_method": "generation", "model_name": "models/interfaces/lynx/configs/LYNX.yaml"}
2023-11-13 16:10:12,067 ReForm-Eval Evaluation INFO: Each GPU consumes memory of 17025
2023-11-13 16:10:12,067 ReForm-Eval Evaluation INFO: Using upper option mark for the single-choice questions
2023-11-13 16:10:12,081 ReForm-Eval Evaluation INFO: Evaluating model: lynx with configure: {"device": "cuda", "half": true, "inference_method": "generation", "model_name": "models/interfaces/lynx/configs/LYNX.yaml"}
2023-11-13 16:10:13,420 ReForm-Eval Evaluation INFO: ***** Runing Evaluation *****
2023-11-13 16:10:13,421 ReForm-Eval Evaluation INFO: Num examples = 600
2023-11-13 16:10:13,421 ReForm-Eval Evaluation INFO: Batch size = -4
```

The `MSCOCO_SingleChoice_generation_lynx_LYNX_rank-1.json`:

```json
...... {"sample_id": 599, "anno": "Two women standing next to each other with one holding video game controllers.", "answer": "1", "answer_options": ["no", "yes"], "question": "Are the image and caption '{}' representing the same scene? Kindly respond with one following option.", "history": [{"from": "human", "value": "What is the shape of this image? Options: (A) rectangle; (B) circle."}, {"from": "assistant", "value": "The answer is (A) rectangle;"}], "text": "User: What is the shape of this image? Options: (A) rectangle; (B) circle.\nBot: The answer is (A) rectangle;\nUser: Are the image and caption '{}' representing the same scene? Kindly respond with one following option. Options: (A) no; (B) yes.\nBot: The answer is", "question_with_option": "Are the image and caption '{}' representing the same scene? Kindly respond with one following option. Options: (A) no; (B) yes.", "prediction": "js\u0447\u0430\u00e3aftableever\u0446\u0438\u043djs \u0447\u0435 -js\u5206ils Arcjsarcathcienttresjs\u00e3ostildarcangular \u0447\u0435~SelectorFIX compared"}]
```
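To sanity-check the reported numbers, I eyeball the output file with a quick script. This is only a minimal sketch: the option-letter matching below is my guess at roughly what the evaluator does, not ReForm-Eval's actual parsing code.

```python
import json
import re
import string

# Rough re-computation of accuracy and format hit rate from an output file.
# ASSUMPTION: a prediction "hits" the format if it contains an option letter
# like "(A)"; the real ReForm-Eval matching rule may differ.
with open("MSCOCO_SingleChoice_generation_lynx_LYNX_rank-1.json") as f:
    samples = json.load(f)

hits, correct = 0, 0
for s in samples:
    m = re.search(r"\(([A-Z])\)", s["prediction"])
    if m:
        hits += 1
        # "answer" stores the gold option index, e.g. "1" -> option (B).
        if string.ascii_uppercase.index(m.group(1)) == int(s["answer"]):
            correct += 1

print(f"format hit rate: {hits / len(samples):.2f}")
print(f"accuracy among hits: {correct / max(hits, 1):.2f}")
```

On garbled predictions like the one above, a check like this would count a format miss, which lines up with how broken the generations look.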
Maybe the pre-trained lynx model is not loaded correctly. Are your LYNX settings the same as mine?

My `LYNX.yaml`:
```yaml
## Data
image_rdir: "./images/"
# put your test file in jsonl format
test_files: [ "./data/Open_VQA_images.jsonl" ]
# change this prompt for different task
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, need to change to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"
data: {
  num_frames: 5,
}

## Model
vision_encoder: 'eva_vit_1b'
video_encoding: 'concate'
add_frame_pos: True
LLM: 'vicuna-7b'
use_flash_attn: False
use_adapter: True
adapter_freq: 2
bridge: 'resampler'
bridge_depth: 3
num_bridge_tokens: 32

## General
use_left_pad: True
lower_text: True
freeze_vit: True
freeze_llm: True
image_res: 420
image_mean: [ 0.48145466, 0.4578275, 0.40821073 ]
image_std: [ 0.26862954, 0.26130258, 0.27577711 ]

## Testing
checkpoint: "./data/finetune_lynx.pt"

## infer params
max_input_tokens: 40
batch_size_test: 16
max_new_tokens: 64
min_length: 2
num_beams: 5
length_penalty: -2.0
top_p: 0.9
top_k: 3
no_repeat_ngram_size: 2
apply_lemmatizer: False
use_nucleus_sampling: True
```
Also, when I run it, it shows that the adapter parameters have been reinitialized. Is this normal?
```
### Building LLM (Freeze: True)
### LLM label_smoothing: 0.0
### Use Flash Attn False
### Add adapters to: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.43s/it]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at models/interfaces/lynx/data/vicuna-7b and are newly initialized: ['model.layers.28.output_adapter.adapter_up.bias', 'model.layers.0.output_adapter.adapter_up.weight', 'model.layers.4.output_adapter.adapter_norm_before.bias', 'model.layers.30.output_adapter.adapter_up.bias', 'model.layers.14.output_adapter.adapter_down.bias', 'model.layers.28.output_adapter.adapter_norm_before.weight', ......
```
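My understanding (an assumption on my side, not something the repo documents here) is that this warning is expected: the adapters are new modules that do not exist in the base vicuna checkpoint, so Hugging Face initializes them randomly, and they should then be overwritten when `finetune_lynx.pt` is loaded. A quick sketch to confirm the finetuned checkpoint actually contains those adapter tensors (assuming it is a plain state dict, possibly nested under a `"model"` key):

```python
import torch

# Check that the finetuned checkpoint contains the adapter weights that
# HF reports as "newly initialized" when loading the base LLM.
# ASSUMPTION: the checkpoint is a plain state dict, or nests one under "model".
ckpt = torch.load("./data/finetune_lynx.pt", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

adapter_keys = [k for k in state if "output_adapter" in k]
print(f"{len(adapter_keys)} adapter tensors found in the checkpoint")
for k in adapter_keys[:5]:
    print(" ", k, tuple(state[k].shape))
```

If that list came back empty, the adapters really would stay randomly initialized, which could explain garbage generations.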
Our `LYNX.yaml` is shown below:
```yaml
## Data
image_rdir: "./images/"
# put your test file in jsonl format
test_files: [ "./data/Open_VQA_images.jsonl" ]
# change this prompt for different task
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, need to change to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"
data: {
  num_frames: 5,
}

## Model
vision_encoder: 'eva_vit_1b'
video_encoding: 'concate'
add_frame_pos: True
LLM: 'vicuna-7b'
LLM_base: '/remote-home/share/LLM_CKPT/vicuna-7B-v1.1/'
use_flash_attn: False
use_adapter: True
adapter_freq: 2
bridge: 'resampler'
bridge_depth: 3
num_bridge_tokens: 32

## General
use_left_pad: True
lower_text: True
freeze_vit: True
freeze_llm: True
image_res: 224
image_mean: [ 0.48145466, 0.4578275, 0.40821073 ]
image_std: [ 0.26862954, 0.26130258, 0.27577711 ]

## Testing
checkpoint: "/remote-home/share/multimodal-models/lynx/finetune_lynx.pt"

## infer params
max_input_tokens: 40
batch_size_test: 16
max_new_tokens: 64
min_length: 2
num_beams: 5
length_penalty: -2.0
top_p: 0.9
top_k: 3
no_repeat_ngram_size: 2
apply_lemmatizer: False
use_nucleus_sampling: True
```
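Comparing the two files, a few fields differ: your `image_res` is 420 while ours is 224, we set `LLM_base` explicitly, and the `checkpoint` path is different. If it helps, here is a minimal sketch to surface top-level differences between two configs (the file names are placeholders):

```python
import yaml  # pip install pyyaml

# Minimal sketch: print top-level keys whose values differ between two configs.
def diff_yaml(path_a: str, path_b: str) -> None:
    with open(path_a) as fa, open(path_b) as fb:
        a, b = yaml.safe_load(fa), yaml.safe_load(fb)
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"{key}: {a.get(key)!r} vs {b.get(key)!r}")

diff_yaml("LYNX_yours.yaml", "LYNX_ours.yaml")  # placeholder file names
```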
Have you put the lynx repository in `/path/to/ReForm-Eval/models/interfaces/lynx`?
Yes, I cloned it from https://github.com/bytedance/lynx-llm.git.
First of all, thank you for the great work! I ran into a few issues while following the tutorial to reproduce the results.

I first followed the tutorial to evaluate lynx's ACC on the MSCOCO_ITM task, i.e., Table 18 in the paper. I used the following command:

I used `generation` as the inference method, but the results I got were rather strange. If I use `likelihood` as the inference method, the results are still different from those in the paper. I'm at a loss to understand this, and I hope you can help point out where the problem may be.