NJU-LHRS / LHRS-Bot

VGI-Enhanced multimodal large language model for remote sensing images.
Apache License 2.0

A question about VG evaluation #18

Closed xuliu-cyber closed 2 weeks ago

xuliu-cyber commented 4 weeks ago

I used the stage 3 checkpoint for evaluation, but the visual grounding results I obtained differ from those reported in the paper. Here is the log output I got, showing an accuracy of 28.75:

```
[06/27 10:52:52 train]: Full config saved to eval/vg/DIOR-RSVG/config.json
[06/27 10:52:52 train]: accelerator: gpu adjust_norm: false alignment_dim: 768 batch_size: 1 bf16: true bits: 16 config: null
data_path: /data/liux/DIOR-RSVG/JPEGImages data_target: /data/liux/DIOR-RSVG/test.json
double_quant: true dtype: float16 enable_amp: true entity: pumpkinn epochs: 2 eval: dataset: AID fp16: false generate: false gpus: 0 inf_sampler: false is_distribute: false local_rank: 0
lora: enable: false lora_alpha: 256 lora_bias: none lora_dropout: 0.05 lora_r: 128
lr: 0.0002 max_grad_norm: 0.3 model_path: /data/liux/LHRS/Stage3/FINAL.pt optimizer: adanp opts: null output: eval/vg/DIOR-RSVG project: MaskIndexNet prompt_template: llava_llama_2 quant_type: nf4 rank: 0
rgb_vision: arch: vit_large attn_pooler: num_attn_heads: 16 num_layers: 6 num_query: 144 input_patchnorm: false input_size:
[06/27 10:52:52 train]: Creating model
[06/27 10:53:56 train]: Data Length: 7500
[06/27 10:53:56 train]: Loading pretrained checkpoint from /data/liux/LHRS/Stage3/FINAL.pt
[06/27 10:53:57 train]: Loading RGB encoder.
[06/27 10:53:57 train]: After loading RGB encoder: Missing: []. Unexpected: []
[06/27 10:53:57 train]: Loadding LoRA parameters.
[06/27 12:06:29 train]: result file saved to eval/vg/DIOR-RSVG/eval_save_file.json
[06/27 12:06:29 train]: Accuracy: 28.75025651549354
[06/27 12:06:29 train]: Fail Sample: 1229
[06/27 12:06:29 train]: Accuracy With Fail Sample: 22.959685349065882
```
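For context, visual grounding on RSVG and DIOR-RSVG is usually reported as Acc@0.5, i.e. a predicted box counts as correct when its IoU with the ground-truth box is at least 0.5, and "Fail Sample" appears to count outputs that could not be parsed into a box. A generic sketch of such a metric (my own illustration, not the repository's evaluation code):

```python
# Generic Acc@0.5 sketch for box-based visual grounding (illustration only; the
# repository's evaluation script may handle failed/unparseable predictions differently).
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def acc_at_05(preds, gts, num_fail=0):
    """Percent of predictions with IoU >= 0.5, plus a variant counting failed samples as wrong."""
    hits = sum(box_iou(p, g) >= 0.5 for p, g in zip(preds, gts))
    acc = 100.0 * hits / max(len(preds), 1)
    acc_with_fail = 100.0 * hits / max(len(preds) + num_fail, 1)
    return acc, acc_with_fail
```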

pUmpKin-Co commented 4 weeks ago

Hi~ Thanks for your interest.

We simply evaluate on the test set after converting it into instruction format. The test set can be found here.
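For illustration, converting one grounding annotation into an instruction-style record might look roughly like the sketch below; the prompt wording and JSON keys here are my own assumptions, not necessarily the exact format used by LHRS-Bot:

```python
# Hypothetical sketch of turning a grounding annotation into an instruction-format
# record; the prompt wording and field names are illustrative assumptions.
import json


def to_instruction(sample: dict) -> dict:
    # `sample` is assumed to carry an image file name, a referring expression,
    # and a ground-truth box in (x1, y1, x2, y2) pixel coordinates.
    x1, y1, x2, y2 = sample["bbox"]
    return {
        "image": sample["image"],
        "conversations": [
            {"from": "human", "value": f"[VG] Where is the {sample['expression']}?"},
            {"from": "gpt", "value": f"[{x1},{y1},{x2},{y2}]"},
        ],
    }


if __name__ == "__main__":
    demo = {"image": "00001.jpg",
            "expression": "baseball field on the right",
            "bbox": [120, 48, 310, 200]}
    print(json.dumps(to_instruction(demo), indent=2))
```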

Below are the results obtained from one evaluation run.

RSVG

```
[01/18 16:38:23 train]: Full config saved to ../../Output/LHRS/stage3/zero_shot_vg/rsvg/config.json
[01/18 16:38:23 train]: accelerator: gpu adjust_norm: false alignment_dim: 768 batch_size: 1 bf16: true bits: 16 config: null
data_path: /home/aiscuser/pumpkin_dataset/InstructDataset/RSVG_Image data_target: /home/aiscuser/pumpkin_dataset/Eval/VGEvalDataset/RSVG_test.json
double_quant: true dtype: float16 enable_amp: true entity: pumpkinn epochs: 2 eval: dataset: AID fp16: false generate: false gpus: 0 inf_sampler: false is_distribute: false local_rank: 0
lora: enable: false lora_alpha: 16 lora_bias: none lora_dropout: 0.05 lora_r: 8
lr: 0.0003 max_grad_norm: 1.0 model_path: ../../Output/LHRS/stage3/checkpoints/FINAL.pt optimizer: adanp opts: null output: ../../Output/LHRS/stage3/zero_shot_vg/rsvg project: MaskIndexNet prompt_template: llava_llama_2 quant_type: nf4 rank: 0
rgb_vision: arch: vit_large attn_pooler: num_attn_heads: 16 num_layers: 6 num_query: 144 input_patchnorm: false input_size: - 224 - 224 patch_dropout: 0.0 tune_pooler: true vit_name: openai/clip-vit-large-patch14
sar_vision: activate: sigmoid alpha: 0.2 arch: base branch_temp: 0.07 decoder: heads: 12 hidden_size: 768 layers: 12 mask_color: mean mask_ratio: 0.6 focal_gamma: 1.0 in_chans: 2 input_size: - 192 - 192 loss_weight: 1.0 n_queries: 256 online_temp: 0.1 reduction: none residual: false unmask_weight: 0.0 warmup_branch_temp: 0.04 warmup_branch_temp_epochs: 2
schedule: decay_epochs: 30 decay_rate: 0.1 gamma: 0.1 min_lr: 2.0e-05 multisteps: [] name: cosine warmup_epochs: 100 warmup_factor: 0.01 warmup_method: linear
seed: 322 stage: 2
text: bos_token_id: 1 eos_token_id: 2 hidden_act: silu hidden_size: 4096 initializer_range: 0.02 intermediate_size: 11008 max_position_embeddings: 2048 num_attention_heads: 32 num_hidden_layers: 32 pad_token_id: 0 path: /home/aiscuser/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93 rms_norm_eps: 1e-5 tie_word_embeddings: false use_cache: true vocab_size: 32000
transform: input_size: - 224 - 224 rand_aug: rand-m5-n2-mstd0.5-inc1
tune_im_patch: false tune_im_start: false tune_rgb_bk: false tune_rgb_pooler: false use_checkpoint: false wandb: false wd: 0.02 workers: 4 world_size: 1
[01/18 16:38:23 train]: Creating model
[01/18 16:38:33 train]: Data Length: 1227
[01/18 16:38:33 train]: Loading pretrained checkpoint from ../../Output/LHRS/stage3/checkpoints/FINAL.pt
[01/18 16:38:34 train]: Loading RGB encoder.
[01/18 16:38:35 train]: After loading RGB encoder: Missing: []. Unexpected: []
[01/18 16:38:35 train]: Loadding LoRA parameters.
[01/18 17:00:58 train]: result file saved to ../../Output/LHRS/stage3/zero_shot_vg/rsvg/eval_save_file.json
[01/18 17:00:58 train]: Accuracy: 71.94851330203444
[01/18 17:00:58 train]: Fail Sample: 0
[01/18 17:00:58 train]: Accuracy With Fail Sample: 71.94851330203444
```

DIOR-RSVG

```
[01/18 17:01:19 train]: Full config saved to ../../Output/LHRS/stage3/zero_shot_vg/rsvg_dior/config.json
[01/18 17:01:19 train]: accelerator: gpu adjust_norm: false alignment_dim: 768 batch_size: 1 bf16: true bits: 16 config: null
data_path: /home/aiscuser/pumpkin_dataset/InstructDataset/RSVG_DIOR_Image data_target: /home/aiscuser/pumpkin_dataset/Eval/VGEvalDataset/RSVG_DIOR_test.json
double_quant: true dtype: float16 enable_amp: true entity: pumpkinn epochs: 2 eval: dataset: AID fp16: false generate: false gpus: 0 inf_sampler: false is_distribute: false local_rank: 0
lora: enable: false lora_alpha: 16 lora_bias: none lora_dropout: 0.05 lora_r: 8
lr: 0.0003 max_grad_norm: 1.0 model_path: ../../Output/LHRS/stage3/checkpoints/FINAL.pt optimizer: adanp opts: null output: ../../Output/LHRS/stage3/zero_shot_vg/rsvg_dior project: MaskIndexNet prompt_template: llava_llama_2 quant_type: nf4 rank: 0
rgb_vision: arch: vit_large attn_pooler: num_attn_heads: 16 num_layers: 6 num_query: 144 input_patchnorm: false input_size: - 224 - 224 patch_dropout: 0.0 tune_pooler: true vit_name: openai/clip-vit-large-patch14
sar_vision: activate: sigmoid alpha: 0.2 arch: base branch_temp: 0.07 decoder: heads: 12 hidden_size: 768 layers: 12 mask_color: mean mask_ratio: 0.6 focal_gamma: 1.0 in_chans: 2 input_size: - 192 - 192 loss_weight: 1.0 n_queries: 256 online_temp: 0.1 reduction: none residual: false unmask_weight: 0.0 warmup_branch_temp: 0.04 warmup_branch_temp_epochs: 2
schedule: decay_epochs: 30 decay_rate: 0.1 gamma: 0.1 min_lr: 2.0e-05 multisteps: [] name: cosine warmup_epochs: 100 warmup_factor: 0.01 warmup_method: linear
seed: 322 stage: 2
text: bos_token_id: 1 eos_token_id: 2 hidden_act: silu hidden_size: 4096 initializer_range: 0.02 intermediate_size: 11008 max_position_embeddings: 2048 num_attention_heads: 32 num_hidden_layers: 32 pad_token_id: 0 path: /home/aiscuser/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93 rms_norm_eps: 1e-5 tie_word_embeddings: false use_cache: true vocab_size: 32000
transform: input_size: - 224 - 224 rand_aug: rand-m5-n2-mstd0.5-inc1
tune_im_patch: false tune_im_start: false tune_rgb_bk: false tune_rgb_pooler: false use_checkpoint: false wandb: false wd: 0.02 workers: 4 world_size: 1
[01/18 17:01:19 train]: Creating model
[01/18 17:01:29 train]: Data Length: 1813
[01/18 17:01:29 train]: Loading pretrained checkpoint from ../../Output/LHRS/stage3/checkpoints/FINAL.pt
[01/18 17:01:30 train]: Loading RGB encoder.
[01/18 17:01:30 train]: After loading RGB encoder: Missing: []. Unexpected: []
[01/18 17:01:30 train]: Loadding LoRA parameters.
[01/18 18:00:30 train]: result file saved to ../../Output/LHRS/stage3/zero_shot_vg/rsvg_dior/eval_save_file.json
[01/18 18:00:30 train]: Accuracy: 87.09759836484416
[01/18 18:00:30 train]: Fail Sample: 0
[01/18 18:00:30 train]: Accuracy With Fail Sample: 87.09759836484416
```

Moreover, we are also glad to provide the raw prediction results for your reference: rsvg_eval_save_file.json, dior_rsvg_eval_save_file.json.

Finally, all of our data and training scripts will be released soon~

xuliu-cyber commented 4 weeks ago

Thank you! I will check it.

xuliu-cyber commented 4 weeks ago

Hi, I find that the DIOR-RSVG test JSON file at https://huggingface.co/datasets/PumpkinCat/LHRS_Data/tree/main does not contain all the items of the original DIOR-RSVG test dataset (https://drive.google.com/drive/folders/1hTqtYsC6B-m4ED2ewx5oKuYZV13EoJp_?usp=sharing).

pUmpKin-Co commented 4 weeks ago

Thanks for reaching out! I will double-check when I have some bandwidth.

xuliu-cyber commented 4 weeks ago

OK. I found that the reproduced classification and VQA evaluation results match the paper; only the VG results differ.

pUmpKin-Co commented 3 weeks ago

Hi,

Sorry for the late reply.

I have checked the data and found that the issue is related to a mismatch between test.txt and the corresponding .xml annotations.

You will notice that many entries in test.txt do not have corresponding .xml annotation files. For example, entry 0 in test.txt has no associated 0.xml (or any similarly named) file.

We reformatted the annotations into text format based on the correspondence between test.txt and the .xml files. As a result, we only have 3,372 test samples.
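If you want to verify this on your copy of the data, a quick check along these lines should reproduce the count (the paths and the `<id>.xml` naming are placeholders for illustration):

```python
# Count how many entries in test.txt have a matching .xml annotation file.
# The directory layout and "<id>.xml" naming are assumptions for illustration.
import os

ANNOTATION_DIR = "DIOR-RSVG/Annotations"  # folder with the .xml annotation files
TEST_SPLIT = "DIOR-RSVG/test.txt"         # one sample id per line

with open(TEST_SPLIT) as f:
    ids = [line.strip() for line in f if line.strip()]

matched = [i for i in ids if os.path.exists(os.path.join(ANNOTATION_DIR, f"{i}.xml"))]

print(f"entries in test.txt : {len(ids)}")
print(f"with .xml annotation: {len(matched)}")
print(f"without annotation  : {len(ids) - len(matched)}")
```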

I hope this clarifies your concern.

xuliu-cyber commented 2 weeks ago

Thanks, I see where I went wrong!