Question about accuracy evaluation for the Visual Grounding task

YizhuoQ commented 4 months ago

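In Table 4 of the paper, when testing the grounding accuracy of Qwen-VL-Chat and MiniGPTv2, did you use the officially released pretrained models, or did you fine-tune the models separately on each of the two datasets? Also, is the script used to compute the Visual Grounding accuracy in Table 4 open-sourced?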

pUmpKin-Co commented 4 months ago

Hello.

The models were tested directly via prompting. The script is open-sourced; please refer to main_vg.py.

YizhuoQ commented 3 months ago

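Hello, and sorry for the very late reply. I still have not fully understood the accuracy evaluation for the VG task discussed above. Based on main_vg.py and the raw prediction result dior_rsvg_eval_save_file.json that you provided in #issue18, I wrote a verification script, but it outputs only 85.36, which seems to differ somewhat from the result in Table 4. I don't understand why this happens. Is it caused by randomness in the model output at test time? I would appreciate your help. My verification script and its output are below: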

import json
import re

def calculate_iou(box1, box2):
    """
    Calculate IoU between two horizontal bounding boxes (HBB).
    """
    x1, y1, x2, y2 = box1
    x3, y3, x4, y4 = box2

    intersection_x1 = max(x1, x3)
    intersection_y1 = max(y1, y3)
    intersection_x2 = min(x2, x4)
    intersection_y2 = min(y2, y4)

    intersection_area = max(0, intersection_x2 - intersection_x1 + 1) * max(
        0, intersection_y2 - intersection_y1 + 1
    )

    box1_area = (x2 - x1 + 1) * (y2 - y1 + 1)
    box2_area = (x4 - x3 + 1) * (y4 - y3 + 1)

    union_area = box1_area + box2_area - intersection_area

    iou = intersection_area / union_area

    return iou

if __name__ == "__main__":
    answers_file = 'E:/Code/dior_rsvg_eval_save_file.json'
    pattern = r"\[([0-9., ]+)\]"
    with open(answers_file) as f:
        predictions = json.load(f)

    parse_result = []
    fail_instance = 0

    for item in predictions:
        pred_match = re.findall(pattern, item["pred"])
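        if len(pred_match) == 0:
            # No bounding box found in the prediction: count it as a failure and skip.
            fail_instance += 1
            continue

        try:
            pred_result = [list(map(float, match.split(","))) for match in pred_match]
        except ValueError:
            # The matched text could not be parsed as floats.
            fail_instance += 1
            continue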

        target_match = re.findall(pattern, item["target"])
        target_result = [list(map(float, match.split(","))) for match in target_match]

        new_pred_result = []
        new_target_result = []

        for pred, target in zip(pred_result, target_result):
            if len(pred) == 4:
                new_pred_result.append(pred)
                new_target_result.append(target)
            elif len(pred) > 4:
                while len(pred) != 4:
                    pred.pop()
                new_pred_result.append(pred)
                new_target_result.append(target)
            else:
                fail_instance += 1

        if len(new_pred_result) > 0:
            parse_result.append(
                dict(
                    filename=item["filename"],
                    pred=new_pred_result,
                    target=new_target_result,
                )
            )

    count = 0
    total = 0

    for item in parse_result:
        preds = item["pred"]
        targets = item["target"]

        for pred, target in zip(preds, targets):
            iou_score = calculate_iou(pred, target)
            if iou_score > 0.5:
                count += 1
            total += 1

    print(f"Accuracy: {count / total * 100:.2f}%")
    print(f"Fail Sample: {fail_instance}")
    print(f"Accuracy With Fail Sample: {count / (total + fail_instance) * 100:.2f}%")

Output:

Accuracy: 85.38579458354624
Fail Sample: 0
Accuracy With Fail Sample: 85.38579458354624

YizhuoQ commented 3 months ago

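I have successfully reproduced the result given in Table 4 of the paper using the Stage3 checkpoint and the LHRS RSVG Test Data. Thanks for open-sourcing this! The full output is as follows: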

[2024-08-26 15:00:10,020] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
Not using distributed mode.
accelerator: gpu
adjust_norm: false
alignment_dim: 768
batch_size: 1
bf16: true
bits: 16
config: null
data_path: /datasets/DIOR-RSVG/Images
data_target: /workspace/mllm-code/eval-vg/data/LHRS_Data/RSVG/Test/RSVG_DIOR_test.json
double_quant: true
dtype: float16
enable_amp: true
entity: pumpkinn
epochs: 2
eval:
  dataset: AID
fp16: false
generate: false
gpus: 0
inf_sampler: false
is_distribute: false
local_rank: 0
lora:
  enable: false
  lora_alpha: 256
  lora_bias: none
  lora_dropout: 0.05
  lora_r: 128
lr: 0.0002
max_grad_norm: 0.3
model_path: checkpoint/stage3/FINAL.pt
optimizer: adanp
opts: null
output: output
project: MaskIndexNet
prompt_template: llava_llama_2
quant_type: nf4
rank: 0
rgb_vision:
  arch: vit_large
  attn_pooler:
    num_attn_heads: 16
    num_layers: 6
    num_query: 144
  input_patchnorm: false
  input_size:
  - 224
  - 224
  patch_dropout: 0.0
  tune_pooler: true
  vit_name: openai/clip-vit-large-patch14
sar_vision:
  activate: sigmoid
  alpha: 0.2
  arch: base
  branch_temp: 0.07
  decoder:
    heads: 12
    hidden_size: 768
    layers: 12
    mask_color: mean
    mask_ratio: 0.6
  focal_gamma: 1.0
  in_chans: 2
  input_size:
  - 192
  - 192
  loss_weight: 1.0
  n_queries: 256
  online_temp: 0.1
  reduction: none
  residual: false
  unmask_weight: 0.0
  warmup_branch_temp: 0.04
  warmup_branch_temp_epochs: 2
schedule:
  decay_epochs: 30
  decay_rate: 0.1
  gamma: 0.1
  min_lr: 2.0e-05
  multisteps: []
  name: cosine
  warmup_epochs: 100
  warmup_factor: 0.01
  warmup_method: linear
seed: 322
stage: 0
text:
  bos_token_id: 1
  eos_token_id: 2
  hidden_act: silu
  hidden_size: 4096
  initializer_range: 0.02
  intermediate_size: 11008
  max_position_embeddings: 2048
  num_attention_heads: 32
  num_hidden_layers: 32
  pad_token_id: 0
  path: /huggingface/models/Llama-2-7b-chat-hf
  rms_norm_eps: 1e-5
  tie_word_embeddings: false
  use_cache: true
  vocab_size: 32000
transform:
  input_size:
  - 224
  - 224
  rand_aug: rand-m5-n2-mstd0.5-inc1
tune_im_patch: false
tune_im_start: false
tune_rgb_bk: false
tune_rgb_pooler: false
use_checkpoint: false
wandb: false
wd: 0.0
workers: 2
world_size: 1

[08/26 15:00:13 train]: Full config saved to output/config.json
[08/26 15:00:13 train]: Creating model
/opt/conda/envs/lhrs/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.03s/it]
3372
[08/26 15:00:20 train]: Data Length: 3372
[08/26 15:00:20 train]: Loading pretrained checkpoint from checkpoint/stage3/FINAL.pt
[08/26 15:00:20 train]: Loading RGB encoder.
[08/26 15:00:20 train]: After loading RGB encoder: Missing: []. Unexpected: []
[08/26 15:00:20 train]: Loadding LoRA parameters.
Evaluating: 100%|████████████████████████████████████████████████████████████████| 3.37k/3.37k [1:59:53<00:00, 2.13s/it]
[08/26 17:00:19 train]: result file saved to output/eval_save_file.json
[08/26 17:00:19 train]: Count: 6211
[08/26 17:00:19 train]: Total: 6973
[08/26 17:00:19 train]: Accuracy: 89.07213537932024
[08/26 17:00:19 train]: Fail Sample: 8
[08/26 17:00:19 train]: Accuracy With Fail Sample: 88.97006159575992
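
As a sanity check, the two accuracy figures in the log above can be reproduced from the reported Count, Total, and Fail Sample values. This is only a minimal sketch: it assumes main_vg.py uses the same formulas as the verification script posted earlier in this thread (count / total, and count / (total + fail_instance)), and the variable names are illustrative.

```python
# Values copied from the evaluation log above.
count = 6211        # predictions with IoU above the 0.5 threshold
total = 6973        # predictions that were parsed and scored
fail_instance = 8   # predictions that could not be parsed

# Assumed formulas, mirroring the verification script in this thread.
accuracy = count / total * 100
accuracy_with_fail = count / (total + fail_instance) * 100

print(f"Accuracy: {accuracy:.2f}")                           # 89.07
print(f"Accuracy With Fail Sample: {accuracy_with_fail:.2f}")  # 88.97
```

Counting unparsable predictions as wrong only enlarges the denominator, which is why the second figure is slightly lower.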

zackhxn commented 2 months ago

A question about the IoU computation: in the function calculate_iou, why is +1 added in the following lines?

intersection_area = max(0, intersection_x2 - intersection_x1 + 1) * max(
    0, intersection_y2 - intersection_y1 + 1
)
box1_area = (x2 - x1 + 1) * (y2 - y1 + 1)
box2_area = (x4 - x3 + 1) * (y4 - y3 + 1)

Aren't the boxes passed in already normalized? Wouldn't adding 1 make the computed result wrong?

YizhuoQ commented 1 month ago

Hello, the calculate_iou function in my code is taken from main_vg.py.

pUmpKin-Co commented 1 month ago

Hello, thanks for pointing this out. Please see #27.
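
To illustrate the concern raised in this thread, the sketch below compares IoU with and without the +1 term. The +1 is a pixel-inclusive convention suited to integer pixel coordinates (a box from 0 to 9 covers 10 pixels). The boxes here are hypothetical and assumed to be normalized to [0, 1], as the question states; the function name `iou` and its `offset` parameter are illustrative and are not taken from main_vg.py or from #27.

```python
def iou(box_a, box_b, offset=0.0):
    """IoU of two [x1, y1, x2, y2] boxes.

    offset=1.0 mimics the pixel-inclusive '+1' convention;
    offset=0.0 treats coordinates as continuous (e.g. normalized).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1) + offset)
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1) + offset)
    inter = inter_w * inter_h

    area_a = (ax2 - ax1 + offset) * (ay2 - ay1 + offset)
    area_b = (bx2 - bx1 + offset) * (by2 - by1 + offset)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical normalized boxes that overlap only slightly.
pred = [0.1, 0.1, 0.3, 0.3]
target = [0.2, 0.2, 0.4, 0.4]

iou_plain = iou(pred, target)                  # about 0.143
iou_plus_one = iou(pred, target, offset=1.0)   # about 0.725

print(f"IoU without +1: {iou_plain:.3f}")
print(f"IoU with +1:    {iou_plus_one:.3f}")
```

Under these assumptions the same pair of boxes falls below the 0.5 threshold without the +1 and above it with the +1, so the offset can inflate accuracy when coordinates are normalized.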

YizhuoQ commented 1 month ago

Thanks for pointing out this error, and thanks to the author for the reply.

NJU-LHRS / LHRS-Bot

Question about accuracy evaluation for the Visual Grounding task #20