This repository contains the data loader and evaluation code for Ref-L4, a new REC benchmark for the LMM era. Check out our paper Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models for more details.
Referring expression comprehension (REC) involves localizing a target based on a textual description. Recent advancements with large multimodal models (LMMs) like CogVLM have achieved high accuracy (92.44% on RefCOCO). However, existing benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have high labeling error rates (14%, 24%, and 5% respectively), undermining evaluations. We address this by excluding problematic instances and reevaluating LMMs, showing significant accuracy improvements. We also introduce Ref-L4, a new REC benchmark with:
In the REC task, a referring expression should uniquely describe an instance, which is represented by an accurate bounding box. We have identified and visualized three common types of labeling errors in the RefCOCO, RefCOCO+, and RefCOCOg benchmarks: (a) non-unique referring expressions, which refer to multiple instances within the same image; (b) inaccurate bounding boxes; and (c) misalignment between target instances and their referring expressions, where the referring expressions are either ambiguous or do not refer to any instance in the image.
git clone https://github.com/JierunChen/Ref-L4.git
cd Ref-L4
pip install ./
from ref_l4 import RefL4Dataset
ref_l4_dataset = RefL4Dataset("JierunChen/Ref-L4", split='all')
print(len(ref_l4_dataset))
# iterate over the dataset
for img_pil, data in ref_l4_dataset:
    print(data)
    break
# switch to the 'val' or 'test' split
ref_l4_dataset.change_split('val')
print(len(ref_l4_dataset))
The Ref-L4 dataset can also be downloaded from Hugging Face.
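If you prefer to fetch the raw dataset files yourself rather than letting `RefL4Dataset` handle it, here is a minimal sketch using the standard `huggingface_hub` client (the local cache location is whatever your Hugging Face configuration uses):

```python
from huggingface_hub import snapshot_download

# Download the Ref-L4 dataset repository into the local Hugging Face cache.
local_dir = snapshot_download(repo_id="JierunChen/Ref-L4", repo_type="dataset")
print(local_dir)
```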
After getting model predictions, we can save them as a JSON file, which is a list of dictionaries, each containing three keys: `id`, `pred_bbox`, and `format`. The `id` is the annotation ID, and the `format` specifies the format of `pred_bbox`, which should be either `xyxy` or `xywh`.
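As a sketch of the expected structure (the annotation IDs and boxes below are made up purely for illustration), a prediction file could be assembled like this:

```python
import json

# Hypothetical predictions: each entry pairs an annotation ID with a predicted box
# and declares the box format ("xyxy" = [x1, y1, x2, y2], "xywh" = [x, y, w, h]).
predictions = [
    {"id": 1, "pred_bbox": [48.0, 32.5, 211.0, 180.0], "format": "xyxy"},
    {"id": 2, "pred_bbox": [10.0, 20.0, 150.0, 90.0], "format": "xywh"},
]

with open("my_model_pred.json", "w") as f:
    json.dump(predictions, f)
```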
A sample prediction file can be found here. It is obtained by running the CogVLM-Grounding model.
Now we can run the following command to evaluate the predictions against the ground truth.
python evaluate_pred.py --dataset_path JierunChen/Ref-L4 --split all --pred_json_path ./demo_models/cogvlm_grounding_pred.json
Here is an example output for the predictions from CogVLM-Grounding on the "all" split:
Item for split all | Value
--------------------------------------------------------------------------------
Ann-level acc iou 0.5 | 81.69868331091065
Ann-level acc iou 0.75 | 70.76597340155709
Ann-level acc iou 0.9 | 48.351381751615534
Ann-level macc iou 0.5:0.95 | 66.08808804393375
Ann-level accs for copy | 81.7, 70.77, 48.35, 66.09
--------------------------------------------------------------------------------
Small acc iou 0.5 | 75.0561797752809
Small macc iou 0.5:0.95 | 52.853932584269664
Medium acc iou 0.5 | 86.43470790378007
Medium macc iou 0.5:0.95 | 71.31099656357388
Large acc iou 0.5 | 77.90972003774772
Large macc iou 0.5:0.95 | 66.25196602705253
Size level accs for copy | 75.06, 52.85, 86.43, 71.31, 77.91, 66.25
--------------------------------------------------------------------------------
Average class-level acc iou 0.5 | 72.42029130234654
Average class-level macc iou 0.5:0.95 | 52.56151912966669
Avg class-level accs for copy | 72.42, 52.56
The output includes three parts: annotation-level accuracies at different IoU thresholds, size-level accuracies for small, medium, and large instances, and average class-level accuracies.
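For intuition, here is a simplified sketch of how accuracy at a single IoU threshold could be computed; the `evaluate_pred.py` script is the authoritative implementation, and the "macc iou 0.5:0.95" rows presumably average this accuracy over thresholds from 0.5 to 0.95:

```python
def box_iou_xyxy(a, b):
    """IoU between two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def acc_at_iou(pred_boxes, gt_boxes, thr=0.5):
    """Percentage of predictions whose IoU with the ground-truth box reaches thr."""
    hits = [box_iou_xyxy(p, g) >= thr for p, g in zip(pred_boxes, gt_boxes)]
    return 100.0 * sum(hits) / len(hits)
```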
The Ref-L4 dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Please note that the images in the Ref-L4 dataset are derived from the following datasets, each with their respective licenses:
By using the Ref-L4 dataset, you agree to comply with the licensing terms of these source datasets.