This repository contains the implementation of the paper:
Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases
Kai Chen*, Yanze Li*, Wenhua Zhang*, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong†, Meng Tian, Xinhai Zhao, Zhenguo Li, Dit-Yan Yeung, Huchuan Lu, Xu Jia†
*Equal Contribution, †Corresponding Authors
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025
The instructions for downloading CODA-LM are listed as follows:
Split | Size | Image Source | Original Format | LLaVA Format |
---|---|---|---|---|
Train | 4884 | CODA2022 val | HF Hub | HF Hub |
Val | 4384 | CODA2022 test | HF Hub | HF Hub |
Test | 500 | CODA2022 test | HF Hub | HF Hub |
Mini | 50 | CODA2022 test | HF Hub | HF Hub |
Note that after decompression, the data is organized as follows:
```
├── val                     -- CODA2022 val (we only use images)
│   └── images
│       └── *.jpg
├── test                    -- CODA2022 test (we only use images)
│   └── images
│       └── *.jpg
└── CODA-LM
    ├── Train               -- CODA-LM train (we use 4884 images from CODA2022 val)
    │   └── val_*.json
    ├── Val                 -- CODA-LM val (we use 4384 images from CODA2022 test)
    │   └── test_*.json
    ├── Test                -- CODA-LM test (we use 500 images from CODA2022 test)
    │   └── test_*.json
    └── Mini                -- CODA-LM mini (a 50-image subset of CODA-LM val)
        └── test_*.json
```
Each annotation file contains question-answering pairs for all three tasks, as follows:
```
{
    "general_perception": {
        "vehicles": [                           -- list containing information on all vehicles
            {
                "description": <str>,           -- description of a single vehicle
                "explanation": <str>            -- explanation of why it affects the ego car
            },
            ...
        ],
        "vulnerable_road_users": [...],         -- list containing information on all VRUs
        "traffic signs": [...],                 -- list containing information on all traffic signs
        "traffic lights": [...],                -- list containing information on all traffic lights
        "traffic cones": [...],                 -- list containing information on all traffic cones
        "barriers": [...],                      -- list containing information on all barriers
        "other objects": [...],                 -- list containing information on all other objects
        "description and explanation": <str>    -- summarization of information on all categories
    },
    "region_perception": {
        "1": {                                  -- region index
            "description and explanation": <str>,  -- description of road users in this region, with an explanation of why they affect the ego car
            "box": <list of float>,             -- xywh coordinates
            "category_name": <str>              -- object category
        },
        "2": {...},
        "3": {...}
    },
    "driving_suggestion": <str>
}
```
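For reference, below is a minimal sketch of reading one such annotation file with Python's standard `json` module; the field accesses follow the schema above, while the concrete file path is only illustrative.

```python
import json

# Illustrative path: any annotation file under CODA-LM/<split>/
ann_path = "CODA-LM/Val/test_0001.json"  # hypothetical filename

with open(ann_path, "r", encoding="utf-8") as f:
    ann = json.load(f)

# General perception: per-category object lists plus an overall summary
general = ann["general_perception"]
for vehicle in general["vehicles"]:
    print(vehicle["description"], "->", vehicle["explanation"])
print(general["description and explanation"])

# Region perception: keyed by region index, each entry carries an xywh box
for region_id, region in ann["region_perception"].items():
    x, y, w, h = region["box"]
    print(region_id, region["category_name"], (x, y, w, h))

# Driving suggestion: a single free-form string
print(ann["driving_suggestion"])
```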
To better facilitate training LVLMs with CODA-LM, we further organize the CODA-LM data in the LLaVA data format, so that CODA-LM can be used directly through the HuggingFace `datasets` API.
Install the HuggingFace `datasets` dependency via pip:

```bash
pip install datasets
```
Download and load the specified subsets and splits of CODA-LM. Note that by default, we adopt the red rectangle prompt for the regional perception task.
```python
from datasets import load_dataset

# name can be selected from ['English', 'Chinese']
# split can be selected from ['Mini', 'Train', 'Val', 'Test']
dataset = load_dataset("KaiChen1998/coda-lm-llava-format", name="English", split="Train")

# each sample should be a dictionary containing
# {"id": sample identification, "image": PIL Image, "conversations": conversations with the <image> token}
for data in dataset:
    print(data)
```
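As a quick usage sketch (field names follow the dictionary above; the output filename is only illustrative), a single sample can be inspected and its image saved like this:

```python
sample = dataset[0]
print(sample["id"])

# 'conversations' holds the LLaVA-format dialogue turns, including the <image> token
for turn in sample["conversations"]:
    print(turn)

# 'image' is a PIL Image and can be saved directly
sample["image"].save("sample_0.jpg")  # illustrative output path
```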
To help users better understand the structure of CODA-LM, we provide a Python script that converts our annotations into a basic VQA format. Download the data, make sure the directory organization follows Data Preparation, and then run `convert2vqa.py` as follows:
```bash
# English
python convert2vqa.py --coda_root $CODA_ROOT --codalm_ann_name CODA-LM
# Chinese
python convert2vqa.py --coda_root $CODA_ROOT --codalm_ann_name CODA-LM-chinese
```
After that, the resulting data organization will be like this:
```
├── val
│   ├── images
│   └── images_w_bboxes     -- images with bboxes drawn for region perception
│       └── *.jpg
├── test
│   ├── images
│   └── images_w_bboxes     -- images with bboxes drawn for region perception
│       └── *.jpg
└── CODA-LM
    ├── Train
    │   └── vqa_anno
    │       ├── general_perception.jsonl    -- VQA annotations for general perception
    │       ├── region_perception.jsonl     -- VQA annotations for region perception
    │       └── driving_suggestion.jsonl    -- VQA annotations for driving suggestion
    ├── Val
    │   └── vqa_anno
    ├── Test
    │   └── vqa_anno
    └── Mini
        └── vqa_anno
```
The basic VQA format saves each data sample as a dictionary containing `question_id`, `image`, `question`, and `answer`, as follows:
{"question_id": 0, "image": val/images/0001.jpg, "question": <str>, "answer": <str>}
{"question_id": 1, "image": val/images/0002.jpg, "question": <str>, "answer": <str>}
{"question_id": 2, "image": val/images/0003.jpg, "question": <str>, "answer": <str>}
...
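For illustration, here is a minimal sketch of loading one of the resulting JSONL files, assuming the `image` paths are relative to the dataset root used for `convert2vqa.py` (the concrete root path below is a placeholder):

```python
import json
import os

CODA_ROOT = "/path/to/CODA_ROOT"  # placeholder; the same root passed to convert2vqa.py

# Load one task's VQA annotations, e.g. general perception on the Train split
ann_file = os.path.join(CODA_ROOT, "CODA-LM/Train/vqa_anno/general_perception.jsonl")
with open(ann_file, "r", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

# Each record holds question_id, a relative image path, question, and answer
first = samples[0]
print(first["question_id"], os.path.join(CODA_ROOT, first["image"]))
print(first["question"])
print(first["answer"])
```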
Note that for region perception, there are various possible ways to utilize the bbox annotations. Here we provide a simple implementation that draws the bboxes as red rectangles on the images, which are saved in the `images_w_bboxes` directory.
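The script already produces these images; the following is only a minimal sketch of the underlying idea using PIL, assuming xywh boxes as in the annotation schema (the helper name and the example paths/box are hypothetical):

```python
from PIL import Image, ImageDraw

def draw_red_bbox(image_path, box, out_path, width=5):
    """Draw a single xywh bounding box as a red rectangle and save the result."""
    x, y, w, h = box
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle([x, y, x + w, y + h], outline="red", width=width)
    image.save(out_path)

# Hypothetical usage with a box taken from a region_perception entry
draw_red_bbox("val/images/0001.jpg", [100.0, 200.0, 50.0, 80.0], "val/images_w_bboxes/0001.jpg")
```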
Check CODA-LM Annotation Tool for more details.
Check CODA-LM Evaluation for more details.
```bibtex
@article{li2024automated,
  title={Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases},
  author={Li, Yanze and Zhang, Wenhua and Chen, Kai and Liu, Yanxin and Li, Pengxiang and Gao, Ruiyuan and Hong, Lanqing and Tian, Meng and Zhao, Xinhai and Li, Zhenguo and others},
  journal={arXiv preprint arXiv:2404.10595},
  year={2024}
}
```