Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Potential data contamination in SPHINX #111

Open iojasdsda opened 10 months ago

iojasdsda commented 10 months ago

The SPHINX model leverages COCO object detection, pose estimation, and LVIS object detection annotations during fine-tuning. Notably, this raises concerns about potential data contamination for test sets such as RefCOCO and POPE, which are built on COCO images. If a model is trained on the entire COCO train2017 and LVIS detection datasets, whose images, object labels, and bounding boxes overlap with some test sets (e.g., RefCOCO and POPE), its performance on those test sets may be inflated by that overlap. For instance, in our observations, training on the COCO train2017 detection data noticeably increased accuracy on RefCOCO and the F1 score on POPE. Could you clarify whether the overlapping data is identified and removed during training?
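One way to run this kind of contamination check is to compare image IDs between the COCO-format training annotations and a test split. The sketch below is an illustration under assumed inputs (the toy IDs and the loading helper are not SPHINX's actual pipeline; real COCO image IDs are integers like 397133):

```python
# Minimal sketch of an image-ID overlap check between a COCO-style
# training split and a test split. File paths and toy IDs are
# illustrative assumptions, not part of the SPHINX codebase.
import json

def load_image_ids(annotation_path):
    """Read a COCO-format annotation file and return its image IDs."""
    with open(annotation_path) as f:
        ann = json.load(f)
    return {img["id"] for img in ann["images"]}

def find_overlap(train_ids, test_ids):
    """Image IDs present in both splits: candidates for removal."""
    return set(train_ids) & set(test_ids)

# Toy IDs for demonstration.
train_ids = {1, 2, 3, 4, 5}
refcoco_test_ids = {4, 5, 6}

overlap = find_overlap(train_ids, refcoco_test_ids)
clean_train_ids = train_ids - overlap  # drop contaminated images
```

Filtering the training set to `clean_train_ids` before fine-tuning would remove the overlap this issue describes.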

gaopengpjlab commented 10 months ago

Thanks for your kind reminder.

SPHINX-V1 does not incorporate the COCO/LVIS datasets during SFT. Please try the following checkpoint to replicate the numbers reported in our paper and verify that no COCO/LVIS data is included: https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/SPHINX/SPHINX-1k

SPHINX-V2 uses the COCO/LVIS datasets: https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/SPHINX/SPHINX-v2-1k

iojasdsda commented 10 months ago

Thanks for your reply. In the SPHINX paper (https://arxiv.org/pdf/2311.07575.pdf), you mentioned: "We introduce abundant general object detection and pose estimation datasets, such as COCO (Lin et al., 2014) and LVIS (Gupta et al., 2019) to inspire the model’s capabilities of localization, classification, and human pose estimation." Does this paper present SPHINX-V2?