The knowledge used for COCO and PASCAL VOC is transferred from VG. Your concern can be explained as follows:
We use the frequency distribution as a regularization term, so the learned region-to-region graph does not deviate from human commonsense knowledge, yet it is still individualized according to the image context (a rough sketch of this idea is given below).
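As a rough illustration of that idea (not the exact implementation from the paper), the following PyTorch-style sketch regularizes a learned region-to-region graph toward a prior frequency matrix; the names `learned_logits`, `prior_freq`, and `graph_regularizer` are hypothetical and the prior is assumed to be row-normalizable statistics taken from VG:

```python
import torch
import torch.nn.functional as F

def graph_regularizer(learned_logits, prior_freq, eps=1e-8):
    """KL divergence between the prior graph and the learned graph.

    learned_logits: (R, R) unnormalized edge scores of the learned
                    region-to-region graph for one image.
    prior_freq:     (R, R) frequency statistics taken from the external
                    knowledge source (assumed here to come from VG),
                    indexed by the categories of the regions.
    """
    learned_log_prob = F.log_softmax(learned_logits, dim=-1)
    prior_prob = prior_freq / (prior_freq.sum(-1, keepdim=True) + eps)
    # Penalize edges that drift far from the commonsense prior, while the
    # image-specific logits are still free to reshape the graph per image.
    return F.kl_div(learned_log_prob, prior_prob, reduction="batchmean")

# Hypothetical usage: total loss = detection_loss + lam * graph_regularizer(...)
```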
The annotations of VG cover a wide variety of scenes, and our results show that knowledge transferred from VG benefits both COCO and VOC. We use the same knowledge but a different method for the detection task in our CVPR 2019 work; the paper and code are coming soon.
Questions: 1) How do you obtain explicit linguistic knowledge for COCO and PASCAL VOC? 2) If it is transferred from the Visual Genome dataset, how do you handle exceptions? Take the fruit image in the paper as an example: oranges are not always orange, and bananas are not always in a bowl, right?