The knowledge used for COCO and PASCAL VOC is transferred from VG. Your concern can be explained as follows:
We use the frequency distribution as a regularization term, so the learned region-to-region graph does not deviate from human commonsense knowledge, yet it is still individualized according to the image context (a rough sketch of this idea is given below).
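As a rough illustration of that idea (not the exact implementation from the paper), the following PyTorch-style sketch regularizes a learned region-to-region graph toward a prior frequency matrix; the names `learned_logits`, `prior_freq`, and `graph_regularizer` are hypothetical and the prior is assumed to be row-normalizable statistics taken from VG:

```python
import torch
import torch.nn.functional as F

def graph_regularizer(learned_logits, prior_freq, eps=1e-8):
    """KL divergence between the prior graph and the learned graph.

    learned_logits: (R, R) unnormalized edge scores of the learned
                    region-to-region graph for one image.
    prior_freq:     (R, R) frequency statistics taken from the external
                    knowledge source (assumed here to come from VG),
                    indexed by the categories of the regions.
    """
    learned_log_prob = F.log_softmax(learned_logits, dim=-1)
    prior_prob = prior_freq / (prior_freq.sum(-1, keepdim=True) + eps)
    # Penalize edges that drift far from the commonsense prior, while the
    # image-specific logits are still free to reshape the graph per image.
    return F.kl_div(learned_log_prob, prior_prob, reduction="batchmean")

# Hypothetical usage: total loss = detection_loss + lam * graph_regularizer(...)
```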
The annotations of VG cover a wide variety of scenes, and our results show that knowledge transferred from VG benefits both COCO and VOC. We use the same knowledge but a different method for the detection task in our CVPR 2019 work; the paper and code are coming soon.
Questions: 1) How do you obtain explicit linguistic knowledge for COCO and PASCAL VOC? 2) If it is transferred from the Visual Genome dataset, how do you handle exceptions? Take the fruit image in the paper as an example: oranges are not always orange, and bananas are not always in a bowl, right?