OpenGVLab / all-seeing

[ICLR 2024] This is the official implementation of the paper "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World"
https://huggingface.co/spaces/OpenGVLab/all-seeing

Special tokens #9

Closed KooSung closed 4 months ago

KooSung commented 4 months ago

Nice work! Why didn't all-seeing-v2 add `<ref>`, `<box>`, etc. to the special tokens?

Weiyun1025 commented 4 months ago

Thank you for your interest in our project.

Early experimental results indicate that adding special tokens such as `<ref>`, `<box>`, and `<rel>` has only a minor impact on performance. Therefore, to keep things simple, we decided not to add any special tokens.
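
For reference, registering such tokens would usually follow the standard HuggingFace pattern sketched below (the checkpoint path is a placeholder, not ASMv2's actual configuration, and this is only an illustration of what "adding special tokens" would involve):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder checkpoint path; not the actual ASMv2 setup.
tokenizer = AutoTokenizer.from_pretrained("path/to/base-llm")
model = AutoModelForCausalLM.from_pretrained("path/to/base-llm")

# Register <ref>, <box>, <rel> (and closing tags) as special tokens
# so the tokenizer never splits them into sub-word pieces.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<ref>", "</ref>", "<box>", "</box>", "<rel>", "</rel>"]}
)

# Grow the embedding matrix to cover the newly added token ids.
model.resize_token_embeddings(len(tokenizer))
```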

KooSung commented 4 months ago

@Weiyun1025 Thanks. Another question: when training a regular detection model, the bboxes have to be adjusted according to the image preprocessing, so why is it enough to simply normalize the bboxes to 0-1000 (or use square_pad) during LLM training? Qwen-VL does the same, but the reason is not explained.

Weiyun1025 commented 4 months ago

Adjusting the bboxes is necessary when data augmentation is utilized. However, we do not use any data augmentation except for image flipping, for which we preprocess the bboxes offline.
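
A minimal sketch of what such offline flip preprocessing could look like (hypothetical helper, assuming pixel-space `[x_min, y_min, x_max, y_max]` boxes):

```python
def hflip_box(box, image_width):
    """Mirror a pixel-space [x_min, y_min, x_max, y_max] box horizontally."""
    x_min, y_min, x_max, y_max = box
    return [image_width - x_max, y_min, image_width - x_min, y_max]

# Example: a box in a 336-pixel-wide image.
print(hflip_box([10, 20, 110, 220], image_width=336))  # [226, 20, 326, 220]
```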

For the second question, since the input size of ASMv2 is only 336x336, a scale of 1000 is large enough. If the input size were scaled up to 2000x2000, it might be necessary to enlarge the scale.
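
To make the scale argument concrete: with a 336x336 input, each of the 1000 bins spans about 0.34 px, so the quantization error stays well below one pixel; at 2000x2000 a bin would span 2 px, which is why a larger scale could become necessary. A hedged sketch of the usual 0-1000 normalization (hypothetical helpers, assuming the image has already been made square, e.g. via square_pad):

```python
def normalize_box(box, image_size, scale=1000):
    """Map a pixel-space [x_min, y_min, x_max, y_max] box onto a 0..scale integer grid."""
    return [round(coord / image_size * scale) for coord in box]

def denormalize_box(box, image_size, scale=1000):
    """Map a 0..scale box back to pixel coordinates."""
    return [coord / scale * image_size for coord in box]

box = [10, 20, 110, 220]
norm = normalize_box(box, image_size=336)          # [30, 60, 327, 655]
recovered = denormalize_box(norm, image_size=336)  # within ~0.17 px of the original
```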