BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0
924 stars, 69 forks

question about s2 #89

Closed by zezeze97 4 months ago

zezeze97 commented 5 months ago

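Hello, while training with your code I found that using S2 to increase the image input resolution makes training 4x slower than training without S2. Is this normal?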

Isaachhh commented 5 months ago

With S^2-Wrapper and a 3x scale-up, each image is encoded 1 + 4 + 9 = 14 times. However, it only causes a 23% increase in training time on my device (8x A800).

berry-ding commented 5 months ago

@Isaachhh What does the 4 you mentioned stand for? I understand that 1 is the global information and 9 is the 3*3 local crops.

Isaachhh commented 5 months ago

Please refer to here.

img_sizes=[384,768,1152].

Besides the global encoding, the image is resized to 768x768 and split into 4 sub-images, and resized to 1152x1152 and split into 9 sub-images.

You can find more information in the original paper.
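
As a rough illustration of the counting described above, here is a minimal Python sketch. It assumes each scale is cut into non-overlapping tiles of the base size (384 here); the helper name `s2_crop_counts` and its `base_size` parameter are hypothetical and not part of the Bunny or S2 code.

```python
def s2_crop_counts(img_sizes, base_size=384):
    """Count the tiles the vision encoder sees at each scale.

    Assumption: every scale is an exact multiple of base_size and is
    split into a square grid of base_size x base_size tiles.
    """
    counts = []
    for size in img_sizes:
        if size % base_size != 0:
            raise ValueError(f"{size} is not a multiple of {base_size}")
        tiles_per_side = size // base_size
        counts.append(tiles_per_side * tiles_per_side)
    return counts


counts = s2_crop_counts([384, 768, 1152])
print(counts)       # [1, 4, 9]
print(sum(counts))  # 14
```

So the 4 comes from the 2x2 grid at 768x768, and the 9 from the 3x3 grid at 1152x1152.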

zezeze97 commented 4 months ago

Is the number of image tokens 729?

Isaachhh commented 4 months ago

Is the number of image tokens 729?

For SigLIP, yes.
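
The 729 figure can be checked with a short calculation. This sketch assumes a SigLIP encoder with 384x384 input and a patch size of 14 (the patch size is not stated in this thread); `vit_token_count` is a hypothetical helper for illustration only.

```python
def vit_token_count(image_size, patch_size):
    """Number of patch tokens for a square image.

    Assumption: patches do not overlap and any leftover border pixels
    are dropped, so the per-side count is a floor division.
    """
    per_side = image_size // patch_size
    return per_side * per_side


print(vit_token_count(384, 14))  # 729 (27 patches per side)
```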