Closed by zezeze97 4 months ago
With the S^2-Wrapper and a 3x scale-up, each image is encoded 1 + 4 + 9 = 14 times. However, this causes only a 23% increase in training time on my device (8*A800).
@Isaachhh What does the 4 you mentioned stand for? I understand that 1 is the global information and 9 is the 3*3 local information.
Please refer to here.
img_sizes=[384,768,1152]
Besides the global encoding, the image is resized to 768x768 and split into 4 sub-images (2x2), and resized to 1152x1152 and split into 9 sub-images (3x3).
You can find more information in the original paper.
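To make the 1 + 4 + 9 arithmetic concrete, here is a minimal sketch that counts the encoder passes per scale. It assumes every sub-image is the size of the smallest entry in img_sizes (384 here); the function name s2_num_encoder_passes and the split_size parameter are hypothetical and not taken from the repository.

```python
def s2_num_encoder_passes(img_sizes, split_size=None):
    """Count how many crops the vision encoder sees at each scale.

    Assumption: each scale is cut into square tiles of `split_size`
    pixels per side, defaulting to the smallest scale.
    """
    base = split_size if split_size is not None else min(img_sizes)
    passes = []
    for size in img_sizes:
        if size % base != 0:
            raise ValueError(f"{size} is not a multiple of {base}")
        tiles_per_side = size // base
        passes.append(tiles_per_side * tiles_per_side)
    return passes


counts = s2_num_encoder_passes([384, 768, 1152])
print(counts, sum(counts))  # [1, 4, 9] 14
```

So the 4 comes from the 2x2 grid at 768x768, and the total is 14 encoder passes per image.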
Is the number of image tokens 729?
For SigLIP, yes.
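As a quick check of where 729 comes from, the sketch below assumes a 384x384 input, a patch size of 14, and no extra class token; the helper name vit_num_patch_tokens is hypothetical.

```python
def vit_num_patch_tokens(image_size, patch_size):
    """Number of patch tokens for a square image (no class token counted)."""
    patches_per_side = image_size // patch_size  # 384 // 14 == 27
    return patches_per_side * patches_per_side


print(vit_num_patch_tokens(384, 14))  # 729
```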
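Hello, when training with your code, I found that training with S2 to increase the image input resolution is 4 times slower than training without S2. Is this normal?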