Yuliang-Liu / Monkey

【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
MIT License

Is the "sliding window" described in the TextMonkey paper meant to address the discontinuity caused by splitting the image into 448-sized blocks? #108

Closed sdjhshbswp closed 3 months ago

sdjhshbswp commented 3 months ago

Is Shifted Window Attention used to resolve the discontinuity introduced when ViT splits the image into 14*14 patches? Each 448 block is encoded by CLIP independently, so how is information exchanged across the different windows (cross-window)? The paper does not seem to explain this clearly.

echo840 commented 3 months ago

Hello. Shifted Window Attention is used to resolve the discontinuity between the multiple 448-sized patches. Through window attention, the tokens in each 448 patch are related to the tokens in the other 448 patches.
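To make the cross-window mixing concrete, below is a minimal NumPy sketch of the general Swin-style shifted-window partition (not the repository's actual implementation; the 4x4 token grid and window size 2 are toy values). Partitioning a regular grid keeps each token inside its own window, while cyclically rolling the grid by half a window before partitioning puts tokens from four neighboring windows into one new window, so attention inside that window links them:

```python
import numpy as np

def window_partition(grid, w):
    # grid: (H, W) array of token ids; split into non-overlapping w x w windows
    H, W = grid.shape
    return (grid.reshape(H // w, w, W // w, w)
                .transpose(0, 2, 1, 3)
                .reshape(-1, w, w))

# Toy 4x4 grid of token ids, window size 2 (illustrative values only)
H = W = 4
w = 2
grid = np.arange(H * W).reshape(H, W)

# Regular partition: tokens only attend within their own 2x2 window
regular = window_partition(grid, w)

# Shifted partition: cyclically roll the grid by w // 2 in both axes,
# so each new window contains tokens from four adjacent regular windows
shifted = window_partition(
    np.roll(grid, shift=(-(w // 2), -(w // 2)), axis=(0, 1)), w)

print(regular[0])  # tokens 0, 1, 4, 5 -- all from one original window
print(shifted[0])  # tokens 5, 6, 9, 10 -- one from each of four windows
```

Alternating regular and shifted partitions across successive attention layers is what propagates information between windows without paying the cost of global attention over all tokens.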