Yuliang-Liu / Monkey

【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
MIT License
1.77k stars 122 forks

Adaptive Select is not clear #130

Closed TayeeChang closed 4 days ago

TayeeChang commented 2 weeks ago

Hi authors,

The adaptive selection in MSAC is not clearly described in your paper. The paper only says: "The adaptive layer will adaptively generate an aspect ratio based on the detailed layer." So how exactly is the aspect ratio generated adaptively?

Thanks.

tigerzjh commented 2 weeks ago

In the source code, the implementation simply takes the second-best ratio.

TayeeChang commented 2 weeks ago

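This should be explained clearly in the paper, since MSAC is the core contribution. Could you point out where in the source code the "adaptively generate an aspect ratio" logic lives?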

tigerzjh commented 2 weeks ago

If the paper spelled it out, you might be disappointed, haha. I wrote an explainer: https://blog.csdn.net/u012863603/article/details/141670951. The source code is in that demo.py.

TayeeChang commented 2 weeks ago

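Does this URL require a subscription? Requiring a subscription for a write-up introducing your own paper's work doesn't seem very reasonable, does it?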

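```python
import torch

# Detailed layer (4-12 tiles); load_image and load_image2 are helpers from the demo.py script.
pixel_values, target_aspect_ratio = load_image('xxx.jpg', min_num=4, max_num=12)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
# Adaptive layer (3-7 tiles), conditioned on the detailed layer's aspect ratio.
pixel_values2 = load_image2('xxx.jpg', min_num=3, max_num=7, target_aspect_ratio=target_aspect_ratio)
pixel_values2 = pixel_values2.to(torch.bfloat16).cuda()
# Adaptive tiles + detailed tiles + one thumbnail (assumed to be the last entry of each batch).
pixel_values = torch.cat([pixel_values2[:-1], pixel_values[:-1], pixel_values2[-1:]], 0)
```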

How did you choose these min_num and max_num values?

tigerzjh commented 2 weeks ago

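First, I'm not one of the authors. Second, I was confused too; I spent my own time reading the source code, and I've already answered your question. Third, I never asked you to subscribe, and since point two already answered it for free, whether paying for knowledge is worthwhile is a matter of personal opinion.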

mxin262 commented 2 weeks ago

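Thank you for your interest in Mini-Monkey. The adaptive implementation does not pick the second-best ratio. The adaptivity works like this: once the previous (detailed) layer has selected the best aspect ratio, this layer uses that ratio to adaptively exclude ratios that stand in a multiple relationship with it, and then selects the best ratio among the remaining ones. This prevents the same object or text from being cut in different layers. The main code is here and here.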
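The selection rule described in the reply above can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's code: it assumes each layer enumerates candidate tile grids `(cols, rows)` with `min_num <= cols * rows <= max_num`, that "best" means the grid whose `cols / rows` is closest to the image's width/height (ties go to fewer tiles), and it reads "a multiple relationship" as the detailed grid being an integer multiple of the candidate along both axes, which is exactly the case where every cut line of the candidate is also a cut line of the detailed grid. The function names are hypothetical, and the empty-list fallback is an added assumption.

```python
from typing import List, Tuple

Grid = Tuple[int, int]  # (columns, rows) of tiles


def candidate_grids(min_num: int, max_num: int) -> List[Grid]:
    """All (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = [
        (c, r)
        for c in range(1, max_num + 1)
        for r in range(1, max_num + 1)
        if min_num <= c * r <= max_num
    ]
    # Deterministic order: fewer tiles first, so ties favour smaller grids.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0], g[1]))


def closest_grid(aspect: float, grids: List[Grid]) -> Grid:
    """Grid whose cols/rows ratio is closest to the image aspect ratio (first wins on ties)."""
    return min(grids, key=lambda g: abs(aspect - g[0] / g[1]))


def is_multiple_of(big: Grid, small: Grid) -> bool:
    """True if `big` is an integer multiple of `small` along both axes,
    so every cut line of `small` is also a cut line of `big`."""
    return big[0] % small[0] == 0 and big[1] % small[1] == 0


def select_grids(width: int, height: int,
                 detailed_range: Tuple[int, int] = (4, 12),
                 adaptive_range: Tuple[int, int] = (3, 7)) -> Tuple[Grid, Grid]:
    aspect = width / height
    # Detailed layer: plain best-fit grid.
    detailed = closest_grid(aspect, candidate_grids(*detailed_range))
    # Adaptive layer: drop grids that the detailed grid is a multiple of, then take the best fit.
    candidates = candidate_grids(*adaptive_range)
    allowed = [g for g in candidates if not is_multiple_of(detailed, g)]
    # Assumed fallback if every candidate was excluded.
    adaptive = closest_grid(aspect, allowed or candidates)
    return detailed, adaptive


if __name__ == "__main__":
    print(select_grids(1000, 1000))  # ((2, 2), (2, 3))
    print(select_grids(1000, 500))   # ((4, 2), (3, 2))
```

For a square image the detailed layer picks a 2x2 grid; without the exclusion the adaptive layer would also pick 2x2 and cut the image along the same lines, whereas with it the adaptive layer falls back to the next-best 2x3 grid.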

mxin262 commented 1 week ago

Has your issue been resolved? Is there anything else we can help with?

mxin262 commented 4 days ago

Since there was no response for a long time, I closed it. Please feel free to reopen it if you have any further questions.