Adaptive Select is not clear

Yuliang-Liu / Monkey

【CVPR 2024 Highlight】Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models

MIT License

Adaptive Select is not clear #130

Closed TayeeChang closed 4 days ago

TayeeChang commented 2 weeks ago

Hi authors,

The adaptive selection in MSAC is not clearly described in your paper. The paper only says: "The adaptive layer will adaptively generate an aspect ratio based on the detailed layer". So how exactly is the aspect ratio adaptively generated?

Thanks.

tigerzjh commented 2 weeks ago

The source code implementation simply takes the second-best ratio.

TayeeChang commented 2 weeks ago

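This should be explained clearly in the paper, since MSAC is the core contribution. Could you point to where exactly in the source code "adaptively generate an aspect ratio" is implemented?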

tigerzjh commented 2 weeks ago

If the paper spelled it out, you might be disappointed, haha. I wrote a walkthrough: https://blog.csdn.net/u012863603/article/details/141670951 The source code is in demo.py.

TayeeChang commented 2 weeks ago

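That URL requires a subscription? Requiring a subscription for a post that introduces your own paper seems unreasonable, doesn't it?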

pixel_values, target_aspect_ratio = load_image('xxx.jpg', min_num=4, max_num=12)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
pixel_values2 = load_image2('xxx.jpg', min_num=3, max_num=7, target_aspect_ratio=target_aspect_ratio)
pixel_values2 = pixel_values2.to(torch.bfloat16).cuda()
pixel_values = torch.cat([pixel_values2[:-1], pixel_values[:-1], pixel_values2[-1:]], 0)

How did you choose min_num and max_num here?

tigerzjh commented 2 weeks ago

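First, I am not one of the authors. Second, I was puzzled too, so I spent my own time reading the source code, and I have already answered your question. Third, I did not ask you to subscribe, and given the second point your question was already answered for free. Whether paid knowledge subscriptions are worthwhile is a matter of opinion.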

mxin262 commented 2 weeks ago

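Thank you for your interest in Mini-Monkey. The adaptive implementation does not select the second-best ratio. The adaptivity works like this: once the previous layer has selected a best aspect ratio, this layer uses that aspect ratio to adaptively avoid any ratio that is a multiple of it, and then picks the best ratio among those that remain. This keeps an object or a piece of text from being split in more than one layer at the same time. The code is mainly here and here.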
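
For readers following along, here is a minimal sketch of the selection described above. It is not the repository's code, and everything in it is an assumption made for illustration. The helper names (`candidate_grids`, `is_multiple`, `closest_grid`, `select_adaptive_grid`) are hypothetical. "Multiple relationship" is read here as one (cols, rows) grid being an integer scaling of the other. "Best" is read as closest to the image's width/height ratio. The actual implementation may define these differently.

```python
def candidate_grids(min_num, max_num):
    """All (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (i, j)
        for i in range(1, max_num + 1)
        for j in range(1, max_num + 1)
        if min_num <= i * j <= max_num
    }
    # Sort by tile count so ties in closeness favor fewer tiles.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def is_multiple(a, b):
    """True if grid a is an integer scaling of grid b, or vice versa (assumed rule)."""
    def scales(big, small):
        return (
            big[0] % small[0] == 0
            and big[1] % small[1] == 0
            and big[0] // small[0] == big[1] // small[1]
        )
    return scales(a, b) or scales(b, a)


def closest_grid(aspect, grids):
    """Grid whose cols/rows ratio is closest to the image aspect ratio (w / h)."""
    return min(grids, key=lambda g: abs(aspect - g[0] / g[1]))


def select_adaptive_grid(width, height, prior_grid, min_num=3, max_num=7):
    """Pick the best grid for the adaptive layer, skipping multiples of prior_grid."""
    all_grids = candidate_grids(min_num, max_num)
    allowed = [g for g in all_grids if not is_multiple(g, prior_grid)]
    # Fall back to the unfiltered list if the filter removes every candidate.
    return closest_grid(width / height, allowed or all_grids)


# Detailed layer: best grid among 4..12 tiles for a square image.
detailed = closest_grid(800 / 800, candidate_grids(4, 12))
# Adaptive layer: best remaining grid among 3..7 tiles, avoiding multiples of `detailed`.
adaptive = select_adaptive_grid(800, 800, detailed)
print(detailed, adaptive)  # (2, 2) (2, 3)
```

In this sketch a square image gets a 2x2 grid in the detailed layer, so 2x2 is excluded in the adaptive layer and the closest remaining grid, 2x3, is chosen. That differs from simply taking the second-best ratio whenever the second-best ratio happens to be a multiple of the first.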

mxin262 commented 1 week ago

Has your issue been resolved? Is there anything else you need help with?

mxin262 commented 4 days ago

Since there was no response for a long time, I closed it. Please feel free to reopen it if you have any further questions.