CircleRadon / TokenPacker

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".

Comparison with TextHawk #8

Closed by yuyq96 1 month ago

yuyq96 commented 1 month ago

Hello, I'm the author of TextHawk. I have noticed that TokenPacker shares the same starting points as TextHawk, including token compression and multi-level features. It's great to see more people getting involved in this field. TextHawk was released in April this year, and I hope you can include a comparison with it in your paper. TextHawk paper: https://arxiv.org/abs/2404.09204

LiWentomng commented 1 month ago

@yuyq96 Well, I hadn't noticed TextHawk before. I just read your paper, and there are definitely some aspects similar to our work. We will discuss your work in a later version of our paper.

yuyq96 commented 1 month ago

Great! I expect similar structures to be explored more fully in future work. By the way, the next generation of TextHawk, which we plan to release, will further demonstrate its effectiveness.

LiWentomng commented 1 month ago

@yuyq96 Hello, could you please share the amount of instruction data that TextHawk used in the Mixed Resolution Supervised Fine-Tuning stage? I found the data amounts for the first two stages in your paper.

yuyq96 commented 1 month ago

@LiWentomng Sure, I can provide it tomorrow.

yuyq96 commented 1 month ago

@LiWentomng TextHawk used ~2200K raw instruction samples: 665K from LLaVA-1.5, 365K from Shikra, 102K from ShareGPT4V, 136K from SVIT, 848K from ALLaVA-Instruct&Text, and 64K (+30K) from UReader (+DocGemini). After concatenation, the number of samples is 796K.
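
For reference, here is a quick tally of the per-dataset counts quoted above. This is just a minimal sketch to check the arithmetic; the dictionary keys are labels taken from the comment, not identifiers from the TextHawk codebase.

```python
# Raw instruction data sizes as reported above, in thousands of samples.
raw_counts_k = {
    "LLaVA-1.5": 665,
    "Shikra": 365,
    "ShareGPT4V": 102,
    "SVIT": 136,
    "ALLaVA-Instruct&Text": 848,
    "UReader": 64,
    "DocGemini": 30,
}

# Sum the per-dataset counts to verify the stated total.
total_k = sum(raw_counts_k.values())
print(f"~{total_k}K raw instruction samples")  # ~2210K, consistent with the ~2200K figure
```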

LiWentomng commented 1 month ago

@yuyq96 OK, thanks.