Closed yuyq96 closed 1 month ago
@yuyq96 Well, I hadn't noticed TextHawk before. I just read your paper, and there are indeed some aspects similar to our work. We will discuss your work in a later revision of our paper.
Great! I expect similar structures to be explored more fully in future work. By the way, the next generation of TextHawk, which we plan to release, will further demonstrate its effectiveness.
@yuyq96 Hello, could you please share the amount of instruction data that TextHawk used in the Mixed Resolution Supervised Fine-Tuning stage? I found the data numbers for the first two stages in your paper.
@LiWentomng Sure, I can provide it tomorrow.
@LiWentomng TextHawk used ~2200K raw instruction samples, including 665K from LLaVA-1.5, 365K from Shikra, 102K from ShareGPT4v, 136K from SVIT, 848K from ALLaVA-Instruct&Text, and 64K (+30K) from UReader (+DocGemini). After concatenation, the number of samples is 796K.
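As a quick sanity check, the per-dataset counts quoted above roughly sum to the stated ~2200K raw total. This is just arithmetic on the figures in the comment, not anything from the TextHawk codebase; the dictionary keys simply mirror the dataset names listed there:

```python
# Raw instruction data sizes in thousands of samples, as listed in the comment.
raw_data_k = {
    "LLaVA-1.5": 665,
    "Shikra": 365,
    "ShareGPT4v": 102,
    "SVIT": 136,
    "ALLaVA-Instruct&Text": 848,
    "UReader": 64,
    "DocGemini": 30,  # the "+30K" add-on to UReader
}

total_k = sum(raw_data_k.values())
print(f"raw total: ~{total_k}K samples")  # 2210K, i.e. roughly the stated ~2200K
```

Note that the 796K figure is smaller than the raw total because multiple short samples are concatenated into single training sequences, so it counts concatenated sequences rather than raw samples.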
@yuyq96 OK, thanks.
Hello, I'm the author of TextHawk. I noticed that TokenPacker shares the same starting points as TextHawk, namely token compression and multi-level features. It's great to see more people getting involved in this field. TextHawk was released in April this year, and I hope you can compare against it in your paper. TextHawk paper: https://arxiv.org/abs/2404.09204