Sere1nz closed 1 month ago
Thank you very much for your attention to our paper on text chunking. The issue you raised regarding the segmentation of texts containing tables and images in real-world scenarios is indeed a practical problem worthy of in-depth exploration. In our research, we have primarily focused on segmentation algorithms for long natural language texts, specifically on how to learn efficient text segmentation methods through logical perception.
However, your insights remind us that in practical applications, multimodal information is important for understanding and analyzing text content. Currently, the refined processing of multimodal information falls outside the scope of our research, so we may only be able to provide you with some potential ideas. For instance, you could first convert PDF documents into the more easily processable Markdown format [1,2] (which we found quite useful in our previous experience), then segment the text, and finally utilize a multimodal retrieval model to match the text with images. We hope this information is of assistance to you!
[1] https://mp.weixin.qq.com/s/P7-VhEpoNDkTJbhN7dGExA
[2] https://mp.weixin.qq.com/s/Ntqu8RrsJd07fRcJi8JShw
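As a rough illustration of the "convert to Markdown, then segment" idea, here is a minimal sketch of one way to keep a table or image in the same chunk as the text that precedes it. It is not the method from the paper. It assumes the PDF has already been converted to Markdown, that blocks are separated by blank lines, that tables use pipe syntax, and that images use `![alt](path)` syntax. The names `is_attachment` and `chunk_markdown` are hypothetical helpers made up for this example.

```python
import re
from typing import List

# Assumption: images appear on their own line in standard Markdown syntax.
IMAGE_RE = re.compile(r"^!\[[^\]]*\]\([^)]*\)\s*$")


def is_attachment(block: str) -> bool:
    """Return True if the block consists only of table rows or only of images."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    if not lines:
        return False
    is_table = all(ln.startswith("|") for ln in lines)
    is_image = all(IMAGE_RE.match(ln) for ln in lines)
    return is_table or is_image


def chunk_markdown(markdown: str) -> List[str]:
    """Split Markdown on blank lines, merging tables/images into the previous chunk."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    chunks: List[str] = []
    for block in blocks:
        if chunks and is_attachment(block):
            # Keep the table or image together with the text it follows.
            chunks[-1] = chunks[-1] + "\n\n" + block
        else:
            chunks.append(block)
    return chunks
```

In practice the paragraph-level split here would be replaced by whatever text segmentation method you use. The only point of the sketch is the merge step, which treats a table or image as belonging to the preceding text rather than as a chunk of its own.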
Thank you.
It seems we are only talking about text, but what if relevant tables or images containing statistics or knowledge come after the text? In that case the tables and images belong to the preceding text and should be in the same chunk. What should we do in this case?