Open mllife opened 2 months ago
Hey @mllife! Thanks for the note. As you can see in matching_post_processor.py, we already have fairly involved post-processing with cell matching, massaging, orphan picking, etc.
In the output, each cell bounding box encompasses the content that is located in that cell, so the coordinates of cells in the same line might not always match pixel-wise.
If you could provide some open examples to help us understand the problem, that would help. Alternatively, feel free to modify the code and open a PR, so we can also run it on the vast collection of tables that we have.
Thanks again!
I will try to create some similar artificial examples and share them with you. I guess I also need to improve the preprocessing steps on my side, as I wrote my own backbone parser with pymupdf to integrate with your code and I am using low-resolution images as input. Can you tell me whether increasing the input page image resolution would help?
Hey @mllife, increasing the resolution certainly can help. We noticed a bump in accuracy when we increased the resolution from 72 dpi to 150 dpi, but anything above that doesn't help.
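For anyone following along: if the page image is rendered at a higher dpi, the token boxes passed alongside it need the same scale factor. A minimal sketch, assuming tokens are plain dicts with an `(x0, y0, x1, y1)` box in PDF points (1/72 inch, so equal to pixels at 72 dpi). The `scale_bbox` helper and the token layout are hypothetical and not part of docling or pymupdf.

```python
def scale_bbox(bbox, src_dpi=72.0, dst_dpi=150.0):
    """Scale an (x0, y0, x1, y1) box from src_dpi pixel space to dst_dpi pixel space."""
    return tuple(v * dst_dpi / src_dpi for v in bbox)


# Hypothetical token extracted from the PDF, in points (equivalent to 72 dpi pixels).
tokens = [{"text": "Total", "bbox": (72.0, 144.0, 108.0, 156.0)}]

# Same tokens expressed in the coordinate space of a 150 dpi page image.
scaled = [{**tok, "bbox": scale_bbox(tok["bbox"])} for tok in tokens]
```

The same factor has to be applied in reverse when mapping predicted cell boxes back to PDF coordinates.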
Thanks for the help. I will try it out and update here.
I updated my code to accept high-dpi input and mapped the tokens accordingly, and I see some improvement. The model still struggles at random when tables have big cells with a lot of content in a single cell. Hopefully, you will add some checks that all the tokens inside a table have been assigned to some cell.
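To make the kind of check I mean concrete, here is a rough sketch. It is not the logic in matching_post_processor.py. It assumes hypothetical dict-based tokens and cells with `(x0, y0, x1, y1)` boxes, finds tokens whose center lies inside the table but inside no cell, and attaches each one to the closest cell box.

```python
def center(box):
    """Center point of an (x0, y0, x1, y1) box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def contains(box, pt):
    """True if point pt lies inside box (edges included)."""
    return box[0] <= pt[0] <= box[2] and box[1] <= pt[1] <= box[3]


def sq_dist_to_box(box, pt):
    """Squared distance from pt to the nearest point of box (0 if inside)."""
    dx = max(box[0] - pt[0], 0, pt[0] - box[2])
    dy = max(box[1] - pt[1], 0, pt[1] - box[3])
    return dx * dx + dy * dy


def find_orphans(table_bbox, cells, tokens):
    """Tokens centered inside the table but not inside any cell box."""
    orphans = []
    for tok in tokens:
        c = center(tok["bbox"])
        if contains(table_bbox, c) and not any(contains(cell["bbox"], c) for cell in cells):
            orphans.append(tok)
    return orphans


def assign_to_nearest(orphans, cells):
    """Append each orphan token's text to the cell whose box is closest."""
    for tok in orphans:
        c = center(tok["bbox"])
        nearest = min(cells, key=lambda cell: sq_dist_to_box(cell["bbox"], c))
        nearest["tokens"].append(tok["text"])


# Hypothetical data: two cells, one token below them, one token outside the table.
table_bbox = (0, 0, 200, 100)
cells = [
    {"bbox": (0, 0, 100, 40), "tokens": ["foo"]},
    {"bbox": (100, 0, 200, 40), "tokens": []},
]
tokens = [
    {"text": "foo", "bbox": (10, 10, 30, 20)},
    {"text": "bar", "bbox": (110, 60, 130, 70)},
    {"text": "outside", "bbox": (300, 10, 320, 20)},
]

orphans = find_orphans(table_bbox, cells, tokens)
assign_to_nearest(orphans, cells)
```

Nearest-box distance is only one possible rule. For long cells with wrapped text it may be better to prefer the cell directly above the orphan, which this sketch does not attempt.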
This is the issue I am facing: https://github.com/DS4SD/docling/issues/278 (missing text assignment for long cells).
Thanks for all your work. I see that in some tables, not all the tokens are assigned to cell text. I think this can be handled in post-processing by making sure that all tokens within the table bounding box are assigned to some cell (row/column). Also, sometimes the cells in a row are not aligned. I think this can be fixed by checking the Xmin of each cell within the row, basically to keep everything parallel. Can you please look into these cases? I think the first one should be an obvious one.
Sorry, I would have shared examples to check, but the documents I have are sensitive. I will try to find similar examples and share them if possible.
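As a rough illustration of the alignment idea, here is a minimal sketch that snaps cell edges to shared values: the same Xmin/Xmax for every cell in a column and the same Ymin/Ymax for every cell in a row. It assumes each cell is a dict with hypothetical `row` and `col` indices and an `(x0, y0, x1, y1)` box, and it ignores spanning cells. It is not how docling does it.

```python
from collections import defaultdict


def align_edges(cells, group_key, lo, hi):
    """Give all cells sharing cell[group_key] the same lo/hi box edges.

    lo and hi are indices into the (x0, y0, x1, y1) box; the shared values are
    the group's minimum lo edge and maximum hi edge.
    """
    groups = defaultdict(list)
    for cell in cells:
        groups[cell[group_key]].append(cell)
    for members in groups.values():
        lo_val = min(c["bbox"][lo] for c in members)
        hi_val = max(c["bbox"][hi] for c in members)
        for c in members:
            box = list(c["bbox"])
            box[lo], box[hi] = lo_val, hi_val
            c["bbox"] = tuple(box)
    return cells


# Hypothetical 2x2 table whose predicted boxes are slightly misaligned.
cells = [
    {"row": 0, "col": 0, "bbox": (10, 12, 90, 30)},
    {"row": 0, "col": 1, "bbox": (102, 10, 180, 34)},
    {"row": 1, "col": 0, "bbox": (14, 40, 88, 58)},
    {"row": 1, "col": 1, "bbox": (100, 42, 185, 60)},
]

align_edges(cells, "col", lo=0, hi=2)  # shared Xmin / Xmax within each column
align_edges(cells, "row", lo=1, hi=3)  # shared Ymin / Ymax within each row
```

After both passes the boxes form a regular grid, which trades tight fit around the content for parallel edges.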