failable closed 5 months ago
Hi @failable , thank you for the query.
The PDF file format does not include any native support for tables. (That is why the PyMuPDF find_tables() function refers repeatedly to "finding" or "detecting" the table boundaries and layout.) Consequently, there is no functionality in Pdfium to handle tables in the manner you are asking for, and therefore no intent to add functionality to pdfium-render to support that use case, at least prior to the release of crate version 1.0.
Post version 1.0, I am open to adding enhancements like this. I have made a note in #29 to consider this feature again at that point. But prior to 1.0 there is no intention for pdfium-render to support any functionality that Pdfium itself does not already provide.
If you have a working solution using PyMuPDF, why not simply stick with it?
Thanks for the information. I'm porting my application to Rust for a faster solution. I'm going to spend some time digging into it.
Whatever approach PyMuPDF is taking, it's likely it can be replicated with pdfium-render. The most basic approach would be to build a data structure containing all the bounding boxes of all the text objects on a page, and then compare those bounding boxes to see which have x or y coordinates in common. That would imply a grid arrangement, and from there you should be able to extract the text out of each text object in the grid and put it into some other table format.
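As a rough illustration of that bounding-box comparison, here is a minimal sketch in plain Rust. It deliberately makes no pdfium-render calls: `TextBox`, `cluster`, `nearest`, and `boxes_to_grid` are hypothetical names invented for this example, and it assumes you have already collected each text object's left and bottom coordinates and its text from the page. It also assumes cells are left-aligned and share a baseline within a tolerance `tol`, which real tables will not always satisfy.

```rust
/// Position and content of one text object on a page.
/// Hypothetical type: populate it from whatever your PDF library reports.
#[derive(Debug, Clone)]
pub struct TextBox {
    pub left: f32,   // x coordinate of the left edge
    pub bottom: f32, // y coordinate of the bottom edge (PDF y grows upward)
    pub text: String,
}

/// Collapses coordinates that lie within `tol` of a group's smallest value
/// into one representative per group, returned in ascending order.
fn cluster(mut values: Vec<f32>, tol: f32) -> Vec<f32> {
    values.sort_by(|a, b| a.total_cmp(b));
    let mut reps: Vec<f32> = Vec::new();
    for v in values {
        match reps.last() {
            Some(&last) if (v - last).abs() <= tol => {} // same row/column
            _ => reps.push(v),                           // start a new one
        }
    }
    reps
}

/// Index of the representative closest to `v`. `reps` must be non-empty.
fn nearest(reps: &[f32], v: f32) -> usize {
    let mut best = 0;
    for (i, r) in reps.iter().enumerate() {
        if (r - v).abs() < (reps[best] - v).abs() {
            best = i;
        }
    }
    best
}

/// Arranges text boxes into a row-major grid by shared x and y coordinates.
/// Boxes that land in the same cell are joined with a space.
pub fn boxes_to_grid(boxes: &[TextBox], tol: f32) -> Vec<Vec<String>> {
    let cols = cluster(boxes.iter().map(|b| b.left).collect(), tol);
    let mut rows = cluster(boxes.iter().map(|b| b.bottom).collect(), tol);
    rows.reverse(); // largest y first, so the top row of the page comes first

    let mut grid = vec![vec![String::new(); cols.len()]; rows.len()];
    for b in boxes {
        let r = nearest(&rows, b.bottom);
        let c = nearest(&cols, b.left);
        if !grid[r][c].is_empty() {
            grid[r][c].push(' ');
        }
        grid[r][c].push_str(&b.text);
    }
    grid
}
```

Calling `boxes_to_grid(&boxes, 1.0)` on the text boxes of a page returns one `Vec<String>` per detected row, with empty strings for cells that have no text. Right-aligned or centered columns, merged cells, and text outside the table would need extra handling, for example clustering on box centers or restricting the input to a region of the page.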
Hi, can you please add an example of handling tables? I was using pymupdf for this in Python before. Thanks.