ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.
https://crates.io/crates/pdfium-render

Example of extracting tables? #149

Closed: failable closed this issue 5 months ago

failable commented 5 months ago

Hi, can you please add an example of handling tables? I was previously using PyMuPDF for this in Python.

Thanks.

ajrcarey commented 5 months ago

Hi @failable, thank you for the query.

The PDF file format does not include any native support for tables. (That is why the PyMuPDF find_tables() function refers repeatedly to "finding" or "detecting" the table boundaries and layout.) Consequently, there is no functionality in Pdfium to handle tables in the manner you are asking for, and therefore no intent to add functionality to pdfium-render to support that use case - at least prior to the release of crate version 1.0.

Post version 1.0, I am open to adding enhancements like this. I have made a note in #29 to consider this feature again at that point. But prior to 1.0 there is no intention for pdfium-render to support any functionality that Pdfium itself does not already provide.

If you have a working solution using PyMuPDF, why not simply stick with it?

failable commented 5 months ago

Thanks for the information. I'm porting my application to Rust for a faster solution. I'm going to spend some time digging into it.

ajrcarey commented 5 months ago

Whatever approach PyMuPDF is taking, it's likely it can be replicated with pdfium-render. The most basic approach would be to build a data structure containing all the bounding boxes of all the text objects on a page, and then compare those bounding boxes to see which have x or y coordinates in common. That would imply a grid arrangement, and from there you should be able to extract the text out of each text object in the grid and put it into some other table format.
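
The bounding-box comparison described above could be sketched roughly as follows. This is only an illustration, not PyMuPDF's algorithm and not part of pdfium-render: `TextBox`, `cluster`, `nearest`, `boxes_to_grid`, and the `tolerance` parameter are all hypothetical names, and the sketch assumes the left and bottom coordinates and the text of each text object on the page have already been collected into a plain list.

```rust
/// Hypothetical record for one text object: the left and bottom edges of its
/// bounding box in page coordinates (PDF origin is bottom-left) and its text.
#[derive(Debug, Clone)]
struct TextBox {
    left: f32,
    bottom: f32,
    text: String,
}

/// Collapse nearly equal coordinates into a sorted list of distinct values.
/// The smallest value of each cluster is kept as its representative.
fn cluster(mut values: Vec<f32>, tolerance: f32) -> Vec<f32> {
    values.sort_by(|a, b| a.total_cmp(b));
    let mut out: Vec<f32> = Vec::new();
    for v in values {
        match out.last() {
            // Close enough to the current cluster: treat as the same line.
            Some(&last) if (v - last).abs() <= tolerance => {}
            _ => out.push(v),
        }
    }
    out
}

/// Index of the cluster representative nearest to `value`.
fn nearest(clusters: &[f32], value: f32) -> usize {
    let mut best = 0;
    for (i, c) in clusters.iter().enumerate() {
        if (*c - value).abs() < (clusters[best] - value).abs() {
            best = i;
        }
    }
    best
}

/// Arrange text boxes into rows and columns by shared left/bottom edges.
/// Assumption: cells in a column are left-aligned and cells in a row share
/// a baseline; real tables may need centre or right edges instead.
fn boxes_to_grid(boxes: &[TextBox], tolerance: f32) -> Vec<Vec<String>> {
    let cols = cluster(boxes.iter().map(|b| b.left).collect(), tolerance);
    let mut rows = cluster(boxes.iter().map(|b| b.bottom).collect(), tolerance);
    rows.reverse(); // y grows upward in PDF, so the largest y is the top row

    let mut grid = vec![vec![String::new(); cols.len()]; rows.len()];
    for b in boxes {
        let r = nearest(&rows, b.bottom);
        let c = nearest(&cols, b.left);
        // Several text objects may land in one cell; join them with a space.
        if !grid[r][c].is_empty() {
            grid[r][c].push(' ');
        }
        grid[r][c].push_str(&b.text);
    }
    grid
}
```

Calling `boxes_to_grid(&boxes, 2.0)` on the boxes from one page returns a `Vec<Vec<String>>` with the top row first, which can then be written out as CSV or any other table format. The tolerance is in page units and would need tuning per document; text that is not part of a table will also end up in the grid unless it is filtered out beforehand.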