ancwrd1 / qpdf-rs

Rust bindings for QPDF C++ library

[Question] How to get bounding box information for text on a page? #1

Closed arifd closed 1 year ago

arifd commented 2 years ago

Hello! I am interested in using qpdf-rs to extract text and its bounding box information from a PDF.

I'm not very familiar with the PDF spec, so perhaps this already has me at a disadvantage.

Would you mind showing some example code for how I can achieve this? Thank you very much!

ancwrd1 commented 2 years ago

Hi, QPDF (and this project) lets you read and manipulate PDF objects as a whole. I don't think you can easily extract text from it, because the page must be rendered first. The concept of 'text' is vague anyway: it can be a sequence of vector drawing operators or the operand of a Tj text-showing operator. The best you can probably do is to get the page contents from each page and try to parse them.

arifd commented 2 years ago

Is getting the page contents and parsing them like finding some object that explicitly says it represents Unicode, then reading that into a String while also grabbing the bounding box (is it called MediaBox?) at the same time?

I'm quite happy to fall back to OCR when I know the text and its bounding boxes cannot be extracted confidently.

ancwrd1 commented 2 years ago

Pages are just dictionaries in QPDF; you can check test_qpdf.rs/test_pdf_ops to see how to get page contents. What you get is the raw content stream, which will most likely be a sequence of PDF operators. The page's bounding box is /MediaBox, yes (it describes the bounds of the whole page, not of individual pieces of text).

ancwrd1 commented 2 years ago

So the code might look like this for example:

let qpdf = QPdf::read("/path/to/pdf")?;
let pages = qpdf.get_pages()?;
for page in pages {
    // /MediaBox holds the page bounds as an array of four numbers.
    if let Some(obj) = page.get("/MediaBox") {
        let array: QPdfArray = obj.into();
        println!("mediabox: {:?}", array);
    }
    let content = page.get_page_content_data()?;
    println!("content: {}", String::from_utf8_lossy(content.as_ref()));
}
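
As a rough idea of what "trying to parse it" could involve, here is a minimal sketch that scans an already decoded content stream for `Tj` operators and records the text-space offset set by the preceding `Td` operators. It uses only the standard library and is not part of qpdf-rs; `Token`, `tokenize` and `text_runs` are hypothetical names made up for this example. It assumes simple literal strings and ignores `TJ` arrays, `Tm`, the current transformation matrix, font encodings and escape decoding, so the positions are line origins in text space rather than real bounding boxes.

```rust
/// A token from a decoded PDF content stream; just enough for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(f64),
    Literal(String), // a `(...)` literal string
    Operator(String), // anything else (operators, names, brackets)
}

/// Splits a content stream into numbers, literal strings and other words.
/// Simplification: hex strings, arrays and dictionaries get no special handling.
fn tokenize(content: &str) -> Vec<Token> {
    let chars: Vec<char> = content.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c == '(' {
            // Literal string: parentheses may nest, and a backslash makes the
            // next character literal (escapes such as \n are NOT decoded).
            let mut depth = 1;
            let mut text = String::new();
            i += 1;
            while i < chars.len() && depth > 0 {
                match chars[i] {
                    '\\' if i + 1 < chars.len() => {
                        i += 1;
                        text.push(chars[i]);
                    }
                    '(' => {
                        depth += 1;
                        text.push('(');
                    }
                    ')' => {
                        depth -= 1;
                        if depth > 0 {
                            text.push(')');
                        }
                    }
                    other => text.push(other),
                }
                i += 1;
            }
            tokens.push(Token::Literal(text));
        } else {
            let start = i;
            while i < chars.len() && !chars[i].is_whitespace() && chars[i] != '(' {
                i += 1;
            }
            let word: String = chars[start..i].iter().collect();
            match word.parse::<f64>() {
                Ok(n) => tokens.push(Token::Number(n)),
                Err(_) => tokens.push(Token::Operator(word)),
            }
        }
    }
    tokens
}

/// Returns (x, y, text) for every `Tj`, where (x, y) is the accumulated `Td`
/// offset since the last `BT`, in text space.
fn text_runs(content: &str) -> Vec<(f64, f64, String)> {
    let mut runs = Vec::new();
    let mut operands: Vec<Token> = Vec::new();
    let (mut x, mut y) = (0.0, 0.0);
    for token in tokenize(content) {
        match token {
            Token::Operator(op) => {
                match op.as_str() {
                    // A new text object resets the text position.
                    "BT" => {
                        x = 0.0;
                        y = 0.0;
                    }
                    // `tx ty Td` moves the line origin by (tx, ty).
                    "Td" => {
                        if let [.., Token::Number(tx), Token::Number(ty)] = operands.as_slice() {
                            x += *tx;
                            y += *ty;
                        }
                    }
                    // `(string) Tj` shows a string at the current position.
                    "Tj" => {
                        if let Some(Token::Literal(s)) = operands.last() {
                            runs.push((x, y, s.clone()));
                        }
                    }
                    _ => {}
                }
                operands.clear();
            }
            other => operands.push(other),
        }
    }
    runs
}

fn main() {
    // Hand-written sample stream; real page content would come from the PDF.
    let content = "BT /F1 12 Tf 72 720 Td (Hello \\(PDF\\)) Tj 0 -14 Td (second line) Tj ET";
    for (x, y, text) in text_runs(content) {
        println!("({x}, {y}): {text}");
    }
}
```

Turning these positions into actual bounding boxes would additionally need the font metrics and the text and page transformation matrices, which is why falling back to OCR or a rendering library may still be the more reliable route.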