arifd closed this issue 1 year ago
Hi, QPDF (and this project) lets you read and manipulate PDF objects as a whole. I don't think you can easily extract text with it, because the page must be rendered first (the concept of 'text' is vague anyway: it can be a sequence of vector drawing operators or the output of a Tj operator). The best you can probably do is get the contents of each page and try to parse them.
So getting the page contents and parsing them means finding some object that explicitly says it represents Unicode text, reading that into a String, and grabbing the bounding box (is it called MediaBox?) at the same time?
I'm quite happy to fall back to OCR when I know the text and its bounding boxes cannot be extracted confidently.
Pages are just dictionaries in QPdf; you can check test_qpdf.rs/test_pdf_ops to see how to get page contents.
What you get then is the raw content, which will most likely be a sequence of PDF operators.
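To illustrate what parsing those operators could involve, here is a minimal sketch that pulls out the literal strings shown by Tj and TJ in an already decoded content stream. It is not part of qpdf-rs: `shown_strings` and `SAMPLE` are hypothetical names, and the scanner ignores hex strings, octal escapes and inline images. The strings it returns are character codes in the font's encoding, so they are only readable as-is for simple encodings; mapping them to Unicode in general needs the font's encoding or ToUnicode data.

```rust
/// Hypothetical sample content stream, as it might look after decoding.
const SAMPLE: &str = r"BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
[(Hel) -20 (lo \(again\))] TJ
ET";

/// Collects the literal strings shown by `Tj` / `TJ` operators.
/// Simplification: a backslash escape is reduced to the escaped character,
/// so `\(` becomes `(`, but `\n` becomes `n` rather than a newline.
fn shown_strings(content: &str) -> Vec<String> {
    let chars: Vec<char> = content.chars().collect();
    let mut shown = Vec::new();
    let mut pending = String::new(); // string operands seen since the last operator
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '(' {
            // Literal string: parentheses may nest, a backslash escapes the next char.
            let mut depth = 1;
            i += 1;
            while i < chars.len() && depth > 0 {
                match chars[i] {
                    '\\' if i + 1 < chars.len() => {
                        i += 1;
                        pending.push(chars[i]);
                    }
                    '(' => {
                        depth += 1;
                        pending.push('(');
                    }
                    ')' => {
                        depth -= 1;
                        if depth > 0 {
                            pending.push(')');
                        }
                    }
                    other => pending.push(other),
                }
                i += 1;
            }
        } else if c.is_whitespace() || c == '[' || c == ']' {
            i += 1;
        } else {
            // Any other token: a number, a /Name, or an operator keyword.
            let start = i;
            while i < chars.len() && !chars[i].is_whitespace() && !"()[]".contains(chars[i]) {
                i += 1;
            }
            let token: String = chars[start..i].iter().collect();
            let is_operand = token.starts_with('/') || token.parse::<f64>().is_ok();
            if token == "Tj" || token == "TJ" {
                if !pending.is_empty() {
                    shown.push(std::mem::take(&mut pending));
                }
            } else if !is_operand {
                pending.clear(); // another operator consumed the operands
            }
        }
    }
    shown
}
```

Calling `shown_strings(SAMPLE)` yields the two strings "Hello, world" and "Hello (again)"; a stream with only drawing operators yields nothing, which could serve as one signal for falling back to OCR.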
The page bounding box is /MediaBox, yes.
So the code might look like this, for example:
use qpdf::*;

let qpdf = QPdf::read("/path/to/pdf")?;
let pages = qpdf.get_pages()?;
for page in pages {
    // /MediaBox may be absent on the page itself (it can be inherited from a parent node).
    if let Some(obj) = page.get("/MediaBox") {
        let array: QPdfArray = obj.into();
        println!("mediabox: {:?}", array);
    }
    // Raw content stream: PDF operators, not plain text.
    let content = page.get_page_content_data()?;
    println!("content: {}", String::from_utf8_lossy(content.as_ref()));
}
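A /MediaBox array holds four numbers, two diagonally opposite corners of the page in default user space units (1/72 inch). As a rough sketch of how those numbers could be turned into a page rectangle, assuming they have already been read out of the array as f64 values (that step is left out here, and `PageBox` is a hypothetical helper, not a qpdf-rs type):

```rust
/// Page rectangle derived from the four /MediaBox numbers.
#[derive(Debug, PartialEq)]
struct PageBox {
    llx: f64, // lower-left x
    lly: f64, // lower-left y
    urx: f64, // upper-right x
    ury: f64, // upper-right y
}

impl PageBox {
    /// Builds a normalized box from exactly four values; returns None otherwise.
    /// The corners may be given in either order, so min/max sorts them.
    fn from_values(v: &[f64]) -> Option<PageBox> {
        if v.len() != 4 {
            return None;
        }
        Some(PageBox {
            llx: v[0].min(v[2]),
            lly: v[1].min(v[3]),
            urx: v[0].max(v[2]),
            ury: v[1].max(v[3]),
        })
    }

    /// Page width in points.
    fn width(&self) -> f64 {
        self.urx - self.llx
    }

    /// Page height in points.
    fn height(&self) -> f64 {
        self.ury - self.lly
    }
}
```

Note that this describes the page as a whole. The bounding box of an individual piece of text is not stored anywhere; it has to be derived from the text positioning operators and font metrics in the content stream.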
Hello! I am interested in using qpdf-rs to extract the text and its bounding box information from a PDF.
I'm not very familiar with the PDF spec, so perhaps this already has me at a disadvantage.
Would you mind showing some example code for how I can achieve this? Thank you very much!