Add PdfParagraph to allow for more natural processing of multi-line text.

ajrcarey / pdfium-render

A high-level idiomatic Rust wrapper around Pdfium, the C++ PDF library used by the Google Chromium project.

https://crates.io/crates/pdfium-render

Add PdfParagraph to allow for more natural processing of multi-line text. #29

Open ajrcarey opened 2 years ago

ajrcarey commented 2 years ago

Follow-on from #17, #22, #25. Add a PdfParagraph object that allows for easier handling of multi-line text with embedded character formatting changes.

Ideally, it would be possible to generate a PdfParagraph from an existing set of PdfPageTextObject objects, each one containing a formatted fragment of a paragraph.

ajrcarey commented 2 years ago

PdfParagraph object construction under way. Hidden from crate prelude for now.

russellwmy commented 1 year ago

I have a PDF file that renders each character with separate Td and Tj operations. Some pages take more than 10 seconds to extract text. I tested page-level extraction and it takes under a second. Do you think PdfParagraph would solve this problem?

ajrcarey commented 1 year ago

Hi @russellwmy, if you are just looking to extract text, then no, PdfParagraph will not be useful to you. The goal of PdfParagraph is to make it easier to work with the formatting and justification of multiple text objects. If you just want to extract the raw text, then page-level extraction via PdfPage::text()?.all() is the fastest way. See https://github.com/ajrcarey/pdfium-render/blob/master/examples/text_extract.rs for an example.

russellwmy commented 1 year ago

@ajrcarey Good to know. I have found a way to speed it up now.

First, call PdfPage::text()?.all().
Then iterate over the text objects, map the font, location, etc., and make use of PdfPageText.for_object(text_object). This way, we don't need to load the page again and again for each object. It is 100x faster. :)

Just an idea: do you think this could be implemented internally?

ajrcarey commented 1 year ago

Every time you create PdfPageText, Pdfium analyses all the text on the page. So you're right, the most efficient way is to create PdfPageText once, then reuse it:

let page_text = page.text()?; // this creates PdfPageText once

// now can use page_text.for_object(...) in an iterator, etc.

However, I would expect this to still be slower than page_text.all(). Calling all() avoids the need to iterate.
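The cost difference described above can be illustrated with a small, self-contained model. This is only a sketch: MockPage, MockPageText, extract_slow and extract_fast are hypothetical stand-ins invented for illustration, not pdfium-render types, and the counter simply assumes that each call to text() costs one full analysis of the page.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a page; counts how often its text is analysed.
struct MockPage {
    objects: Vec<String>,
    analyses: Cell<usize>,
}

/// Hypothetical stand-in for PdfPageText: the result of one full analysis.
struct MockPageText<'a> {
    page: &'a MockPage,
}

impl MockPage {
    fn new(objects: &[&str]) -> Self {
        MockPage {
            objects: objects.iter().map(|s| s.to_string()).collect(),
            analyses: Cell::new(0),
        }
    }

    /// Models page.text(): every call pays for a full analysis of the page.
    fn text(&self) -> MockPageText<'_> {
        self.analyses.set(self.analyses.get() + 1);
        MockPageText { page: self }
    }
}

impl<'a> MockPageText<'a> {
    /// Models for_object(): returns one object's text without re-analysing.
    fn for_object(&self, index: usize) -> &str {
        &self.page.objects[index]
    }

    /// Models all(): returns the text of the whole page in one call.
    fn all(&self) -> String {
        self.page.objects.concat()
    }
}

/// Slow pattern: a fresh text handle (and a fresh analysis) per object.
fn extract_slow(page: &MockPage) -> Vec<String> {
    (0..page.objects.len())
        .map(|i| page.text().for_object(i).to_string())
        .collect()
}

/// Fast pattern: one text handle, reused for every object.
fn extract_fast(page: &MockPage) -> Vec<String> {
    let page_text = page.text();
    (0..page.objects.len())
        .map(|i| page_text.for_object(i).to_string())
        .collect()
}
```

In this model, a page with five single-character objects triggers five analyses under extract_slow and one under extract_fast, while both return the same fragments. The real speed-up depends on how expensive Pdfium's analysis is for a given page.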

ajrcarey commented 1 year ago

Made improvements to segment detection. Implemented a prototype lines-to-paragraphs accumulator. Made PdfParagraph public in response to #121, although it isn't part of the crate prelude.

ajrcarey commented 4 months ago

Consider also adding handling of tables as suggested in #149.

ajrcarey commented 3 months ago

Moved PdfParagraph behind a new feature flag, paragraph. The change will take effect in release 0.8.23.
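Projects that use PdfParagraph would presumably need to opt in to the flag in their dependency entry. A minimal sketch of a Cargo.toml fragment, assuming the feature is named exactly paragraph as stated above and is not enabled by default:

```toml
[dependencies]
# Assumption: the `paragraph` feature gates PdfParagraph from 0.8.23 onwards.
pdfium-render = { version = "0.8.23", features = ["paragraph"] }
```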

ziimakc commented 3 hours ago

@ajrcarey Thank you for this library and this super useful feature. Currently it seems like something is broken in the imports:

use crate::page::PdfPage;
   |            ^^^^
   |            |
   |            unresolved import
   |            help: a similar path exists: `pdf::document::page`

.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdfium-render-0.8.26/src/pdf/document/page/paragraph.rs:635:20
    |
242 | pub struct PdfParagraph<'a> {
    | --------------------------- doesn't satisfy `PdfParagraph<'_>: Sized`
...
635 |         paragraphs.push(Self::paragraph_from_lines(
    |         -----------^^^^ method cannot be called on `Vec<PdfParagraph<'_>>` due to unsatisfied trait bounds