Open ajrcarey opened 2 years ago
PdfParagraph object construction under way. Hidden from crate prelude for now.
I have a PDF file which renders each char with Td and Tj operations. Some pages take more than 10 seconds to extract text. I tested with page-level extraction; it is under a second. Do you think PdfParagraph would solve this problem?
Hi @russellwmy, if you are just looking to extract text, then no, PdfParagraph will not be useful to you. The goal of PdfParagraph is to make it easier to work with the formatting and justification of multiple text objects. If you just want to extract the raw text, then page-level extraction via PdfPage::text()?.all() is the fastest way. See https://github.com/ajrcarey/pdfium-render/blob/master/examples/text_extract.rs for an example.
@ajrcarey Good to know. I found a way to speed it up now: use PdfPage::text()?.all() and, for each TextObject, map the font, location, etc. and make use of PdfPageText.for_object(text_object). In this way, we don't need to load the page again and again for each object. It is 100x faster. :)
Just an idea: do you think this could be implemented internally?
Every time you create PdfPageText, Pdfium analyses all the text on the page. So you're right, the most efficient way is to create PdfPageText once, then reuse it:
let page_text = page.text()?; // this creates PdfPageText once
// now can use page_text.for_object(...) in an iterator, etc.
However, I would expect this to still be slower than page_text.all(). Calling all() avoids the need to iterate.
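To make the cost difference concrete, here is a minimal sketch of the "create once, then reuse" pattern. It does not use pdfium-render at all: `Page`, `PageText`, `extract_slow`, `extract_fast` and the `analyses` counter are hypothetical stand-ins, written on the assumption that each call to `text()` triggers one full analysis of the page.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a page; counts how often its text is analysed.
struct Page {
    objects: Vec<String>,
    analyses: Cell<usize>,
}

/// Hypothetical stand-in for the per-page text handle.
struct PageText<'a> {
    page: &'a Page,
}

impl Page {
    fn new(objects: &[&str]) -> Self {
        Page {
            objects: objects.iter().map(|s| s.to_string()).collect(),
            analyses: Cell::new(0),
        }
    }

    /// Models the expensive step: every call re-analyses the whole page.
    fn text(&self) -> PageText<'_> {
        self.analyses.set(self.analyses.get() + 1);
        PageText { page: self }
    }
}

impl<'a> PageText<'a> {
    /// Returns the text of a single object, looked up by index.
    fn for_object(&self, index: usize) -> &str {
        &self.page.objects[index]
    }

    /// Returns all text on the page in one call, with no per-object loop.
    fn all(&self) -> String {
        self.page.objects.concat()
    }
}

/// Slow pattern: a fresh handle (and a fresh analysis) for every object.
fn extract_slow(page: &Page) -> Vec<String> {
    (0..page.objects.len())
        .map(|i| page.text().for_object(i).to_string())
        .collect()
}

/// Fast pattern: one handle, created once and reused for every object.
fn extract_fast(page: &Page) -> Vec<String> {
    let page_text = page.text(); // analysed once
    (0..page.objects.len())
        .map(|i| page_text.for_object(i).to_string())
        .collect()
}

fn main() {
    let slow_page = Page::new(&["H", "e", "l", "l", "o"]);
    let fast_page = Page::new(&["H", "e", "l", "l", "o"]);

    // Both patterns return the same text; only the analysis count differs.
    assert_eq!(extract_slow(&slow_page), extract_fast(&fast_page));
    println!("slow: {} analyses", slow_page.analyses.get());
    println!("fast: {} analyses", fast_page.analyses.get());
    println!("all(): {}", fast_page.text().all());
}
```

In this model the slow pattern performs one analysis per object, while the fast pattern performs a single analysis regardless of how many objects the page holds. That matches the reported behaviour on pages that draw each character as its own text object.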
Made improvements to segment detection. Implemented prototype lines-to-paragraphs accumulator. Made PdfParagraph public in response to #121, although it isn't part of the crate prelude.
Consider also adding handling of tables as suggested in #149.
Moved PdfParagraph behind new feature flag paragraph. The change will take effect in release 0.8.23.
@ajrcarey Thank you for this library and this super useful feature. Currently it seems like something is broken in the imports:
use crate::page::PdfPage;
| ^^^^
| |
| unresolved import
| help: a similar path exists: `pdf::document::page`
.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdfium-render-0.8.26/src/pdf/document/page/paragraph.rs:635:20
|
242 | pub struct PdfParagraph<'a> {
| --------------------------- doesn't satisfy `PdfParagraph<'_>: Sized`
...
635 | paragraphs.push(Self::paragraph_from_lines(
| -----------^^^^ method cannot be called on `Vec<PdfParagraph<'_>>` due to unsatisfied trait bounds
Follow-on from #17, #22, #25. Add a PdfParagraph object that allows for easier handling of multi-line text with embedded character formatting changes. Ideally, it would be possible to generate a PdfParagraph from an existing set of PdfPageTextObject objects, each one containing a formatted fragment of a paragraph.