J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License

`extract_text` inserts newlines where it shouldn't #292

Open Heinenen opened 3 months ago

Heinenen commented 3 months ago

Continuing from the discussion in https://github.com/J-F-Liu/lopdf/issues/125#issuecomment-2235840244.

The responsible code is found at https://github.com/J-F-Liu/lopdf/blob/master/src/parser_aux.rs#L94.

Other PDF viewers handle this with heuristics; what pdf.js does can be seen in https://github.com/mozilla/pdf.js/blob/341a0b6d477d2909fcb14bcbfdf0d2fd37406cb0/src/core/evaluator.js#L2966. The crux: if the x- or y-coordinate changes by more than a certain threshold (which indicates a new column or a new line), a newline is inserted.
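To make the threshold idea concrete, here is a minimal sketch in plain Rust. It is not lopdf's or pdf.js's actual code: `TextRun`, `join_runs`, and both threshold parameters are hypothetical names, and it assumes the position and width of each text run are already known.

```rust
/// One piece of shown text with its start position and width, in page units.
/// Hypothetical type for illustration; lopdf does not expose it.
#[derive(Debug, Clone)]
struct TextRun {
    x: f32,
    y: f32,
    width: f32,
    text: String,
}

/// Concatenates runs, inserting a newline only when the position jumps:
/// - the baseline (y) moves by more than `line_threshold` (new line), or
/// - the start of the run is further than `column_threshold` away from
///   where the previous run ended (new column).
fn join_runs(runs: &[TextRun], line_threshold: f32, column_threshold: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&TextRun> = None;
    for run in runs {
        if let Some(p) = prev {
            let dy = (run.y - p.y).abs();
            let gap = (run.x - (p.x + p.width)).abs();
            if dy > line_threshold || gap > column_threshold {
                out.push('\n');
            }
        }
        out.push_str(&run.text);
        prev = Some(run);
    }
    out
}
```

With this shape, runs that merely continue along the same baseline are joined without a newline. The threshold values would need tuning, for example relative to the font size.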

Heinenen commented 3 months ago

Another related problem: in some places, lopdf should insert whitespace (a single space?) even though the PDF doesn't explicitly contain any.

An example probably demonstrates this best:

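```rust
use lopdf::Document;

#[test]
fn test_extract() {
    // Load the example PDF attached to this issue.
    let doc = Document::load("extract_text_dkp.pdf").unwrap();
    // Extract the text of page 4 only.
    let text = doc.extract_text(&[4]).unwrap();
    println!("{}", text);
}
```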

prints

4InhaltSeiteSozialismusvorstellungen:Sozialismus - die historische Alternativezum Kapitalismus5Als Arbeits- und Diskussionsgrundlagebeschlossene Abänderungs- oderErgänzungsanträge und beschlosseneAnträge13

This is what the page that the text is extracted from looks like: [image]

(Example PDF: extract_text_dkp.pdf, taken from https://github.com/J-F-Liu/lopdf/issues/217#issuecomment-1457367413)
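The missing-whitespace problem could be approached in a similar way. The sketch below is only an illustration under assumptions: `Span`, `join_with_spaces`, and the two thresholds are hypothetical and not part of lopdf, and it assumes the start and end x-position of each piece of text is available. It inserts a single space when the baseline changes or when there is a horizontal gap between two pieces, unless whitespace is already present at the boundary.

```rust
/// A piece of text with its horizontal extent and baseline, in page units.
/// Hypothetical type for illustration; lopdf does not expose it.
#[derive(Debug, Clone)]
struct Span {
    x_start: f32,
    x_end: f32,
    y: f32,
    text: String,
}

/// Concatenates spans, inserting one space where the layout implies a
/// separation that the text itself does not contain.
fn join_with_spaces(spans: &[Span], gap_threshold: f32, line_threshold: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&Span> = None;
    for span in spans {
        if let Some(p) = prev {
            // The baseline moved: the span starts on a different line.
            let new_line = (span.y - p.y).abs() > line_threshold;
            // The span starts noticeably to the right of where the last one ended.
            let wide_gap = span.x_start - p.x_end > gap_threshold;
            // Avoid doubling whitespace that is already in the text.
            let already_separated = out.ends_with(char::is_whitespace)
                || span.text.starts_with(char::is_whitespace);
            if (new_line || wide_gap) && !already_separated {
                out.push(' ');
            }
        }
        out.push_str(&span.text);
        prev = Some(span);
    }
    out
}
```

Whether a line change should produce a space or a newline is a design choice; this sketch uses a space for both cases to keep the output on one line.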