`extract_text` inserts newlines where it shouldn't

J-F-Liu / lopdf

A Rust library for PDF document manipulation.

MIT License


A related problem: in some places, lopdf should insert whitespace (a single space?) even though the PDF doesn't explicitly contain one.

An example probably demonstrates this best:

use lopdf::Document;

#[test]
fn test_extract() {
    // Load the example PDF and extract the text of page 4.
    let doc = Document::load("extract_text_dkp.pdf").unwrap();
    let text = doc.extract_text(&[4]).unwrap();
    println!("{}", text);
}

prints

4InhaltSeiteSozialismusvorstellungen:Sozialismus - die historische Alternativezum Kapitalismus5Als Arbeits- und Diskussionsgrundlagebeschlossene Abänderungs- oderErgänzungsanträge und beschlosseneAnträge13

This is what the page the text is extracted from looks like:

(Example PDF: extract_text_dkp.pdf, taken from https://github.com/J-F-Liu/lopdf/issues/217#issuecomment-1457367413)
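
To make the request more concrete, here is a rough sketch of the kind of heuristic that could decide where a space belongs. It is not lopdf's implementation and uses no lopdf API: the `Fragment` type, the `join_fragments` function, and the 0.2 / 0.5 thresholds are all hypothetical, and it assumes the extractor already knows each text run's position and font size.

```rust
/// One run of text as placed on the page.
/// Hypothetical type for illustration only; not part of lopdf.
#[derive(Debug, Clone)]
struct Fragment {
    text: String,
    x_start: f32,   // left edge of the run
    x_end: f32,     // right edge of the run
    y: f32,         // baseline position
    font_size: f32, // used to scale the thresholds below
}

/// Joins fragments (assumed to be in reading order) and inserts a single
/// space where a line change or a wide horizontal gap suggests a word
/// boundary. The threshold factors are arbitrary guesses.
fn join_fragments(fragments: &[Fragment]) -> String {
    let mut out = String::new();
    let mut prev: Option<&Fragment> = None;

    for frag in fragments {
        if let Some(p) = prev {
            // A baseline shift of more than half the font size counts as a new line.
            let new_line = (frag.y - p.y).abs() > 0.5 * p.font_size;
            // A horizontal gap wider than 20% of the font size counts as a word gap.
            let wide_gap = frag.x_start - p.x_end > 0.2 * p.font_size;
            // Don't double up if the PDF already supplied whitespace.
            let has_space = out.ends_with(char::is_whitespace)
                || frag.text.starts_with(char::is_whitespace);

            if (new_line || wide_gap) && !has_space {
                out.push(' ');
            }
        }
        out.push_str(&frag.text);
        prev = Some(frag);
    }
    out
}
```

With fragments for "Inhalt" and "Seite" far apart on one line, they would be joined with a space, while two runs that touch each other (for example a word split across two text operations) would stay glued together. Real PDFs would need more care, such as rotated text, multiple columns, and fonts with unusual metrics.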


`extract_text` inserts newlines where it shouldn't #292