Open Heinenen opened 3 months ago
Another related problem: in some places, lopdf should insert whitespace (a single space?) where the PDF does not explicitly contain any.
An example probably demonstrates this best:
use lopdf::Document;

#[test]
fn test_extract() {
    // Load the example PDF and extract the text of page 4.
    let doc = Document::load("extract_text_dkp.pdf").unwrap();
    let text = doc.extract_text(&[4]).unwrap();
    println!("{}", text);
}
prints
4InhaltSeiteSozialismusvorstellungen:Sozialismus - die historische Alternativezum Kapitalismus5Als Arbeits- und Diskussionsgrundlagebeschlossene Abänderungs- oderErgänzungsanträge und beschlosseneAnträge13
This is what the page that the text is extracted from looks like:
(Example PDF: extract_text_dkp.pdf, taken from https://github.com/J-F-Liu/lopdf/issues/217#issuecomment-1457367413)
Continuing from the discussion https://github.com/J-F-Liu/lopdf/issues/125#issuecomment-2235840244.
The responsible code is found at https://github.com/J-F-Liu/lopdf/blob/master/src/parser_aux.rs#L94.
Other PDF viewers handle this with heuristics; we can see what pdf.js does in https://github.com/mozilla/pdf.js/blob/341a0b6d477d2909fcb14bcbfdf0d2fd37406cb0/src/core/evaluator.js#L2966. The crux of it: if the x- or y-coordinate changes by more than a certain threshold (which indicates a new column or a new line), a newline is inserted.
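To make the idea concrete, here is a rough sketch of such a heuristic in Rust. It is not lopdf's or pdf.js's actual code: the `TextRun` type, the `join_runs` function, and all three threshold values are made up for illustration, and it assumes each piece of text is already available with its position and width. Inserting a single space for small horizontal gaps is my own addition on top of the newline rule described above.

```rust
/// A positioned piece of text. Hypothetical type, not part of lopdf.
#[derive(Debug, Clone)]
struct TextRun {
    x: f32,     // x-coordinate where the run starts
    y: f32,     // y-coordinate of the baseline
    width: f32, // horizontal advance of the run
    text: String,
}

// Assumed thresholds in text-space units; real values would need tuning
// (pdf.js derives its limits from the font size instead of constants).
const LINE_THRESHOLD: f32 = 2.0; // vertical jump that counts as a new line
const COLUMN_THRESHOLD: f32 = 50.0; // horizontal jump that counts as a new column
const SPACE_THRESHOLD: f32 = 1.0; // horizontal gap that counts as a word break

/// Joins runs in reading order, inserting '\n' or ' ' based on how far
/// the position moved between the end of one run and the start of the next.
fn join_runs(runs: &[TextRun]) -> String {
    let mut out = String::new();
    let mut prev: Option<&TextRun> = None;
    for run in runs {
        if let Some(p) = prev {
            let dy = (run.y - p.y).abs();
            let gap = run.x - (p.x + p.width);
            if dy > LINE_THRESHOLD || gap.abs() > COLUMN_THRESHOLD {
                // Large jump: treat as a new line or a new column.
                out.push('\n');
            } else if gap > SPACE_THRESHOLD {
                // Small gap on the same line: treat as a word break.
                out.push(' ');
            }
        }
        out.push_str(&run.text);
        prev = Some(run);
    }
    out
}
```

With runs for "4" on one line and "Inhalt" followed, after a small gap, by "Seite" on the next line, this would produce "4", a newline, then "Inhalt Seite" instead of the concatenated "4InhaltSeite".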