Open Heinenen opened 3 months ago
Another related problem: in some places, lopdf should insert whitespace (a single space?) where the PDF does not explicitly contain any.
An example probably demonstrates this best:
use lopdf::Document;
#[test]
fn test_extract() {
    let doc = Document::load("extract_text_dkp.pdf").unwrap();
    // Extract the text of page 4 only.
    let text = doc.extract_text(&[4]).unwrap();
    println!("{}", text);
}
Output:
4InhaltSeiteSozialismusvorstellungen:Sozialismus - die historische Alternativezum Kapitalismus5Als Arbeits- und Diskussionsgrundlagebeschlossene Abänderungs- oderErgänzungsanträge und beschlosseneAnträge13
This is what the page that the text is extracted from looks like:
(Example PDF: extract_text_dkp.pdf, taken from
Continuing from the discussion
The responsible code is found at
Other PDF viewers handle this with heuristics; we can see what pdf.js does in
The crux of it: if the x- or y-coordinate changes by more than a certain threshold (which indicates a new column or a new line), a newline is inserted.
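To make the heuristic concrete, here is a minimal sketch of the idea, not lopdf's or pdf.js's actual code. The `TextRun` struct, the `join_runs` function, and the single shared `threshold` are all hypothetical names made up for illustration; it assumes each piece of text comes with its start position, end position, and baseline.

```rust
/// Hypothetical positioned text run; lopdf does not expose such a type.
#[derive(Debug, Clone)]
struct TextRun {
    x: f32,     // x-coordinate where the run starts
    y: f32,     // baseline y-coordinate of the run
    end_x: f32, // x-coordinate where the run ends
    text: String,
}

/// Concatenates runs in order, inserting a newline whenever the horizontal
/// gap to the previous run or the vertical jump exceeds `threshold`.
fn join_runs(runs: &[TextRun], threshold: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&TextRun> = None;
    for run in runs {
        if let Some(p) = prev {
            let dx = (run.x - p.end_x).abs(); // gap since the previous run ended
            let dy = (run.y - p.y).abs();     // change of baseline
            if dx > threshold || dy > threshold {
                out.push('\n'); // treated as a new column or a new line
            }
        }
        out.push_str(&run.text);
        prev = Some(run);
    }
    out
}

fn main() {
    // Made-up coordinates, loosely modelled on the table of contents above.
    let runs = vec![
        TextRun { x: 50.0, y: 700.0, end_x: 90.0, text: "Inhalt".into() },
        TextRun { x: 300.0, y: 700.0, end_x: 330.0, text: "Seite".into() },
        TextRun { x: 50.0, y: 680.0, end_x: 200.0, text: "Sozialismusvorstellungen:".into() },
    ];
    println!("{}", join_runs(&runs, 5.0));
}
```

A real implementation would probably need separate thresholds for x and y, scaled by font size, and a smaller threshold below which a single space is inserted instead of a newline.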