WorldModelers / DART

Two Six Labs Data Acquisition & Reasoning Toolkit

PDF extraction moves some within-sentence text to the end of the sentence #9

Open azamanian opened 5 years ago

azamanian commented 5 years ago

@reynoldsm88

This might be extremely rare, but just in case it happens more frequently than I think, I'll create an issue to record it.

The pdf of document 202f5c6aab38cd5b01fcb7a66206b1b5 has this:

[Screenshot: excerpt from document 202f5c6aab38cd5b01fcb7a66206b1b5]

Strangely, the converter put the "424" in this sentence after the period at the end of the sentence. It may be treating "424" as a page number, since it sits in the lower left-hand corner of a section of the PDF.

pdf source:

http://www.reachresourcecentre.info/system/files/resource-documents/reach_ssd_factsheet_port_monitoring_nyal_november_2017_6.pdf
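
In case it helps gauge how often this happens, here is a rough sketch of a check that could be run over extracted text. It is not part of DART; the function name `find_stray_numbers` and the regex are my own assumptions, and the sample sentence is invented for illustration rather than taken from the document. The idea is to flag a bare number that sits between a sentence-ending period and the start of the next sentence (or the end of the text), which is where the "424" ended up.

```python
import re

# Hypothetical heuristic (not DART code): a bare number that directly follows
# a sentence-ending period and is itself followed by a capitalized word or the
# end of the text is a candidate for displaced within-sentence text.
STRAY_NUMBER = re.compile(r"(?<=[a-z]\.)\s+(\d+)(?=\s+[A-Z]|\s*$)")


def find_stray_numbers(text):
    """Return (offset, token) pairs for numbers that trail a sentence."""
    return [(m.start(1), m.group(1)) for m in STRAY_NUMBER.finditer(text)]


if __name__ == "__main__":
    # Invented sample text, shaped like the symptom described in this issue.
    sample = "The team recorded individuals crossing. 424 Most were women."
    for offset, token in find_stray_numbers(sample):
        print(f"possible displaced number {token!r} at offset {offset}")
```

This will also flag legitimate sentences that begin with a number, so the hits would need a manual look. It would not catch a case where only part of a token is moved.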

azamanian commented 5 years ago

Something similar is happening in document 3313b39f63091625e6ed544b6bae3f61:

[Screenshot: excerpt from document 3313b39f63091625e6ed544b6bae3f61]

The "1" in "1-15 September" is moved to the end of the line.