WorldModelers / DART

Two Six Labs Data Acquisition & Reasoning Toolkit

PDF extraction puts text from insets within sentences at page breaks #6

Open azamanian opened 5 years ago

azamanian commented 5 years ago

@reynoldsm88

In the PDF for document 0038a82f8ceeef0999277b53c1c98248, we see:

[image not preserved]

There is a sentence that spans the break between pages 6 and 7:

"...learning and teaching environment, use of the..."

In the CDR file, this sentence is broken up by the text of the inset "Community volunteers demonstrating water treatment", and also by a double carriage return (which might be a separate issue).

The PDF source is here:

https://www.unicef.org/appeals/files/UNICEF_South_Sudan_Humanitarian_SitRep_19_May_2016.pdf
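
For anyone trying to reproduce this, here is a minimal sketch of a check that separates the two symptoms described above. It is not DART code: the function names are hypothetical, and the sample strings are made-up stand-ins for the extracted text (the exact position of the page break within the sentence is an assumption), not the real contents of the CDR file. The idea is to collapse whitespace first, so that a double carriage return on its own does not count as a failure, and then test whether the sentence fragment is still contiguous.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including repeated line breaks) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def phrase_is_contiguous(extracted_text: str, phrase: str) -> bool:
    """Return True if `phrase` appears uninterrupted once whitespace is normalized."""
    return normalize_whitespace(phrase) in normalize_whitespace(extracted_text)


# The sentence fragment quoted in this issue.
phrase = "learning and teaching environment, use of the"

# Hypothetical extracted text: inset caption injected at the page break.
# ASSUMPTION: the break is placed after the comma purely for illustration.
broken = (
    "learning and teaching environment,\r\n\r\n"
    "Community volunteers demonstrating water treatment\r\n\r\n"
    "use of the"
)

# Hypothetical extracted text: only the double carriage return, no inset text.
only_double_cr = "learning and teaching environment,\r\n\r\nuse of the"

print(phrase_is_contiguous(broken, phrase))
print(phrase_is_contiguous(only_double_cr, phrase))
```

With this check, text that only has the extra carriage returns still passes, while text with the inset caption injected fails, which may help confirm whether the two symptoms are separate issues.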