WorldModelers / DART

Two Six Labs Data Acquisition & Reasoning Toolkit

PDF extraction breaks up titles with double carriage returns #11

Open azamanian opened 5 years ago

azamanian commented 5 years ago

@reynoldsm88

This happens for certain types of titles, where a double carriage return gets inserted in the middle of the title. For example, in 8a395db08d4beeeecdca1c4cb83b93de, we have a title:

[image: the title as it appears in the PDF]

The PDF extractor puts a double carriage return between "South Sudan" and "Humanitarian...". It would be helpful to have the full title joined into a single sentence, which the double carriage return prevents.

PDF source:

https://www.unicef.org/appeals/files/UNICEF_South_Sudan_Humanitarian_SitRep_10_March_2016.pdf