WorldModelers / DART

Two Six Labs Data Acquisition & Reasoning Toolkit

PDF extraction introducing stray double carriage returns of unknown cause #8

Open azamanian opened 5 years ago

azamanian commented 5 years ago

@reynoldsm88

Any double carriage return introduces a sentence break during information extraction, so a double carriage return in the middle of a sentence is quite destructive. Sometimes the cause is obvious, but I also see them in seemingly random places. For instance, in the PDF of document 1f5db65f2b3b158f8b3f0ae53f7c508c:

image

The converter is introducing a double carriage return between "and" and "Nutrition Teams". Other line breaks in this bullet point and in similar bullet points do not typically cause double carriage returns, although there are other stray ones, such as after "WFP staff is working alongside NDRMC staff in" in the same document.
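One rough way to locate breaks like these in the extracted text is to flag every blank-line break whose preceding text does not end in sentence-final punctuation. The sketch below is only an illustration and not part of DART: the function name `find_stray_breaks`, the punctuation set, and the sample string are all assumptions made for this example.

```python
import re

# Assumption: a break is "suspicious" when the text just before it
# does not end in sentence-final punctuation.
_TERMINAL = (".", "!", "?", ":", ";")


def find_stray_breaks(text, context=30):
    """Return (before, after) snippets around each suspicious blank-line break."""
    hits = []
    # Match two or more consecutive line endings (LF or CRLF).
    for match in re.finditer(r"(?:\r?\n){2,}", text):
        before = text[:match.start()].rstrip()
        after = text[match.end():].lstrip()
        if before and after and not before.endswith(_TERMINAL):
            # Keep only the line on each side of the break, trimmed to `context` chars.
            last_line = before.rsplit("\n", 1)[-1][-context:]
            next_line = after.split("\n", 1)[0][:context]
            hits.append((last_line, next_line))
    return hits


# Made-up sample text, loosely modeled on the break described above.
sample = "Health and\n\nNutrition Teams are deployed.\n\nNext paragraph."
for before, after in find_stray_breaks(sample):
    print(repr(before), "|", repr(after))
```

This will also flag legitimate breaks after headings or list items that lack a final period, so the output is a list of candidates to review rather than a list of confirmed errors.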

PDF source:

https://documents.wfp.org/stellent/groups/Public/documents/ep/WFP284788.pdf?_ga=2.243684229.1030149860.1553624300-1022356052.1547047485

azamanian commented 5 years ago

There is a similar line break in document 1fd62fe4860992606c4e7f0af407bc05 after "displaced since", but not after "refugees in".

image

azamanian commented 5 years ago

After looking at another example (34296e8b8a7d5e53a5522fc6ae22fcf0), I think the converter is using a heuristic where it looks at the next character after an end of line in certain types of sections. If that next character is a number or a capital letter, it inserts a double carriage return.
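To make the suspected behavior concrete, here is a minimal sketch of that heuristic. It is a guess at what the converter might be doing, not its actual code: the function name `apply_suspected_heuristic` and the sample lines are hypothetical, and the "certain types of sections" condition is left out because it is not known.

```python
def apply_suspected_heuristic(lines):
    """Join lines the way the converter is suspected to.

    A blank line (double line break) is inserted whenever the next line
    starts with a digit or a capital letter; otherwise a single line break.
    """
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i + 1 < len(lines):
            first = lines[i + 1].lstrip()[:1]
            starts_new_block = first.isdigit() or first.isupper()
            out.append("\n\n" if starts_new_block else "\n")
    return "".join(out)


# Made-up lines: the second starts with a capital, the third does not.
lines = [
    "supporting the Health and",
    "Nutrition Teams in the field",
    "and other partners.",
]
print(apply_suspected_heuristic(lines))
```

If the hypothesis holds, this would explain why a break appears before "Nutrition Teams" but not before lines that continue in lowercase.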