Open azamanian opened 5 years ago
A similar line break occurs in document 1fd62fe4860992606c4e7f0af407bc05 after "displaced since" but not after "refugees in".
After looking at another example (34296e8b8a7d5e53a5522fc6ae22fcf0), I think the converter is using a heuristic where it looks at the next character after an end of line in certain types of sections. If that next character is a number or a capital letter, it inserts a double carriage return.
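To make the guess concrete, here is a minimal sketch of the heuristic I suspect. This is not the converter's actual code: the function name `apply_suspected_heuristic` is hypothetical, and the sample lines are made up around the quoted fragments "and" / "Nutrition Teams".

```python
def apply_suspected_heuristic(lines):
    """Join extracted lines the way the converter *might* (assumption).

    A blank line (double break) is inserted whenever the next line starts
    with a digit or an uppercase letter; otherwise a single break is used.
    """
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i + 1 < len(lines):
            nxt = lines[i + 1].lstrip()
            if nxt and (nxt[0].isdigit() or nxt[0].isupper()):
                out.append("\n\n")  # suspected paragraph break
            else:
                out.append("\n")    # ordinary line break
    return "".join(out)


# Made-up sample lines; only "and" / "Nutrition Teams" come from the report.
sample = [
    "staff working with health and",
    "Nutrition Teams in the",
    "affected districts",
]
result = apply_suspected_heuristic(sample)
```

If the converter behaves like this, any wrapped line that happens to begin with a proper noun or a number would be split off as a new paragraph, which matches the break before "Nutrition Teams".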
@reynoldsm88
Any double carriage return is going to introduce a sentence break during information extraction, so a double carriage return in the middle of a sentence is quite destructive. Sometimes it's obvious what's causing them, but I also see them in seemingly random places. For instance, in the PDF of document 1f5db65f2b3b158f8b3f0ae53f7c508c:
The converter is introducing a double carriage return between "and" and "Nutrition Teams". Other line breaks in this bullet point and in similar bullet points do not typically cause double carriage returns, although there are other stray ones, such as after "WFP staff is working alongside NDRMC staff in" in the same document.
PDF source:
https://documents.wfp.org/stellent/groups/Public/documents/ep/WFP284788.pdf?_ga=2.243684229.1030149860.1553624300-1022356052.1547047485
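For anyone who wants to hunt for more of these, a rough sketch of how mid-sentence double breaks could be flagged in the converted text. The function name `find_suspect_breaks` and the punctuation list are my own assumptions, and the sample text is made up around the quoted fragment. Headings and bullet labels that legitimately lack end punctuation will show up as false positives.

```python
import re

# Characters treated as a plausible end of sentence (assumption).
SENTENCE_END = (".", "!", "?", ":", ";")


def find_suspect_breaks(text):
    """Return (before, after) snippets around blank-line breaks that
    follow text not ending in sentence-final punctuation."""
    paragraphs = re.split(r"\n\s*\n", text)
    suspects = []
    for prev, nxt in zip(paragraphs, paragraphs[1:]):
        prev_s = prev.rstrip()
        if prev_s and not prev_s.endswith(SENTENCE_END):
            # Keep only a short window on each side for reporting.
            suspects.append((prev_s[-30:], nxt.lstrip()[:30]))
    return suspects


# Made-up continuation after the fragment quoted in the report.
text = (
    "WFP staff is working alongside NDRMC staff in\n\n"
    "several districts.\n\n"
    "A new paragraph."
)
suspects = find_suspect_breaks(text)
```

Here only the first break is flagged, because the text before it ends in "in" rather than in sentence-final punctuation.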