PDF extraction does not handle unicode characters correctly

WorldModelers / DART

Two Six Labs Data Acquisition & Reasoning Toolkit


PDF extraction does not handle unicode characters correctly #2

Closed: reynoldsm88 closed this issue 5 years ago

kwalcock commented 5 years ago

In general, search the extracted_text field of the output of an Elasticsearch query for a question mark. A question mark is often substituted for a Unicode character, typically a smart quote or ñ. Examples:

"document_id" : "0038a82f8ceeef0999277b53c1c98248",
UNICEF?s
teacher?s
"document_id" : "007431e54f2eeb11f24b86a0bf8376fb",
?famine-like?
cluster?s
"document_id" : "03801d58d1cfe085876d8bdbedd08e80",
El Ni?o
"document_id" : "046e478dd896f63646be8f84ae00f1f2",
Ni?o

This wasn't the case with almost exactly the same code when run against the Jataware version.
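
A minimal sketch of that search, assuming the query results have already been loaded as Python dictionaries with `document_id` and `extracted_text` keys (the field names come from the examples above; the `find_suspicious` helper and the regular expression are illustrative assumptions, not part of DART). It flags a question mark that is immediately followed by a letter, which is where a substituted smart quote or ñ tends to show up, and ignores question marks that end a sentence.

```python
import re

# A "?" immediately followed by a letter is unlikely to end a real question.
# This heuristic is an assumption: it will not catch a substituted character
# that sits at the very end of a word with no letter after it.
SUSPICIOUS = re.compile(r"\S*\?[A-Za-z]\S*")


def find_suspicious(docs):
    """Yield (document_id, token) for tokens with a likely substituted '?'."""
    for doc in docs:
        for match in SUSPICIOUS.finditer(doc.get("extracted_text", "")):
            yield doc["document_id"], match.group(0)


# Sample records built from the examples in this issue; the surrounding
# words are made up for illustration.
docs = [
    {"document_id": "0038a82f8ceeef0999277b53c1c98248",
     "extracted_text": "UNICEF?s report on the teacher?s role. Is it done?"},
    {"document_id": "03801d58d1cfe085876d8bdbedd08e80",
     "extracted_text": "Effects of El Ni?o"},
]

for doc_id, token in find_suspicious(docs):
    print(doc_id, token)
```

The trailing "Is it done?" is not reported, since its question mark is followed by nothing.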

azamanian commented 5 years ago

After this gets fixed, document 21359c9065acfe9413e49b962e8a581f is worth a look. Most of the spaces in the document look like they're some unicode character that the converter doesn't know how to deal with.
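
One rough way to check that suspicion is to count the characters that Unicode classifies as space separators other than the plain ASCII space. This sketch assumes the unusual spaces are characters such as U+00A0 (no-break space) or U+2009 (thin space), which is a guess and not something confirmed in this thread. It also only helps on text where the original characters are still intact. `unusual_spaces` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def unusual_spaces(text):
    """Count space-separator characters (category Zs) other than U+0020."""
    return Counter(
        f"U+{ord(ch):04X}"
        for ch in text
        if ch != " " and unicodedata.category(ch) == "Zs"
    )


# Made-up sample containing a no-break space and a thin space.
sample = "food\u00a0security in\u2009East Africa"
print(unusual_spaces(sample))
```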

reynoldsm88 commented 5 years ago

@kwalcock @azamanian hey, i apologize for not getting back to you on this. we got caught up in preparing for the PI meeting last week. this is on our radar, and i will update you when we fit it into our schedule and start working on it.

kwalcock commented 5 years ago

This doesn't seem to have been handled yet.

yanzv commented 5 years ago

@kwalcock We are still looking into this issue.
In the CdrDocument JSON, would you prefer the Unicode character to appear as the actual character or as an escaped code?
For example:

"text": "★"
"text": "\u2605"

kwalcock commented 5 years ago

I have a slight preference for the former, assuming that the data is UTF-8 encoded, because it is more readily recognized by a human (with a reasonable editor).
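
For reference, both forms decode to the same string in a JSON parser, so the choice only affects how the raw file reads. A small Python sketch using the standard `json` module shows this (the `text` key mirrors the example above; everything else is illustrative and says nothing about how DART itself serializes documents):

```python
import json

doc = {"text": "\u2605"}  # the same star character as in the example

# Default output is ASCII-only and uses \uXXXX escapes.
escaped = json.dumps(doc)
# ensure_ascii=False keeps the actual character in the output.
literal = json.dumps(doc, ensure_ascii=False)

print(escaped)
print(literal)

# Both serializations parse back to the identical string.
assert json.loads(escaped) == json.loads(literal) == doc

# The literal form needs an encoding that can represent it, such as UTF-8.
literal_bytes = literal.encode("utf-8")
```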

reynoldsm88 commented 5 years ago

@kwalcock @azamanian We believe we found the root issue behind this and have patched our system. @johnhungerford is in the process of reprocessing the document. we will update the zip exports and elasticsearch containers once this is finished

reynoldsm88 commented 5 years ago

@johnhungerford has pushed the latest document set. before the meeting on Weds 10/9/2019 can we confirm that there are no more unicode issues?

kwalcock commented 5 years ago

I'll check in the next couple of hours.

kwalcock commented 5 years ago

It looks good to me. Some 41922 question marks seem to have been replaced and the only ones I can find follow actual questions. Thanks!

reynoldsm88 commented 5 years ago

awesome thanks @kwalcock i'm going to close this issue. if we run into something else feel free to open a new issue