Closed reynoldsm88 closed 5 years ago
After this gets fixed, document 21359c9065acfe9413e49b962e8a581f is worth a look. Most of the spaces in the document look like they're some unicode character that the converter doesn't know how to deal with.
@kwalcock @azamanian Hey, I apologize for not getting back to you on this. We got caught up in preparing for the PI meeting last week. This is on our radar; I will update you when we fit it into our schedule and start working on it.
This doesn't seem to have been handled yet.
@kwalcock We are still looking into this issue.
In the CdrDocument JSON, would you prefer the unicode character to appear as the actual character or as an escaped code?
For example:
"text": "★"
"text": \u2605"
I have a slight preference for the former, assuming that the data is UTF-8 encoded, because it is more readily recognized by a human (with a reasonable editor).
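Both forms decode to the same string; the difference is only in how the JSON is serialized. A small sketch of the distinction (the `"text"` field name follows the example above; the behavior shown is standard `json` module behavior):

```python
import json

# The escaped form (\u2605) and the literal UTF-8 character decode identically.
escaped = json.loads(r'{"text": "\u2605"}')
literal = json.loads('{"text": "★"}')
assert escaped["text"] == literal["text"] == "★"

# json.dumps escapes non-ASCII by default; ensure_ascii=False keeps the
# human-readable character, which requires the output to be UTF-8 encoded.
print(json.dumps(literal))                      # {"text": "\u2605"}
print(json.dumps(literal, ensure_ascii=False))  # {"text": "★"}
```

So writing the raw character just means serializing with escaping turned off and encoding the file as UTF-8.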
@kwalcock @azamanian We believe we found the root cause of this and have patched our system. @johnhungerford is in the process of reprocessing the document. We will update the zip exports and Elasticsearch containers once this is finished.
@johnhungerford has pushed the latest document set. Before the meeting on Weds 10/9/2019, can we confirm that there are no more unicode issues?
I'll check in the next couple of hours.
It looks good to me. Some 41,922 question marks seem to have been replaced, and the only ones I can find now follow actual questions. Thanks!
Awesome, thanks @kwalcock. I'm going to close this issue. If we run into something else, feel free to open a new one.
In general, search the extracted_text field of the output of an Elasticsearch query for a question mark. Question marks are often substituted for a unicode character, often a smart quote or an ñ. Examples:
This wasn't the case with almost exactly the same code when run against the Jataware version.
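The check described above could be sketched roughly as follows. This is a hypothetical heuristic, not part of the original pipeline: it flags a `?` that sits inside a word (a likely substituted smart apostrophe or ñ) or that directly precedes a letter (a likely substituted opening smart quote), while leaving a `?` at the end of a word alone, since that is usually a real question.

```python
import re

# Heuristic: '?' embedded inside a word (e.g. "Se?or", "don?t") or a '?'
# after whitespace and before a letter (e.g. ' ?hello') is suspicious --
# it likely replaced a unicode character such as a smart quote or ñ.
SUSPICIOUS = re.compile(r"\w\?\w|\s\?\w")

def find_suspicious(extracted_text):
    """Return the small context spans around question marks that look
    like bad unicode substitutions in an extracted_text value."""
    return [m.group(0) for m in SUSPICIOUS.finditer(extracted_text)]

sample = "Se?or Garc?a said ?hello? to us. Is it true?"
print(find_suspicious(sample))  # ['e?o', 'c?a', ' ?h']
```

A legitimate question mark (as in "Is it true?") follows a word character and is followed by whitespace or end-of-text, so the pattern skips it; the exact regex would need tuning against real documents.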