clulab / reach

Reach Biomedical Information Extraction

Unicode issue related to Python integration #662

Closed. bgyori closed this issue 4 years ago.

bgyori commented 4 years ago

Context: One of the ways in which we use Reach is by creating a fat JAR (using sbt assembly), putting it on the CLASSPATH, and then instantiating a Python-Java bridge with the pyjnius package to interact with Reach classes directly.

In the last couple of years, a few changes have come together that now cause a problem when non-ASCII Unicode characters are read using this approach:

1) In clulab/processors, the original Unicode text is now preserved (it used to be replaced by ASCII before).

2) pyjnius now expects UTF-16 encoding when transforming Java strings into Python strings (it used UTF-8 before).

3) The default JSON outputter in Reach encodes the string in UTF-8 (as far as I understand).

Together, this results in a situation where, when the fries output JSON (which contains Unicode characters that appeared in the original text sent for reading) is crossing the Java-Python bridge, we get an error like:

UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 20512-20513: illegal UTF-16 surrogate
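For a concrete, self-contained illustration of what that error means (plain Python, not the Reach pipeline itself): a high surrogate that is not immediately followed by its low surrogate makes the byte stream invalid UTF-16.

bad = b"\x3d\xd8\x20\x00"   # U+D83D (a high surrogate) followed by a space, little-endian
bad.decode("utf-16-le")     # raises UnicodeDecodeError: ... illegal UTF-16 surrogate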

I am wondering if a solution to this could be to add another argument to the toJSON method of FriesOutput to select a different encoding, and then to add a line after this: https://github.com/clulab/reach/blob/master/export/src/main/scala/org/clulab/reach/export/fries/FriesOutput.scala#L87, where the chosen encoding is applied to the string before returning. This encoding argument could then be exposed at the level of the ApiRuler so that we can set it when calling it.

MihaiSurdeanu commented 4 years ago

I think this makes sense. But Unicode encoding has always confused me. @kwalcock: do you agree with this solution? If so, I can implement it quickly.

kwalcock commented 4 years ago

I saw this and will get to it soon.

kwalcock commented 4 years ago

I have some doubts about that explanation for line 87, writeJsonToString(uniModel). Java strings are UTF-16 (modulo some optimizations which are supposed to be hidden), and I don't think that writing to a string would involve any encoding to UTF-8. It looks like a standard Java StringWriter is being used. Stranger things have happened, though. The error message complains about a surrogate. It's my understanding (https://unicodebook.readthedocs.io/unicode_encodings.html) that those are only used if something like a UTF-32 character is stuffed into a UTF-16 string. I wonder if that is not being handled correctly. It could be that the input has some lone surrogate that doesn't bother Java but does bother Python, or that something in our code is stripping away half of the pair. Do you have the input data that I can look at?

In UTF-16, characters in the ranges U+0000—U+D7FF and U+E000—U+FFFD are stored as a single 16-bit unit. Non-BMP characters (range U+10000—U+10FFFF) are stored as “surrogate pairs”, two 16-bit units: a high surrogate (in range U+D800—U+DBFF) followed by a low surrogate (in range U+DC00—U+DFFF). A lone surrogate character is invalid in UTF-16; surrogate characters are always written as pairs (high followed by low).
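A quick way to see those ranges with plain Python (any non-BMP character works; U+1F600 here is just an arbitrary example):

units = chr(0x1F600).encode("utf-16-be")   # non-BMP, so it takes two 16-bit units
hi = int.from_bytes(units[:2], "big")      # 0xD83D, inside the high-surrogate range U+D800-U+DBFF
lo = int.from_bytes(units[2:], "big")      # 0xDE00, inside the low-surrogate range U+DC00-U+DFFF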

MihaiSurdeanu commented 4 years ago

Thanks @kwalcock !

bgyori commented 4 years ago

Hi @kwalcock, thanks for looking into this. I am also confused, so my best attempt at an analysis above may not be correct; sorry. You are right about the character range: in one of our unit tests we deliberately use a character in the U+10000-U+10FFFF range (we happen to use U+1F4A9 in the test, but you can use a nicer emoji like U+1F600 too :) ). Here is the pointer to the specific test: https://github.com/bgyori/indra/blob/master/indra/tests/test_reach.py#L274. The minimal code to reproduce the sequence of events in Python (assuming the Reach fat JAR is on the CLASSPATH) is:

from jnius import autoclass
# Load the Reach entry point from the fat JAR on the CLASSPATH
api_ruler = autoclass('org.clulab.reach.export.apis.ApiRuler')
# Annotate a sentence containing a non-BMP character (U+1F4A9), requesting fries output
result_map = api_ruler.annotateText('MEK1 binds ERK2\U0001F4A9', 'fries')
json_str = result_map.get('result')

It is at this last line that the Unicode error is thrown. We didn't have this issue with old versions of REACH (as a reference, I confirmed with v1.3.3) because the relevant part of clulab/processors didn't propagate Unicode characters into the output. The reason it is important for our pipeline to be robust to this is that odd characters like these show up in content from time to time. Thanks again for helping with this!

kwalcock commented 4 years ago

Thanks for providing the excellent example. That UTF-32 character, \U0001F4A9, translates to two UTF-16 code units, \ud83d \udca9. (The tool at https://www.branah.com/unicode-converter seems to work reliably.) These aren't affected by ScienceUtils and pass straight through it, despite the variable name textWithoutUnicode in Preprocess.preprocess; it is really textWithoutRecognizedUnicode, and these characters aren't recognized. (If one added replaceUnknownUnicodeWithSpaces afterwards, it might solve the problem.) After that they go through a Lexer (which I have to check) and finally into a CommonTokenStream. By this time, the two code units, which need to stay together as a pair, are separated by a space, and you get the justified complaint. I don't yet see where the space is added, but I needed to dump this information first.
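That translation is easy to double-check with plain Python (outside of Reach):

"\U0001F4A9".encode("utf-16-be").hex()   # 'd83ddca9', i.e. the pair \ud83d followed by \udca9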

MihaiSurdeanu commented 4 years ago

Thanks @kwalcock ! So... what is the solution then?

kwalcock commented 4 years ago

It looks like that wasn't quite right. The spaces aren't inserted until the text of a sentence is reconstructed for JSON output or other uses. The problem is that the two code units are placed into separate tokens and words, and in some instances those tokens/words are concatenated back together with spaces as separators. Once they have been separated, there's a reasonable chance that something like that will happen. It would be better if the tokenizer didn't separate them, but I doubt that the tokenizer is easy to change. It may be that the characters should be overwritten with spaces by adding that replaceUnknownUnicodeWithSpaces call. I can test that, but it may not be an acceptable solution.
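As a rough illustration of that failure mode in plain Python (not the actual tokenizer): once the two halves of the pair land in different tokens and are re-joined with a space, each half is a lone surrogate, which a strict UTF-16 codec refuses to handle.

high, low = "\ud83d", "\udca9"            # the two halves of U+1F4A9
(high + " " + low).encode("utf-16-le")    # raises UnicodeEncodeError: surrogates not allowed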

MihaiSurdeanu commented 4 years ago

The tokenizer is easy to change :)

This is the ANTLR file that contains the tokenizer's lexer: https://github.com/clulab/processors/blob/master/main/src/main/java/org/clulab/processors/clu/tokenizer/OpenDomainLexer.g

I think we need to:

1) Add a rule called SEQ_OF_UNICODES at this position: ... If you tell me what should be in there, I can add it.

2) Install antlr4. For convenience, I am attaching it here as well: antlr4.zip

3) Run scripts/antlr_tokenizer to regenerate the corresponding Java files.

kwalcock commented 4 years ago

@MihaiSurdeanu, something you said this morning made me wonder whether this was still an issue. Are these things being replaced or just moved or something else?

MihaiSurdeanu commented 4 years ago

Yes, please wait. Things are being moved around.

bgyori commented 4 years ago

I confirmed that this is still an issue, and since the restructuring is done, perhaps this is a good time to revisit it. Let me know if I can help!

kwalcock commented 4 years ago

Thank you for checking! I'll wait on word from Mihai in case something is still in flux or plans have changed, etc.

MihaiSurdeanu commented 4 years ago

Things are stable again. Ok to work on this! Thanks @bgyori and @kwalcock !

bgyori commented 4 years ago

Fixed by https://github.com/clulab/processors/pull/379