dasmith / stanford-corenlp-python

Python wrapper for Stanford CoreNLP tools v3.4.1
GNU General Public License v2.0

weird UnicodeDecodeError in StanfordCoreNLP.parse() #19

Open arne-cl opened 9 years ago

arne-cl commented 9 years ago

Hi Dustin,

I just found a really weird error. While corenlp can parse '100 dollars' just fine, '100 yen' causes it to crash.

Python 2.7.3 (default, Feb 27 2014, 19:37:34) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import corenlp
>>> c = corenlp.StanfordCoreNLP()
Loading Models: 5/5
>>> c.parse('100 dollars')
'{"sentences": [{"parsetree": "(ROOT (X (NP (CD 100) (NNS dollars))))", "text": "100 dollars", "dependencies": [["root", "ROOT", "dollars"], ["num", "dollars", "100"]], "words": [["100", {"NormalizedNamedEntityTag": "$100.0", "Lemma": "100", "CharacterOffsetEnd": "3", "PartOfSpeech": "CD", "CharacterOffsetBegin": "0", "NamedEntityTag": "MONEY"}], ["dollars", {"NormalizedNamedEntityTag": "$100.0", "Lemma": "dollar", "CharacterOffsetEnd": "11", "PartOfSpeech": "NNS", "CharacterOffsetBegin": "4", "NamedEntityTag": "MONEY"}]]}]}'

>>> c.parse('100 yen')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/corenlp-3.4.1-py2.7.egg/corenlp.py", line 240, in parse
    response = self._parse(text)
  File "/usr/local/lib/python2.7/dist-packages/corenlp-3.4.1-py2.7.egg/corenlp.py", line 230, in _parse
    raise e
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 169: ordinal not in range(128)

Any ideas?

arne-cl commented 9 years ago

The _parse() function stores the correct CoreNLP result in incoming:

>>> print incoming
100 yen
Sentence #1 (2 tokens):
100 yen
[Text=100 CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=CD Lemma=100 NamedEntityTag=MONEY NormalizedNamedEntityTag=¥100.0] [Text=yen CharacterOffsetBegin=4 CharacterOffsetEnd=7 PartOfSpeech=NNS Lemma=yen NamedEntityTag=MONEY NormalizedNamedEntityTag=¥100.0] 
(ROOT
  (X
    (NP (CD 100) (NNS yen))))

root(ROOT-0, yen-2)
num(yen-2, 100-1)

NLP> 

... but parse_parser_results() can't handle the currency symbol ¥, which incoming contains:

>>> parse_parser_results(incoming)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/usr/local/lib/python2.7/dist-packages/corenlp-3.4.1-py2.7.egg/corenlp.pyc in <module>()
----> 1 parse_parser_results(incoming)

/usr/local/lib/python2.7/dist-packages/corenlp-3.4.1-py2.7.egg/corenlp.pyc in parse_parser_results(text)
     73     results = {"sentences": []}
     74     state = STATE_START
---> 75     for line in text.encode('utf-8').split("\n"):
     76         line = line.strip()
     77 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 169: ordinal not in range(128)
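The traceback fits how Python 2 treats str.encode(): when text is already a byte string, Python 2 first decodes it implicitly with the ASCII codec and only then encodes it, so the first non-ASCII byte (here 0xc2, the lead byte of the UTF-8 encoded ¥) raises a UnicodeDecodeError. Below is a minimal sketch of one way around it, assuming the output of the CoreNLP process arrives as UTF-8 encoded bytes. The helper names to_text() and split_lines() are hypothetical and are not part of corenlp.py, and this is not necessarily what any actual fix in the library does.

```python
def to_text(data, encoding="utf-8"):
    """Return data as a unicode string, decoding byte strings explicitly."""
    if isinstance(data, bytes):  # on Python 2, bytes is an alias for str
        return data.decode(encoding)
    return data


def split_lines(data):
    """Hypothetical stand-in for the loop header in parse_parser_results()."""
    return [line.strip() for line in to_text(data).split("\n")]


# Byte string as it might arrive from the CoreNLP process (assumed UTF-8).
incoming = u"NormalizedNamedEntityTag=\u00a5100.0\nNLP> ".encode("utf-8")

# Reproduce the implicit ASCII decode that Python 2 performs before
# str.encode() on a byte string.
try:
    incoming.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding explicitly as UTF-8 keeps the yen sign intact.
lines = split_lines(incoming)
```

Decoding once at the boundary and working with unicode strings afterwards avoids the implicit ASCII step entirely, at the cost of assuming the encoding of the incoming bytes.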

cuzzo commented 8 years ago

Hey @arne-cl,

Before the text.encode() line, you could try running the string through something like unidecode first. I ran into a similar error, and that fixed the problem for me.
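For illustration, here is a rough standard-library sketch of the same general idea, reducing the text to ASCII before further processing. It is not equivalent to unidecode: it only strips combining marks after Unicode decomposition and silently drops anything without an ASCII base, and the function name ascii_fold() is made up for this example.

```python
import unicodedata


def ascii_fold(text):
    """Lossy ASCII folding: decompose characters, then drop non-ASCII ones."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# An umlaut decomposes into a base letter plus a combining mark,
# so the base letter survives.
folded_umlaut = ascii_fold(u"M\u00fcnchen")

# The yen sign has no ASCII base character and is dropped entirely.
folded_yen = ascii_fold(u"\u00a5100.0")
```

Note that this kind of folding changes the data: the normalized value "¥100.0" would lose its currency symbol, which may or may not be acceptable depending on what the parse result is used for.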

Hope it works for you.

Cheers,

arne-cl commented 8 years ago

Hi @cuzzo,

I actually made a pull request that fixes this issue, but thanks nonetheless for mentioning unidecode. I have been using a similar library called awesome-slugify, which can, for example, transliterate umlauts instead of dropping them.

Best regards, Arne