dasmith / stanford-corenlp-python

Python wrapper for Stanford CoreNLP tools v3.4.1
GNU General Public License v2.0

Certain characters lead to Internal Error #15

Open PiJoules opened 10 years ago

PiJoules commented 10 years ago

I am trying to parse the sentence

WASHINGTON — Republicans on Thursday vowed a swift and forceful response to the executive action on immigration that President Obama is to announce in a prime-time address, accusing the president of exceeding the power of his office and promising a legislative fight when they take full control of Congress next year.

but I keep getting the error

Traceback (most recent call last):
  File "client.py", line 19, in <module>
    result = nlp.parse(text2)
  File "client.py", line 12, in parse
    return json.loads(self.server.parse(text))
  File "/Users/Pi_Joules/projects/kompact/stanford-corenlp-python/jsonrpc.py", line 934, in __call__
    return self.__req(self.__name, args, kwargs)
  File "/Users/Pi_Joules/projects/kompact/stanford-corenlp-python/jsonrpc.py", line 907, in __req
    resp = self.__data_serializer.loads_response( resp_str )
  File "/Users/Pi_Joules/projects/kompact/stanford-corenlp-python/jsonrpc.py", line 626, in loads_response
    raise RPCInternalError(error_data)
jsonrpc.RPCInternalError: <RPCFault -32603: 'Internal error.' (None)>

However, the error does not appear when I remove the em dash (—) from the first sentence. The same goes for curly single and double quotes like “”. Is there any way I can still parse these characters with this wrapper?

Thanks

h10r commented 9 years ago

One workaround I found for this is to add a line like:

line = line.decode('ascii', 'ignore')

This drops every non-ASCII character from your line, leaving plain ASCII (in your case, PiJoules, replace line with text2).
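Dropping characters outright also removes punctuation that has a reasonable ASCII equivalent. The sketch below is one possible variation on this workaround, not part of stanford-corenlp-python: it first maps the em dash and curly quotes mentioned above to ASCII look-alikes and only then discards whatever non-ASCII characters remain. The helper name to_ascii and the replacement table are hypothetical, and the code assumes Python 3 text strings (under Python 2, which the wrapper targets, the one-liner above works on byte strings instead).

```python
# Hypothetical helper for illustration; not part of stanford-corenlp-python.
# Assumes Python 3, where str is already Unicode text.
REPLACEMENTS = {
    "\u2014": "-",   # em dash
    "\u2013": "-",   # en dash
    "\u2018": "'",   # left single curly quote
    "\u2019": "'",   # right single curly quote
    "\u201c": '"',   # left double curly quote
    "\u201d": '"',   # right double curly quote
}


def to_ascii(text):
    """Swap common typographic characters for ASCII, then drop the rest."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Anything still outside ASCII is silently removed.
    return text.encode("ascii", "ignore").decode("ascii")


cleaned = to_ascii("WASHINGTON \u2014 Republicans on Thursday vowed a swift response")
```

Note that this still loses information (for example, currency symbols are removed), so it only sidesteps the error and does not fix the underlying encoding problem.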

maali-mnasri commented 9 years ago

The parser raises an RPC internal error -32603 whenever I try to parse this sentence:

"They are now examining whether Ahmed drove one of the vehicles used in the Dar es Salaam bombing and whether the 400 pounds of explosives used in both blasts came into Tanzania in a shipment of rice imported by one of his companies."

When I split the sentence into parts and tried to parse each part, I found that the bigram "400 pounds" is causing the error. The parser is unable to parse it.

The problem is not solved when I encode my sentence as ASCII. Does anyone know how to fix this issue?

rgtjf commented 8 years ago

It also happens when I try to parse "Vivendi shares closed 1.9 percent at 15.80 euros in Paris after falling 3.6 percent on Monday." Here the word "euros" causes the error.

maali-mnasri commented 8 years ago

@rgtjf I fixed the problem by tracing the data exchanged between all the functions until I found where the error came from. The parser in corenlp.py returns its parsing result in JSON format. In my sentence, it replaced the word "pounds" with its symbol "£". Later, in the same script, the function parse_parser_results(text) (line 67) is unable to read that symbol. In your case I guess "euros" is converted to its symbol "€", which causes the problem. You can print the "text" data just after line 67 to see the parser output.

To solve the problem, I converted the text data in parse_parser_results to unicode with UTF-8 encoding. In short, add text=unicode(text,"utf-8") before the loop (line 75) in parse_parser_results(text) (line 67) of corenlp.py and it will work.
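The idea behind this fix, decoding the parser output from UTF-8 bytes into a Unicode string before processing it, can be sketched as follows. This is only an illustration under stated assumptions: ensure_text is a hypothetical helper, not a function in corenlp.py, and it is written for Python 3, where the Python 2 call unicode(text, "utf-8") corresponds to bytes.decode("utf-8").

```python
# Hypothetical helper for illustration; not part of corenlp.py.
# Python 3 equivalent of the Python 2 call unicode(text, "utf-8").
def ensure_text(data, encoding="utf-8"):
    """Return data as a text string, decoding it first if it is bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Bytes such as a parser might emit after turning "pounds" into its symbol.
raw = "400 \u00a3".encode("utf-8")
text = ensure_text(raw)  # now a proper Unicode string containing the pound sign
```

Decoding once at the boundary keeps symbols such as "£" and "€" intact, unlike the ASCII workaround, which discards them.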

rgtjf commented 8 years ago

@maali-mnasri Great! Thank you.

hayj commented 8 years ago

Thank you @maali-mnasri

wind09 commented 7 years ago

Thank you @maali-mnasri, I ran into the same problem.

zhanyuanyang commented 6 years ago

Thank you @maali-mnasri