brendano / stanford_corenlp_pywrapper


Can't get the Chinese models to work #24


victoryhb commented 9 years ago

Hi! I wonder if anyone has used the Wrapper to parse Chinese texts before? I have the following code:

```python
from stanford_corenlp_pywrapper import sockwrap

parser_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/*"
cn_model_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/stanford-chinese-corenlp-2015-04-20-models.jar"

p = sockwrap.SockWrap(
    configdict={
        'annotators': "segment, ssplit, pos, parse",
        'customAnnotatorClass.segment': 'edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator',
        'segment.model': 'edu/stanford/nlp/models/segmenter/chinese/ctb.gz',
        'segment.sighanCorporaDict': 'edu/stanford/nlp/models/segmenter/chinese',
        'segment.serDictionary': 'edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz',
        'segment.sighanPostProcessing': True,
        'ssplit.boundaryTokenRegex': '[.]|[!?]+|[。]|[!?]+',
        "parse.model": "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz",
        "pos.model": "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger",
    },
    corenlp_jars=[parser_path, cn_model_path],
)

p.parse_doc(u"你爱我吗?")
```

The configs are taken from the default CoreNLP properties for parsing Chinese: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties

When running the wrapper, I get the following error:

```
[Server] Started socket server on port 12340
INFO:StanfordSocketWrap:Successful ping. The server has started.
INFO:StanfordSocketWrap:Subprocess is ready.
Adding Segmentation annotation ...
INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
你爱我吗? ---> [你, 爱, 我, 吗, ?]
java.lang.RuntimeException: don't know how to handle annotator segment
    at corenlp.JsonPipeline.addAnnoToSentenceObject(JsonPipeline.java:282)
    at corenlp.JsonPipeline.processTextDocument(JsonPipeline.java:312)
    at corenlp.SocketServer.runCommand(SocketServer.java:140)
    at corenlp.SocketServer.socketServerLoop(SocketServer.java:194)
    at corenlp.SocketServer.main(SocketServer.java:107)
```

Any idea why this is happening? Many thanks in advance!

brendano commented 9 years ago

The wrapper doesn't support it -- you'd have to modify the Java code where the error is happening (`JsonPipeline.addAnnoToSentenceObject`) to add the segmentation information to the JSON output.
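One possible workaround (not discussed in this thread, just a sketch): the log above shows the segmenter itself ran fine (`你爱我吗? ---> [你, 爱, 我, 吗, ?]`) and only the wrapper's JSON layer chokes on the custom `segment` annotator. So one could segment the text beforehand with any external tool, then configure CoreNLP to tokenize on whitespace via the standard `tokenize.whitespace` property, avoiding the custom annotator entirely. This is a config-fragment sketch under those assumptions; the paths are copied from the question, and whether the wrapper passes `tokenize.whitespace` through correctly would need to be verified against your install:

```python
from stanford_corenlp_pywrapper import sockwrap

parser_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/*"
cn_model_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/stanford-chinese-corenlp-2015-04-20-models.jar"

# Hypothetical: text already segmented by some external segmenter,
# with tokens separated by spaces.
pre_segmented = u"你 爱 我 吗 ?"

p = sockwrap.SockWrap(
    configdict={
        # no custom "segment" annotator; use the built-in tokenizer
        'annotators': "tokenize, ssplit, pos, parse",
        # trust the whitespace-separated pre-segmentation
        'tokenize.whitespace': 'true',
        'ssplit.boundaryTokenRegex': '[.]|[!?]+|[。]|[!?]+',
        "parse.model": "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz",
        "pos.model": "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger",
    },
    corenlp_jars=[parser_path, cn_model_path],
)

p.parse_doc(pre_segmented)
```

The trade-off is that sentence splitting and parsing then depend on the quality of whatever did the pre-segmentation, but all annotators involved are ones the wrapper's JSON pipeline already knows how to serialize.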