AmoseKang / DrQA_cn


Tokenizer problem #10

Open zsp1993 opened 5 years ago

zsp1993 commented 5 years ago

Hello, I downloaded the Chinese jar packages and put them into the corresponding folder of Facebook's open-source DrQA tokenizer, then copied the whole folder into the matching directory of your open-source DrQA_cn. Processing fails with an error:

    process('江泽明是谁?', doc_n=1, pred_n=1, net_n=1)
    01/10/2019 10:51:16 AM: [ [question after filting : 江泽明是谁? ] ]
    01/10/2019 10:51:17 AM: [ [retreive from net : 1 | expect : 1] ]
    =================raw text==================
    (mojibake: a Baidu result page fetched by the crawler, GBK bytes decoded with the wrong charset; the original text is unrecoverable)

    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 99, in expect_loop
        incoming = spawn.read_nonblocking(spawn.maxread, timeout)
      File "/usr/local/lib/python3.5/dist-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
        raise TIMEOUT('Timeout exceeded.')
    pexpect.exceptions.TIMEOUT: Timeout exceeded.

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "scripts/pipeline/sinteractive.py", line 67, in process
        docTopN=doc_n, netTopN=net_n)
      File "/home/zhangshaopeng/github/DrQA_cn/drqa/pipeline/simpleDrQA.py", line 50, in predict
        ans.extend(process(text))
      File "/home/zhangshaopeng/github/DrQA_cn/drqa/pipeline/simpleDrQA.py", line 33, in process
        line, query, candidates=None, top_n=qasTopN)
      File "/home/zhangshaopeng/github/DrQA_cn/drqa/reader/predictor.py", line 86, in predict
        results = self.predict_batch([(document, question, candidates,)], top_n)
      File "/home/zhangshaopeng/github/DrQA_cn/drqa/reader/predictor.py", line 105, in predict_batch
        q_tokens = list(map(self.tokenizer.tokenize, questions))
      File "/home/zhangshaopeng/github/DrQA_cn/drqa/tokenizers/Zh_tokenizer.py", line 105, in tokenize
        self.corenlp.expect_exact('NLP>', searchwindowsize=100)
      File "/usr/local/lib/python3.5/dist-packages/pexpect/spawnbase.py", line 390, in expect_exact
        return exp.expect_loop(timeout)
      File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 107, in expect_loop
        return self.timeout(e)
      File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 70, in timeout
        raise TIMEOUT(msg)
    pexpect.exceptions.TIMEOUT: Timeout exceeded.
    <pexpect.pty_spawn.spawn object at 0x7feb66984128>
    command: /bin/bash
    args: ['/bin/bash']
    buffer (last 100 chars): b'-437: /home/zhangshaopeng/github/DrQA_cn\x07root@kml-dtmachine-437:/home/zhangshaopeng/github/DrQA_cn#'
    before (last 100 chars): b'-437: /home/zhangshaopeng/github/DrQA_cn\x07root@kml-dtmachine-437:/home/zhangshaopeng/github/DrQA_cn#'
    after: <class 'pexpect.exceptions.TIMEOUT'>
    match: None
    match_index: None
    exitstatus: None
    flag_eof: False
    pid: 15919
    child_fd: 12
    closed: False
    timeout: 60
    delimiter: <class 'pexpect.exceptions.EOF'>
    logfile: None
    logfile_read: None
    logfile_send: None
    maxread: 100000
    ignorecase: False
    searchwindowsize: None
    delaybeforesend: 0
    delayafterclose: 0.1
    delayafterterminate: 0.1
    searcher: searcher_string:
        0: "b'NLP>'"

zsp1993 commented 5 years ago

Is something in the tokenizer misconfigured? I've been stuck on this for a long time.

AmoseKang commented 5 years ago

It may be a problem with my Baidu crawler; the encoding also looks wrong. To test the tokenizer, print the command directly and first check whether it runs on the command line.

zsp1993 commented 5 years ago

Is this what you mean by printing the command and running it on the command line?

    from drqa.tokenizers import CoreNLPTokenizer
    tok = CoreNLPTokenizer()
    tok.tokenize('hello world').words()

AmoseKang commented 5 years ago

Zh_tokenizer.py

cmd = ['java', '-mx' + self.mem, '-cp', '\'%s\'' % self.classpath,
               'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-props',
               'StanfordCoreNLP-chinese.properties',
               '-annotators', annotators, '-tokenize.options', options,
               '-outputFormat', 'json', '-prettyPrint', 'false']
# print(cmd)  

You also need to change one path line in this file.

zsp1993 commented 5 years ago

Hi, I've already changed that path (and nothing else):

    self.classpath = '/Users/zhangshaopeng/pyproject/github/DrQA_cn/data/corenlp/*'

To test it, should I run this file?
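A quick sanity check before anything else is to verify that the classpath glob actually matches jars on the machine the code runs on (later messages in this thread show it running under /home/zhangshaopeng/... on Linux, where a /Users/... macOS path would match nothing). A minimal stdlib sketch, with the path above as a stand-in:

```python
import glob

# self.classpath as set in Zh_tokenizer.py (replace with your own path);
# glob.glob() expands the trailing '*' the same way the shell would.
classpath = '/Users/zhangshaopeng/pyproject/github/DrQA_cn/data/corenlp/*'
jars = glob.glob(classpath)

# Zero matches means CoreNLP can never start, and pexpect will simply
# time out waiting for the 'NLP>' prompt, as in the traceback above.
print(len(jars), 'jar(s) matched')
```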

zsp1993 commented 5 years ago

This is what the uncommented print outputs:

    ['java', '-mx2g', '-cp', "'/home/zhangshaopeng/github/DrQA_cn/data/corenlp/*'", 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-props', 'StanfordCoreNLP-chinese.properties', '-annotators', 'tokenize,ssplit,pos,lemma,ner', '-tokenize.options', 'untokenizable=noneDelete,invertible=true', '-outputFormat', 'json', '-prettyPrint', 'false']

Pasting it straight into the terminal gives: bash: [java,: command not found
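For the record, the repr() of a Python list is not a shell command, which is why bash complains about `[java,`. One way to turn the argument list into a pasteable line is to shell-quote each element; a stdlib sketch using the list printed above (the classpath is given bare here, without the embedded quotes the code adds via `'\'%s\''`, because shlex.quote supplies the quoting itself):

```python
import shlex

# Argument list as printed by the uncommented print(cmd), with the
# classpath unquoted because shlex.quote() below adds the shell quoting.
cmd = ['java', '-mx2g', '-cp', '/home/zhangshaopeng/github/DrQA_cn/data/corenlp/*',
       'edu.stanford.nlp.pipeline.StanfordCoreNLP',
       '-props', 'StanfordCoreNLP-chinese.properties',
       '-annotators', 'tokenize,ssplit,pos,lemma,ner',
       '-tokenize.options', 'untokenizable=noneDelete,invertible=true',
       '-outputFormat', 'json', '-prettyPrint', 'false']

# shlex.quote() wraps the glob in single quotes so bash passes the
# literal '*' through to java instead of expanding it.
print(' '.join(shlex.quote(arg) for arg in cmd))
```

The resulting line can be pasted into bash as-is; the quoted `'.../corenlp/*'` stays a single argument.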

zsp1993 commented 5 years ago

After removing the punctuation and brackets and running it, I get this error:

    java -mx2g -cp /home/zhangshaopeng/github/DrQA_cn/data/corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize ssplit pos lemma ner -tokenize.options untokenizable=noneDelete invertible=true -outputFormat json -prettyPrint false
    Error: Could not find or load main class .home.zhangshaopeng.github.DrQA_cn.data.corenlp.javax.json-api-1.0-sources.jar
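That error is consistent with the shell expanding the unquoted classpath glob: bash replaces /home/.../corenlp/* with the directory's files in sorted order before java starts, so -cp is paired with the first jar only, and the next one (javax.json-api-1.0-sources.jar, whose path java prints with dots instead of slashes) lands in the main-class position. A stdlib simulation of that expansion, using stand-in files in a temp directory:

```python
import glob
import os
import tempfile

# Stand-in corenlp directory holding a few jars from the real listing.
d = tempfile.mkdtemp()
for name in ('ejml-0.23.jar', 'javax.json-api-1.0-sources.jar',
             'stanford-corenlp-3.8.0.jar'):
    open(os.path.join(d, name), 'w').close()

# What bash does with an UNQUOTED glob: expand it into separate,
# name-sorted words before java ever runs.
expanded = sorted(glob.glob(os.path.join(d, '*')))
argv = ['java', '-cp'] + expanded + ['edu.stanford.nlp.pipeline.StanfordCoreNLP']

print(argv[2])  # the only jar java takes as the classpath (ejml-0.23.jar)
print(argv[3])  # lands in the main-class slot -> "Could not find or load main class"
```

Quoting the glob on the command line, `-cp "/home/zhangshaopeng/github/DrQA_cn/data/corenlp/*"`, keeps it as one argument; java then expands the `*` in the classpath itself.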

The folder /home/zhangshaopeng/github/DrQA_cn/data/corenlp/ contains the following:

    ejml-0.23.jar
    javax.json-api-1.0-sources.jar
    javax.json.jar
    joda-time-2.9-sources.jar
    joda-time.jar
    jollyday-0.4.9-sources.jar
    jollyday.jar
    protobuf.jar
    slf4j-api.jar
    slf4j-simple.jar
    stanford-chinese-corenlp-2017-06-09-models.jar
    stanford-corenlp-3.8.0.jar
    stanford-corenlp-3.8.0-javadoc.jar
    stanford-corenlp-3.8.0-models.jar
    stanford-corenlp-3.8.0-sources.jar
    xom-1.2.10-src.jar
    xom.jar