baidu / DuReader

Baseline Systems of DuReader Dataset
http://ai.baidu.com/broad/subordinate?dataset=dureader

run cat data/raw/trainset/search.train.json | python utils/preprocess.py > data/preprocessed/trainset/search.train.json #49

Closed Apollo2Mars closed 4 years ago

Apollo2Mars commented 5 years ago

Traceback (most recent call last):
  File "utils/preprocess.py", line 217, in <module>
    find_fake_answer(sample)
  File "utils/preprocess.py", line 158, in find_fake_answer
    for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
KeyError: 'segmented_paragraphs'

This error occurs when I run the command from the README. Please check.

JYZ122 commented 5 years ago

I have the same problem, are you good now?

Apollo2Mars commented 5 years ago

> I have the same problem, are you good now?

I still have the problem, but I found that the download script can download the preprocessed data.

JYZ122 commented 5 years ago

Are you Chinese? You can add me on WeChat if you are. WeChat: JIMWIJ

lkliukai commented 5 years ago

Please refer to issue #15, or you can simply split the text into words character by character.
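The character-splitting workaround suggested here could look roughly like the following Python 3 sketch. It is not the maintainers' code: only the key `segmented_paragraphs` comes from the traceback in this thread, while the other field names (`question`, `answers`, `documents`, `title`, `paragraphs` and their `segmented_*` counterparts) and the helper names `split_chars`, `add_char_segmentation` and `segment_stream` are assumptions made for illustration. Dropping whitespace characters is also a choice of this sketch.

```python
import json


def split_chars(text):
    # Treat every non-whitespace character as one token.
    return [ch for ch in text if not ch.isspace()]


def add_char_segmentation(sample):
    # Add segmented_* fields only where they are missing.
    # Field names other than 'segmented_paragraphs' are assumed, not confirmed.
    if 'question' in sample and 'segmented_question' not in sample:
        sample['segmented_question'] = split_chars(sample['question'])
    if 'answers' in sample and 'segmented_answers' not in sample:
        sample['segmented_answers'] = [split_chars(a) for a in sample['answers']]
    for doc in sample.get('documents', []):
        if 'title' in doc and 'segmented_title' not in doc:
            doc['segmented_title'] = split_chars(doc['title'])
        if 'paragraphs' in doc and 'segmented_paragraphs' not in doc:
            doc['segmented_paragraphs'] = [split_chars(p) for p in doc['paragraphs']]
    return sample


def segment_stream(lines):
    # Read one JSON object per line and yield it back as a JSON string
    # with the character-level segmentation added.
    for line in lines:
        line = line.strip()
        if line:
            yield json.dumps(add_char_segmentation(json.loads(line)),
                             ensure_ascii=False)
```

To use it as a filter in front of utils/preprocess.py, one could loop over `segment_stream(sys.stdin)` and print each result, assuming the raw files hold one JSON object per line.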

JYZ122 commented 5 years ago

What should I do if I want to convert the raw dataset into the preprocessed format?

JYZ122 commented 5 years ago

zhangyan@ubuntu:~/jyz/DuReader-master$ cat data/preprocessed/trainset/search.train.json | python utils/preprocess.py > data/preprocessed/trainset/search1.train.json
Traceback (most recent call last):
  File "utils/preprocess.py", line 218, in <module>
    print(json.dumps(sample, encoding='utf8', ensure_ascii=False))
  File "/home/haoyu/env/anaconda3/lib/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
TypeError: __init__() got an unexpected keyword argument 'encoding'

This error occurred.

Apollo2Mars commented 5 years ago

> Are you Chinese? You can add me on WeChat if you are. WeChat: JIMWIJ

OK, I'm Chinese. I'll add you on WeChat.

lsq357 commented 5 years ago

Running cat data/raw/trainset/search.zhidao.json | python utils/preprocess.py > data/preprocessed/trainset/zhidao.train.json may succeed.

yuanyehome commented 5 years ago

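This needs to be run with Python 2. The json module in Python 3 has no encoding parameter. In run.sh you can append a 2 to every python command (if your default is Python 3).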
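As an alternative to switching interpreters, the json.dumps call shown in the traceback could be wrapped so that the encoding keyword is only passed under Python 2. This is a minimal sketch rather than a patch from the repository, and the helper name `dumps_utf8` is hypothetical.

```python
import json
import sys


def dumps_utf8(sample):
    # Python 3: json.dumps has no 'encoding' keyword, so omit it.
    if sys.version_info[0] >= 3:
        return json.dumps(sample, ensure_ascii=False)
    # Python 2: keep the original call from utils/preprocess.py.
    return json.dumps(sample, encoding='utf8', ensure_ascii=False)
```

The print statement at line 218 of utils/preprocess.py would then call `dumps_utf8(sample)`. Other parts of the script may still assume Python 2, so this only addresses the TypeError reported above.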

JYZ122 commented 5 years ago

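Why are some of the questions in the trainset not used during training? For example, my training set has 1000 questions, but the output shows that only 900 of them are usable. Why is that?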

Apollo2Mars commented 5 years ago

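> This needs to be run with Python 2. The json module in Python 3 has no encoding parameter. In run.sh you can append a 2 to every python command (if your default is Python 3).

Thanks, I'll give it a try.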

youngawesome commented 5 years ago

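> > This needs to be run with Python 2. The json module in Python 3 has no encoding parameter. In run.sh you can append a 2 to every python command (if your default is Python 3).
>
> Thanks, I'll give it a try.

Hi, was this error resolved in the end? If so, how did you fix it?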

foreversolar commented 5 years ago

So this problem still hasn't been solved.