baidu / DuReader

Baseline Systems of DuReader Dataset
http://ai.baidu.com/broad/subordinate?dataset=dureader

KeyError: 'segmented_paragraphs' #15

Closed: SeekPoint closed this issue 6 years ago

SeekPoint commented 6 years ago

```
mldl@mldlUB1604:~/ub16_prj/DuReader$ cat data/raw/trainset/search.train.json | python3 utils/preprocess.py > data/preprocessed/trainset/search.train.json
Traceback (most recent call last):
  File "utils/preprocess.py", line 217, in <module>
    find_fake_answer(sample)
  File "utils/preprocess.py", line 158, in find_fake_answer
    for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
KeyError: 'segmented_paragraphs'
```

lkliukai commented 6 years ago

Please use the PREPROCESSED version of the dataset; the raw version does not contain the 'segmented_paragraphs' field. Alternatively, you can segment the Chinese text yourself, or simply split it into characters.
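For anyone taking the self-segmentation route, below is a minimal sketch of what that could look like. It assumes the raw JSON-lines layout (a top-level 'question' field plus per-document 'title' and 'paragraphs' fields) and uses the jieba tokenizer as a stand-in for whatever segmenter produced the official preprocessed release, so its output may not match the published files exactly.

```python
# A minimal sketch, NOT the official preprocessing: it assumes the raw
# DuReader JSON-lines layout ('question' at the top level, 'title' and
# 'paragraphs' inside each document) and uses jieba (pip install jieba),
# which may tokenize differently from the segmenter used for the
# official preprocessed release.
import json
import sys

import jieba


def segment(text):
    # jieba.lcut tokenizes a Chinese string into a list of words.
    return jieba.lcut(text)


for line in sys.stdin:
    sample = json.loads(line)
    # Add the segmented fields that utils/preprocess.py expects.
    sample['segmented_question'] = segment(sample.get('question', ''))
    for doc in sample.get('documents', []):
        doc['segmented_title'] = segment(doc.get('title', ''))
        doc['segmented_paragraphs'] = [segment(p) for p in doc.get('paragraphs', [])]
    print(json.dumps(sample, ensure_ascii=False))
```

Saved as, say, segment.py (a hypothetical helper, not part of this repo), it could be inserted into the original pipeline: `cat data/raw/trainset/search.train.json | python segment.py | python utils/preprocess.py > data/preprocessed/trainset/search.train.json`.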

fooSynaptic commented 5 years ago

@lkliukai The preprocessed dir does not contain this file.

ShaolinDeng commented 5 years ago

```
~/github/rasa_opensource/rasa_chinese/DuReader$ cat data/raw/trainset/search.train.json | python utils/preprocess.py > data/preprocessed/trainset/search.train.json
Traceback (most recent call last):
  File "utils/preprocess.py", line 217, in <module>
    find_fake_answer(sample)
  File "utils/preprocess.py", line 158, in find_fake_answer
    for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
KeyError: 'segmented_paragraphs'
```