Closed: yongbowin closed this 5 years ago
When I run the command sh run.sh --para_extraction, the following error occurs:
Start paragraph extraction, this may take a few hours
Source dir: ../data/preprocessed
Target dir: ../data/extracted
Processing trainset
Processing devset
Processing testset
Traceback (most recent call last):
File "paragraph_extraction.py", line 197, in <module>
paragraph_selection(sample, mode)
File "paragraph_extraction.py", line 111, in paragraph_selection
status = dup_remove(doc)
File "paragraph_extraction.py", line 66, in dup_remove
if p_idx < para_id:
TypeError: '<' not supported between instances of 'int' and 'NoneType'
Traceback (most recent call last):
File "paragraph_extraction.py", line 197, in <module>
paragraph_selection(sample, mode)
File "paragraph_extraction.py", line 111, in paragraph_selection
status = dup_remove(doc)
File "paragraph_extraction.py", line 66, in dup_remove
if p_idx < para_id:
TypeError: '<' not supported between instances of 'int' and 'NoneType'
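For reference, this is just Python 3 refusing an ordering comparison between int and NoneType: para_id stays None for testset docs because they have no most_related_para field, so the later comparison p_idx < para_id fails. A minimal, self-contained reproduction:

```python
# Minimal reproduction of the error: in Python 3, ordering comparisons
# between int and NoneType are undefined. para_id stays None whenever the
# doc has no 'most_related_para' field, as is the case for the testset.
para_id = None
p_idx = 0
try:
    p_idx < para_id
except TypeError as e:
    print(e)  # '<' not supported between instances of 'int' and 'NoneType'
```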
So I opened a pull request to fix this bug.
The error comes from dup_remove in paragraph_extraction.py when processing the testset. For the train and dev sets the function reads the most_related_para field, but testset samples do not have a most_related_para field, so we need to find the paragraph most related to the question instead. So, just modify dup_remove as follows:
def dup_remove(doc, question=None):
    """
    For each document, remove the duplicated paragraphs
    Args:
        doc: a doc in the sample
        question: the segmented question, used on the testset to find the
            paragraph most related to the question
    Returns:
        bool
    Raises:
        None
    """
    paragraphs_his = {}
    del_ids = []
    para_id = None
    # ----------------- modify start -----------------
    if 'most_related_para' in doc:  # for trainset and devset
        para_id = doc['most_related_para']
    else:  # for testset
        para_id = find_best_question_match(doc, question)
    # ----------------- modify end -----------------
    doc['paragraphs_length'] = []
    for p_idx, (segmented_paragraph, paragraph_score) in enumerate(
            zip(doc["segmented_paragraphs"], doc["segmented_paragraphs_scores"])):
        doc['paragraphs_length'].append(len(segmented_paragraph))
        paragraph = ''.join(segmented_paragraph)
        if paragraph in paragraphs_his:
            del_ids.append(p_idx)
            if p_idx == para_id:
                para_id = paragraphs_his[paragraph]
            continue
        paragraphs_his[paragraph] = p_idx
    # delete
    prev_del_num = 0
    del_num = 0
    for p_idx in del_ids:
        if p_idx < para_id:
            prev_del_num += 1
        del doc["segmented_paragraphs"][p_idx - del_num]
        del doc["segmented_paragraphs_scores"][p_idx - del_num]
        del doc['paragraphs_length'][p_idx - del_num]
        del_num += 1
    if len(del_ids) != 0:
        if 'most_related_para' in doc:
            doc['most_related_para'] = para_id - prev_del_num
        doc['paragraphs'] = []
        for segmented_para in doc["segmented_paragraphs"]:
            paragraph = ''.join(segmented_para)
            doc['paragraphs'].append(paragraph)
        return True
    else:
        return False
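Correspondingly, the call site in paragraph_selection has to pass the question through so that find_best_question_match has something to match against on the testset. A minimal sketch of the call-site change, assuming the preprocessed samples store the segmented question under 'segmented_question' and that find_best_question_match is importable from the preprocessing utilities (adjust the names if your copy differs):

```python
# Hypothetical call-site change inside paragraph_selection(sample, mode):
# forward the segmented question so dup_remove can fall back to
# find_best_question_match when a doc lacks 'most_related_para'.
for doc in sample['documents']:
    status = dup_remove(doc, sample['segmented_question'])
```

With this in place, the train and dev sets still use the precomputed most_related_para, and only the testset pays the extra cost of matching the question against each paragraph.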
Yes, I reviewed this and found that you are right. Thanks for your comment.
modify NoneType error of var para_id