baidu / DuReader

Baseline Systems of DuReader Dataset
http://ai.baidu.com/broad/subordinate?dataset=dureader

modify NoneType error #45

Closed yongbowin closed 5 years ago

yongbowin commented 5 years ago

Fix the NoneType error for the variable para_id.

yongbowin commented 5 years ago

When I run the command sh run.sh --para_extraction, the following error occurs:

Start paragraph extraction, this may take a few hours
Source dir: ../data/preprocessed
Target dir: ../data/extracted
Processing trainset
Processing devset
Processing testset
Traceback (most recent call last):
  File "paragraph_extraction.py", line 197, in <module>
    paragraph_selection(sample, mode)
  File "paragraph_extraction.py", line 111, in paragraph_selection
    status = dup_remove(doc)
  File "paragraph_extraction.py", line 66, in dup_remove
    if p_idx < para_id:
TypeError: '<' not supported between instances of 'int' and 'NoneType'
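For context, this is Python 3's stricter comparison semantics at work: ordering an int against None raises TypeError (Python 2 silently allowed such comparisons), which is exactly what happens when para_id is never assigned. A one-line reproduction:

```python
para_id = None  # dup_remove leaves para_id unset for testset docs

try:
    0 < para_id  # the comparison at paragraph_extraction.py line 66
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'int' and 'NoneType'
```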

So I opened a pull request to fix this bug.

SunnyMarkLiu commented 5 years ago

This is caused by dup_remove in paragraph_extraction.py when processing the testset. For the train and dev sets, the default is to read the most_related_para field, but the testset does not have that field, so we need to find the paragraph most related to the question instead. The fix:

def dup_remove(doc, question=None):
    """
    For each document, remove the duplicated paragraphs.
    Args:
        doc: a doc in the sample
        question: the segmented question, used as a fallback for the testset
    Returns:
        bool: True if any paragraph was removed
    Raises:
        None
    """
    paragraphs_his = {}
    del_ids = []
    para_id = None
    # ----------------- modify start -----------------
    if 'most_related_para' in doc:  # for trainset and devset
        para_id = doc['most_related_para']
    else:  # for testset
        para_id = find_best_question_match(doc, question)
    # ----------------- modify end -----------------

    doc['paragraphs_length'] = []
    for p_idx, (segmented_paragraph, paragraph_score) in enumerate(zip(doc["segmented_paragraphs"],
                                                                       doc["segmented_paragraphs_scores"])):
        doc['paragraphs_length'].append(len(segmented_paragraph))
        paragraph = ''.join(segmented_paragraph)
        if paragraph in paragraphs_his:  
            del_ids.append(p_idx)
            if p_idx == para_id:
                para_id = paragraphs_his[paragraph]
            continue
        paragraphs_his[paragraph] = p_idx

    # delete
    prev_del_num = 0
    del_num = 0
    for p_idx in del_ids:
        if p_idx < para_id:
            prev_del_num += 1
        del doc["segmented_paragraphs"][p_idx - del_num]
        del doc["segmented_paragraphs_scores"][p_idx - del_num]
        del doc['paragraphs_length'][p_idx - del_num]
        del_num += 1
    if len(del_ids) != 0:
        if 'most_related_para' in doc:
            doc['most_related_para'] = para_id - prev_del_num
        doc['paragraphs'] = []
        for segmented_para in doc["segmented_paragraphs"]:
            paragraph = ''.join(segmented_para)
            doc['paragraphs'].append(paragraph)
        return True
    else:
        return False
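As a sanity check, here is a minimal, self-contained sketch of the two code paths (this is not the repo's actual code; find_best_question_match below is a simplified stand-in for the repo's helper of the same name, matching on token overlap with the question):

```python
def find_best_question_match(doc, question):
    # Simplified stand-in: pick the paragraph sharing the most tokens
    # with the (segmented) question.
    best_id, best_overlap = 0, -1
    q_tokens = set(question or [])
    for p_idx, para in enumerate(doc['segmented_paragraphs']):
        overlap = len(q_tokens & set(para))
        if overlap > best_overlap:
            best_id, best_overlap = p_idx, overlap
    return best_id

def dedup(doc, question=None):
    # Simplified dup_remove: drop duplicated paragraphs and return the
    # index of the most-related paragraph after deduplication.
    if 'most_related_para' in doc:   # trainset / devset
        para_id = doc['most_related_para']
    else:                            # testset: fall back to a question match
        para_id = find_best_question_match(doc, question)
    target_key = ''.join(doc['segmented_paragraphs'][para_id])
    seen, kept = {}, []
    for para in doc['segmented_paragraphs']:
        key = ''.join(para)
        if key not in seen:
            seen[key] = len(kept)
            kept.append(para)
    doc['segmented_paragraphs'] = kept
    return seen[target_key]

# Testset-style doc: no 'most_related_para'. The old code left para_id = None
# and crashed on "p_idx < para_id"; with the fallback it works.
doc = {'segmented_paragraphs': [['a', 'b'], ['a', 'b'], ['c', 'd']]}
print(dedup(doc, question=['c']))        # 1: ['c', 'd'] after the duplicate is dropped
print(len(doc['segmented_paragraphs']))  # 2
```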
yongbowin commented 5 years ago


Yes, I reviewed this and confirmed that you are right. Thanks for your comment.