amazon-science / irgr

Apache License 2.0

Asking for the retrieval results & Question about the Recall@25 metric #2

Open unnormalization opened 2 years ago

unnormalization commented 2 years ago

Hi! Thanks for your great work and your code! I have a question about the Recall@25 metric (the compute_retrieval_metrics function in entailment_retrieval.ipynb).

It seems that you calculate the recall pooled over all samples, as recall = tot_sent_correct / float(tot_sent):

$$ \text{Recall} = \frac{\text{number of true positives, summed over all samples}}{\text{number of gold sentences, summed over all samples}}. $$

However, I would like to calculate the recall as a per-sample (macro) average instead:

$$ \text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{number of true positives in sample } i}{\text{number of gold sentences in sample } i}. $$
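
To make the difference concrete, here is a toy example with made-up counts (the numbers are hypothetical, purely to show that the two definitions can disagree):

# Hypothetical per-sample counts, purely for illustration.
tp_per_sample   = [3, 1]   # gold sentences recovered per sample
gold_per_sample = [3, 4]   # gold sentences per sample

# Pooled (micro) recall, as in compute_retrieval_metrics: (3 + 1) / (3 + 4) ≈ 0.571
micro_recall = sum(tp_per_sample) / float(sum(gold_per_sample))

# Per-sample (macro) recall, as in the second formula: (3/3 + 1/4) / 2 = 0.625
macro_recall = sum(t / float(g) for t, g in zip(tp_per_sample, gold_per_sample)) / len(tp_per_sample)

print(micro_recall, macro_recall)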

Could you please share your retrieval results (corresponding to the Table 1 in your paper)? They would help me a lot since I can calculate my own recall metric.

def compute_retrieval_metrics(retrieved_sentences_lst, split = 'test', verbose = False):
    dataset = entail_dataset.data['task_1'][split]
    context_mapping = create_context_mapping(dataset)

    assert len(retrieved_sentences_lst) == len(dataset)
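
    # Counters pooled over the whole split:
    #   tot_sent            - gold (context) sentences
    #   tot_sent_correct    - gold sentences recovered by retrieval
    #   tot_sent_missing    - gold sentences the retrieval missed
    #   tot_no_missing      - samples for which nothing is missing
    #   tot_sent_not_in_wt  - matched sentences flagged as not in the WorldTree corpus
    #   tot_*_sent_height   - summed heights of missing / correct sentences on the gold tree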

    tot_sent = 0
    tot_sent_correct = 0
    tot_sent_missing = 0
    tot_no_missing = 0
    tot_sent_not_in_wt = 0
    tot_missing_sent_height = 0
    tot_correct_sent_height = 0

    correct_retrieved_lst = []
    errors_lst = [] # in retrieved but not in gold
    missing_lst = [] # in gold but not in retrieved

    for ret_sentences, dp_context_mapping, datapoint in zip(retrieved_sentences_lst, context_mapping, dataset):
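        # For each retrieved sentence, look for a match among the sample's gold
        # context sentences: matched texts go to correct_retrieved, unmatched to errors.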
        correct_retrieved = []
        errors = []
        for ret_sentence in ret_sentences:
            is_correct = False
            for mapping_texts in dp_context_mapping.values():
                if ret_sentence in mapping_texts.values():
                    is_correct = True
                    if mapping_texts['text'] not in correct_retrieved:
                        correct_retrieved.append(mapping_texts['text'])
                    if len(mapping_texts['wt_p_text_uuid']) < 2:
                        tot_sent_not_in_wt += 1
                    break
            if not is_correct:
                errors.append(ret_sentence)
        all_sents = [v['text'] for v in dp_context_mapping.values()]
        missing = list(set(all_sents) - set(correct_retrieved))

        correct_retrieved_lst.append(correct_retrieved)
        errors_lst.append(errors)
        missing_lst.append(missing)

        tot_sent += len(dp_context_mapping.keys())
        tot_sent_correct += len(correct_retrieved)
        tot_sent_missing += len(missing)
        tot_no_missing += 0 if len(missing) > 0 else 1

        missing_heights, _, _, _ = get_sents_height_on_tree(missing, datapoint)
        tot_missing_sent_height += sum([mh['height'] for mh in missing_heights])
        correct_heights, _, _, _ = get_sents_height_on_tree(correct_retrieved, datapoint)
        tot_correct_sent_height += sum([ch['height'] for ch in correct_heights])

        if verbose and len(missing) > 0:
            hypothesis = datapoint['hypothesis']
            question = datapoint['question']
            answer = datapoint['answer']

            print('hypothesis', hypothesis)
            print('Q + A', question + ' -> ' + answer)
            print('=====')
            print('retrieved:', correct_retrieved)
            print('missing:', missing)
            print()

    recall = tot_sent_correct / float(tot_sent)  # <===== HERE!
    all_correct = tot_no_missing / float(len(dataset))
    avg_correct_sent_height = tot_correct_sent_height / (float(tot_sent_correct) + 1e-9)
    avg_missing_sent_height = tot_missing_sent_height / (float(tot_sent_missing) + 1e-9)
    print('recall:', recall)
    print('all correct:', all_correct)
    print('number of retrieved not in corpus:', tot_sent_not_in_wt)
    print('avg height of correct sentences:', avg_correct_sent_height)
    print('avg height of missing sentences:', avg_missing_sent_height)

    return recall, correct_retrieved_lst, errors_lst, missing_lst
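
For what it's worth, the per-sample (macro) recall I am asking about can also be derived from the lists the function already returns, since each sample's gold set is exactly correct_retrieved plus missing. Here is a minimal sketch (my own assumption about how to reuse the return values, not code from the repository), where retrieved_sentences_lst is the same list passed to the function:

_, correct_retrieved_lst, errors_lst, missing_lst = compute_retrieval_metrics(retrieved_sentences_lst)

# Per-sample recall: |correct_retrieved| / (|correct_retrieved| + |missing|).
per_sample_recall = [
    len(c) / float(len(c) + len(m) + 1e-9)
    for c, m in zip(correct_retrieved_lst, missing_lst)
]
macro_recall = sum(per_sample_recall) / len(per_sample_recall)
print('macro-averaged recall:', macro_recall)
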
Xinrui-Wang commented 1 year ago

Did you get the same results as the paper when you ran entailment_retrieval.ipynb?

scut-xz commented 1 year ago

Did you get the same results as the paper when you ran entailment_retrieval.ipynb?

I can't get the same results; they seem to be worse than the EntailmentWriter baseline.

Xinrui-Wang commented 1 year ago

No, I got a much worse result than the EntailmentWriter baseline. I think the results reported in the paper are falsified.


scut-xz commented 1 year ago


Hi! Have you found any other method that gets a better result than the baseline?

Xinrui-Wang commented 1 year ago

I thought you were a foreigner. I haven't found a retrieval method that beats the baseline yet either. Let's add each other on WeChat and chat about it: wxr199002