OSU-NLP-Group / HippoRAG

[NeurIPS'24] HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents. RAG + Knowledge Graphs + Personalized PageRank.
https://arxiv.org/abs/2405.14831
MIT License
1.42k stars, 117 forks

retrieval evaluation #57

Open sjk0825 opened 1 month ago

sjk0825 commented 1 month ago

ircot_hipporag.py includes a recall computation.

I have a question about the evaluation process in the source code below. The code appears to compute title-level recall: if the title of a retrieved passage matches the title of a supporting fact (sp), the recall score increases.

Is the retrieval evaluation in your project title-level?

    # calculate recall
    if args.dataset in ['hotpotqa', 'hotpotqa_train']:
        gold_passages = [item for item in sample['supporting_facts']]
        gold_items = set([item[0] for item in gold_passages])
        retrieved_items = [passage.split('\n')[0].strip() for passage in retrieved_passages]
    elif args.dataset in ['2wikimultihopqa']:
        gold_passages = [item for item in sample['supporting_facts']]
        gold_items = set([item[0] for item in gold_passages])
        retrieved_items = [passage.split('\n')[0].strip() for passage in retrieved_passages]
    else:
        gold_passages = [item for item in sample['paragraphs'] if item['is_supporting']]
        gold_items = set([item['title'] + '\n' + item['text'] for item in gold_passages])
        retrieved_items = retrieved_passages

    # calculate metrics
    recall = dict()
    print(f'idx: {sample_idx + 1} ', end='')
    for k in k_list:
        recall[k] = round(sum(1 for t in gold_items if t in retrieved_items[:k]) / len(gold_items), 4)
        total_recall[k] += recall[k]
        print(f'R@{k}: {total_recall[k] / (sample_idx + 1):.4f} ', end='')
    print()
    print('[ITERATION]', it, '[PASSAGE]', len(retrieved_passages), '[THOUGHT]', thoughts)

    # record results

bernaljg commented 1 month ago

Hi, thanks for the question. We used the evaluation scheme appropriate to each dataset. For HotpotQA and 2WikiMultiHop we evaluated on passage titles, since they are unique. For MuSiQue we used the entire passage, since many passages share a title.
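To illustrate the distinction, here is a minimal sketch (not the repository's code) of recall@k computed at the title level and at the full-passage level. The helper names `recall_at_k` and `title_of` and the toy passages are hypothetical; the "title, newline, text" layout is assumed from the snippet quoted earlier in this thread.

```python
def recall_at_k(gold_items, retrieved_items, k):
    """Fraction of gold items that appear among the top-k retrieved items."""
    top_k = retrieved_items[:k]
    return sum(1 for g in gold_items if g in top_k) / len(gold_items)


def title_of(passage):
    # Assumption: each passage string is laid out as "title\ntext".
    return passage.split('\n')[0].strip()


# Hypothetical ranked retrieval output (toy data, not from any dataset).
retrieved_passages = [
    "Title A\nfirst passage under A",
    "Title B\nonly passage under B",
    "Title C\nonly passage under C",
]
retrieved_titles = [title_of(p) for p in retrieved_passages]

# Title-level evaluation: gold items are titles (safe when titles are unique).
gold_titles = {"Title A", "Title C"}
title_recall_at_2 = recall_at_k(gold_titles, retrieved_titles, 2)
title_recall_at_3 = recall_at_k(gold_titles, retrieved_titles, 3)

# Passage-level evaluation: gold items are full "title\ntext" strings.
# Here the gold passage shares its title with a retrieved passage but has
# different text, so only the title-level check would count it as a hit.
gold_passages = {"Title A\nsecond passage under A"}
passage_recall_at_3 = recall_at_k(gold_passages, retrieved_passages, 3)
title_level_same_gold = recall_at_k({"Title A"}, retrieved_titles, 3)

print(title_recall_at_2, title_recall_at_3)
print(passage_recall_at_3, title_level_same_gold)
```

In this toy case the title-level check credits a passage that merely shares a title with the gold passage, which is why full-passage matching is the safer choice when titles are not unique.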

sjk0825 commented 1 month ago

Thank you for your kind answer. I have another question.

What counts as a passage in your paper? For example, in 2WikiMultiHop, the title Teutberga has two passages under it. Those two passages share the same title, so the title is not unique per passage.

['Teutberga', ['Teutberga( died 11 November 875) was a queen of Lotharingia by marriage to Lothair II.', "She was a daughter of Bosonid Boso the Elder and sister of Hucbert, the lay- abbot of St. Maurice's Abbey."]]

So does "passage" mean the concatenation of these sentences, or each individual sentence under the same title?

bernaljg commented 1 month ago

Right. For 2WikiMultiHop, we concatenate the sentences under a title into a single passage, and we consider a passage relevant if it contains a supporting sentence.
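As a rough sketch of that procedure (not the repository's preprocessing code), the snippet below builds one passage per title and marks a passage as relevant when its title appears in the supporting facts. The function names, the space separator used for concatenation, and the `supporting_facts` value are assumptions made for illustration; the context entry is the Teutberga example quoted above.

```python
def build_passages(context, sep=' '):
    """Concatenate the sentences under each title into one "title\\ntext" passage.

    Assumption: sentences are joined with a single space.
    """
    return {title: title + '\n' + sep.join(sentences) for title, sentences in context}


def supporting_titles(supporting_facts):
    """Titles that contain at least one supporting sentence."""
    return {title for title, _sentence_idx in supporting_facts}


# Context entry in [title, [sentence, ...]] form, taken from the example above.
context = [
    ['Teutberga', [
        'Teutberga( died 11 November 875) was a queen of Lotharingia by marriage to Lothair II.',
        "She was a daughter of Bosonid Boso the Elder and sister of Hucbert, the lay- abbot of St. Maurice's Abbey.",
    ]],
]

# Hypothetical supporting fact in [title, sentence_index] form.
supporting_facts = [['Teutberga', 1]]

passages = build_passages(context)
gold_titles = supporting_titles(supporting_facts)

# A passage counts as relevant if any of its sentences is a supporting fact,
# which reduces to checking whether its title is among the gold titles.
relevant_titles = [title for title in passages if title in gold_titles]
print(relevant_titles)
```

Because each title maps to exactly one concatenated passage under this scheme, matching on titles and matching on passages give the same result.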