OSU-NLP-Group / HippoRAG

HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents.
https://arxiv.org/abs/2405.14831
MIT License
900 stars · 73 forks

Always getting the same documents list as the top ranked documents. #7

Closed daniyal214 closed 1 month ago

daniyal214 commented 1 month ago

Hi, after implementing it with my custom dataset, the issue I am facing is that it always returns the first three corpus documents as the top-ranked documents when I do: ranks, scores, logs = hipporag.rank_docs(query, top_k=10)

My data setup is data/cindrella_corpus.json, which looks like this:

[
    {
        "title": "cinderella story part 1",
        "text": "The wife of a rich man fell sick, and as she felt that her end was drawing near, she called her only daughter to her bedside and said, dear child, be good and pious, and then the good God will always protect you, and I will look down on you from heaven and be near you.",
        "idx": 0
    },
    {
        "title": "cinderella story part 2",
        "text": "Thereupon she closed her eyes and departed. Every day the maiden went out to her mother's grave, and wept, and she remained pious and good. When winter came the snow spread a white sheet over the grave, and by the time the spring sun had drawn it off again, the man had taken another wife.",
        "idx": 1
    },
    {
        "title": "cinderella story part 3",
        "text": "The woman had brought with her into the house two daughters, who were beautiful and fair of face, but vile and black of heart. Now began a bad time for the poor step-child. Is the stupid goose to sit in the parlor with us, they said. He who wants to eat bread must earn it.",
        "idx": 2
    },
    .....
    .....
]
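As a side note, a quick schema check on the corpus file can catch formatting problems before indexing. This is just an illustrative sketch based on the fields shown above (title, text, idx); it is not part of the HippoRAG API:

```python
import json

def validate_corpus(path):
    """Check that every corpus entry has the expected fields and sequential idx values."""
    with open(path) as f:
        corpus = json.load(f)
    for i, doc in enumerate(corpus):
        # Each entry should carry at least the three fields used by indexing.
        assert {"title", "text", "idx"} <= set(doc), f"entry {i} is missing fields"
        # idx values should match the entry's position in the list.
        assert doc["idx"] == i, f"entry {i} has idx {doc['idx']}, expected {i}"
    return len(corpus)
```

Running validate_corpus("data/cindrella_corpus.json") should return the full passage count; a mismatch here would explain an incomplete index.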

And my indexing setup is:

%env DATA=cindrella
%env HF_RETRIEVER=facebook/contriever
%env LLM_MODEL=gpt-3.5-turbo-0125
%env SYNONYM_THRESH=0.8
%env GPUS=0
%env LLM_API=openai
%env extraction_type=ner
%env num_passages=all

# Running Open Information Extraction
!python3 src/openie_with_retrieval_option_parallel.py --dataset $DATA --llm $LLM_API --model_name $LLM_MODEL --run_ner --num_passages $num_passages # NER and OpenIE for passages
!python3 src/named_entity_extraction_parallel.py --dataset $DATA --llm $LLM_API --model_name $LLM_MODEL  # NER for queries

# Creating Contriever Graph
!python3 src/create_graph.py --dataset $DATA --model_name $HF_RETRIEVER --extraction_model $LLM_MODEL --threshold $SYNONYM_THRESH --extraction_type $extraction_type --cosine_sim_edges

# Getting Nearest Neighbor Files
%env CUDA_VISIBLE_DEVICES=0
!python3 src/RetrievalModule.py --retriever_name $HF_RETRIEVER --string_filename output/query_to_kb.tsv
!python3 src/RetrievalModule.py --retriever_name $HF_RETRIEVER --string_filename output/kb_to_kb.tsv
!python3 src/RetrievalModule.py --retriever_name $HF_RETRIEVER --string_filename output/rel_kb_to_kb.tsv

!python3 src/create_graph.py --dataset $DATA --model_name $HF_RETRIEVER --extraction_model $LLM_MODEL --threshold $SYNONYM_THRESH --create_graph --extraction_type $extraction_type --cosine_sim_edges

which produces the following output:

env: DATA=cindrella
env: HF_RETRIEVER=facebook/contriever
env: LLM_MODEL=gpt-3.5-turbo-0125
env: SYNONYM_THRESH=0.8
env: GPUS=0
env: LLM_API=openai
env: extraction_type=ner
env: num_passages=all
ner_gpt-3.5-turbo-0125_57
100%|█████████████████████████████████████████████| 5/5 [00:11<00:00,  2.26s/it]
100%|█████████████████████████████████████████████| 5/5 [00:13<00:00,  2.69s/it]
100%|█████████████████████████████████████████████| 6/6 [00:13<00:00,  2.29s/it]
100%|█████████████████████████████████████████████| 5/5 [00:13<00:00,  2.79s/it]
100%|█████████████████████████████████████████████| 6/6 [00:14<00:00,  2.36s/it]
100%|█████████████████████████████████████████████| 6/6 [00:14<00:00,  2.39s/it]
100%|█████████████████████████████████████████████| 6/6 [00:14<00:00,  2.41s/it]
100%|█████████████████████████████████████████████| 6/6 [00:15<00:00,  2.54s/it]
100%|█████████████████████████████████████████████| 6/6 [00:15<00:00,  2.56s/it]
100%|█████████████████████████████████████████████| 6/6 [00:16<00:00,  2.70s/it]
/usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
OpenIE saved to output/openie_cindrella_results_ner_gpt-3.5-turbo-0125_57.json
No queries will be processed for later retrieval. File data/cindrella.json does not exist
[nltk_data] Downloading package stopwords to /tmp...
[nltk_data]   Package stopwords is already up-to-date!
100%|████████████████████████████████████████| 57/57 [00:00<00:00, 21965.76it/s]
Correct Wiki Format: 0 out of 57
0it [00:00, ?it/s]
env: CUDA_VISIBLE_DEVICES=0
No Pre-Computed Vectors. Confirming PLM Model.
Loading PLM Vectors.
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 3008.83it/s]
100%|████████████████████████████████████| 208/208 [00:00<00:00, 2148805.99it/s]
Encoding 191 Missing Strings
tokenizer_config.json: 100%|███████████████████| 321/321 [00:00<00:00, 3.32MB/s]
vocab.txt: 100%|█████████████████████████████| 232k/232k [00:00<00:00, 22.2MB/s]
tokenizer.json: 100%|████████████████████████| 466k/466k [00:00<00:00, 46.6MB/s]
special_tokens_map.json: 100%|█████████████████| 112/112 [00:00<00:00, 1.38MB/s]
100%|████████████████████████████████████████| 191/191 [00:00<00:00, 381.71it/s]
1it [00:00, 266.22it/s]
Populating Vector Dict
100%|██████████████████████████████████████| 208/208 [00:00<00:00, 20873.67it/s]
Vectors Loaded.
No Pre-Computed Vectors. Confirming PLM Model.
Loading PLM Vectors.
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 2242.94it/s]
100%|████████████████████████████████████| 416/416 [00:00<00:00, 2420014.51it/s]
Populating Vector Dict
100%|██████████████████████████████████████| 416/416 [00:00<00:00, 23031.33it/s]
Vectors Loaded.
Chunking
Building Index
Running Index Part 0

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 9151.87it/s]
Running Index Part 1

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 9485.08it/s]
Running Index Part 2

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 9309.71it/s]
Running Index Part 3

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 9281.90it/s]
Combining Index Chunks
208it [00:00, 18482.98it/s]
208it [00:00, 3767.30it/s]
No Pre-Computed Vectors. Confirming PLM Model.
Loading PLM Vectors.
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 2021.35it/s]
100%|████████████████████████████████████| 256/256 [00:00<00:00, 2014525.00it/s]
Encoding 109 Missing Strings
100%|████████████████████████████████████████| 109/109 [00:00<00:00, 234.49it/s]
1it [00:00, 481.16it/s]
Populating Vector Dict
100%|██████████████████████████████████████| 256/256 [00:00<00:00, 22457.58it/s]
Vectors Loaded.
Chunking
Building Index
Running Index Part 0

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 8389.95it/s]
Running Index Part 1

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 8238.83it/s]
Running Index Part 2

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 8204.02it/s]
Running Index Part 3

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 7965.94it/s]
Combining Index Chunks
128it [00:00, 18736.99it/s]
128it [00:00, 3952.70it/s]
[nltk_data] Downloading package stopwords to /tmp...
[nltk_data]   Package stopwords is already up-to-date!
100%|████████████████████████████████████████| 57/57 [00:00<00:00, 21222.84it/s]
Correct Wiki Format: 0 out of 57
0it [00:00, ?it/s]
Creating Graph
100%|████████████████████████████████████████| 57/57 [00:00<00:00, 17914.97it/s]
Loading Vectors
Augmenting Graph from Similarity
100%|██████████████████████████████████████| 208/208 [00:00<00:00, 53085.99it/s]
Saving Graph
Total Phrases                                         864
Unique Phrases                                        208
Number of Individual Triples                          288
Number of Incorrectly Formatted Triples (ChatGPT Error) 7
Number of Triples w/o NER Entities (ChatGPT Error)     86
Number of Unique Individual Triples                   273
Number of Entities                                    576
Number of Relations                                   377
Number of Unique Entities                             208
Number of Synonymy Edges                              140
Number of Unique Relations                            128

My HippoRAG test file is src/hippo_test.py:

import argparse
from hipporag import HippoRAG

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', type=str)
    parser.add_argument('--query', type=str)
    args = parser.parse_args()

    hipporag = HippoRAG(corpus_name=args.dataset)

    queries = [args.query]
    for query in queries:
        ranks, scores, logs = hipporag.rank_docs(query, top_k=10)

        print(ranks)
        print(scores)
        print(logs)

and I run it as !python3 src/hippo_test.py --dataset $DATA --query "What the two white pigeons did?", which returned:

[1, 0, 2]
[1.0, 0.27169811895217233, 0.0]
{'named_entities': ['white pigeons'], 'linked_node_scores': [['white pigeons', 'white bird', 0.7852239608764648]], '1-hop_graph_for_linked_nodes': [['tree', 1.0], ['cinderella', 1.0], ['tree', 1.0, 'inv'], ['cinderella', 1.0, 'inv']], 'top_ranked_nodes': ['white bird', 'cinderella', 'tree', 'step mother', 'step sisters', 'mother s grave', 'king s son', 'birds', 'festival', 'step daughters', 'branch on mother s grave', 'wedding', 'lentils', 'hazel twig on mother s grave', 'staircase', 'golden slippers from bird', 'church', 'beautiful dress from bird', 'back door', 'king'], 'nodes_in_retrieved_doc': [['cinderella', 'dresses', 'father', 'handsome tree', 'hazel twig', 'hazel twig on mother s grave', 'jewels', 'mother s grave', 'pearls', 'step daughters', 'tree', 'white bird'], ['beautiful dress from bird', 'birds', 'branch on mother s grave', 'church', 'cinderella', 'festival', 'golden slippers from bird', 'king s son', 'mother s grave', 'staircase', 'step mother', 'step sisters', 'tree', 'wedding'], ['back door', 'birds', 'bride', 'cinderella', 'dish of lentils', 'festival', 'king', 'king s son', 'lentils', 'step mother', 'step sisters']]}
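For what it's worth, the logs dict itself already hints at the problem: nodes_in_retrieved_doc has exactly three entries, which is consistent with only three passages ever being reachable by retrieval. A quick check along these lines (the dict shape is abbreviated from the output above, keeping only the keys used below):

```python
# Shape of `logs` (the third return value of rank_docs), abbreviated from the
# printed output above; node lists are truncated for readability.
logs = {
    "named_entities": ["white pigeons"],
    "linked_node_scores": [["white pigeons", "white bird", 0.7852]],
    "nodes_in_retrieved_doc": [
        ["cinderella", "white bird"],   # retrieved doc 1
        ["cinderella", "festival"],     # retrieved doc 0
        ["cinderella", "king"],         # retrieved doc 2
    ],
}

# If only N passages made it into the index, rank_docs can never return more
# than N documents, no matter how large top_k is.
num_retrievable = len(logs["nodes_in_retrieved_doc"])
print("documents reachable by retrieval:", num_retrievable)
```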

So no matter what the question is, the result is always some ordering of documents 0, 1, and 2, and only three documents are returned rather than 10. Yet in the corpus list, the texts with idx 0, 1, or 2 contain no context about white pigeons. You can also see that top_ranked_nodes never mentions 'white pigeon', even though this is discussed in the corpus entries with idx 17, 22, etc.

Could you please tell me what the reason could be for this not retrieving the correct documents, and how it could be modified to get the desired results?

Thanks!

yhshu commented 1 month ago

Is it possible that this is due to only 3 passages being used during extraction? Perhaps the passage file was modified afterwards, but the cache was not cleared, so it still reads those 3 passages? @bernaljg
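One way to test this hypothesis is to count how many passages the cached OpenIE output actually contains. The file name below is taken from the log earlier in this thread; the "docs" key is an assumption about the JSON layout, so open the file and confirm before relying on it:

```python
import json

def count_openie_passages(path, key="docs"):
    """Count the passages recorded in a cached OpenIE result file.

    The `key` default is an assumption about the file's layout; inspect the
    actual JSON and adjust if needed.
    """
    with open(path) as f:
        results = json.load(f)
    if isinstance(results, dict):
        return len(results.get(key, []))
    return len(results)  # some layouts may be a bare list of passages

# e.g. count_openie_passages("output/openie_cindrella_results_ner_gpt-3.5-turbo-0125_57.json")
# If this returns 3 while your corpus has many more passages, the cache is stale.
```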

bernaljg commented 1 month ago

This behavior is quite strange; I tried to reproduce it in my own environment with a synthetic corpus and it works correctly. I'm willing to jump on a call and try to debug this if you'd like, though!

daniyal214 commented 1 month ago

@bernaljg Sure, that would be great. Thanks for the response.

daniyal214 commented 1 month ago

@yhshu Thanks for the response. I'm not sure, maybe. I'll try once again with a fresh environment with cleared cache, then will see if it still persists.

bernaljg commented 1 month ago

@daniyal214 I suggest you also clear the output directory before re-running. If the bug is still happening shoot me an email and we can look at it together.
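For anyone hitting the same thing, a minimal sketch of that clean-up step, assuming the cached artifacts live under output/ as in the logs above (double-check the path before deleting anything):

```python
import shutil
from pathlib import Path

def clear_cache(output_dir="output"):
    """Delete cached indexing artifacts so the next run starts from scratch."""
    path = Path(output_dir)
    if path.exists():
        shutil.rmtree(path)   # remove stale OpenIE results, vectors, NN files
    path.mkdir(parents=True)  # recreate an empty directory for the re-run
    return path
```

After clearing, re-run the full indexing pipeline (OpenIE, graph creation, nearest-neighbor files) before querying again.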

daniyal214 commented 1 month ago

@bernaljg @yhshu Thank you both for your quick and helpful responses!

It seems the issue was indeed related to the passage processing and the cache. I set up a fresh environment and re-indexed everything, which resolved the problem.

I really appreciate both of your assistance and willingness to help debug the issue. Thanks again!

I'll go ahead and close this issue now.