OSU-NLP-Group / HippoRAG

HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents. RAG + Knowledge Graphs + Personalized PageRank.
https://arxiv.org/abs/2405.14831
MIT License

A question regarding PPR on KB nodes #48

Closed weicheng113 closed 2 days ago

weicheng113 commented 1 month ago

Dear authors,

Thanks for sharing this great paper and its implementation details. I have a question regarding the following implementation of PPR on KB nodes.

https://github.com/OSU-NLP-Group/HippoRAG/blob/5ddedf0f516c7bbed777ba54da680c8bb8fb8f84/src/hipporag.py#L524-L534

From my understanding, self.kb_node_phrase_to_id maps document entity phrases to ids, and reset_prob holds the weights of the linked query entity phrases. My question is: when doing personalized PageRank, why are graph node connections (one entity phrase connecting to another, information we seem to have from the triples) not considered? How are these nodes connected at the moment? Do we treat all nodes as connected to each other?

Thanks in advance for your time. Cheng

yhshu commented 1 month ago

Hello,

You're right about reset_prob and the phrase_to_id map. Personalized PageRank is a graph search algorithm, so node connections are considered: the graph self.g already stores the nodes and edges. If you're asking why we don't assign weights to the edges at this point, it's because, so far, we've found that this doesn't help the graph search much.

Nodes are connected in two ways: 1. edges extracted by OpenIE and 2. synonymy edges. Please check our paper for more details. So it's not the case that all nodes are connected to each other.
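For anyone following along, here is a minimal self-contained sketch of the idea (the repo itself uses python-igraph's personalized PageRank; the phrases, damping factor, and iteration count below are made up for illustration):

```python
# Sketch of personalized PageRank over a phrase graph (hypothetical data).
def personalized_pagerank(adj, reset, damping=0.5, iters=100):
    """adj: {node: [neighbors]} with undirected edges listed both ways;
    reset: restart distribution over nodes (the query-linked phrases)."""
    rank = {n: reset.get(n, 0.0) for n in adj}
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * reset.get(n, 0.0) for n in adj}
        for n, nbrs in adj.items():
            if not nbrs:
                nxt[n] += damping * rank[n]  # dangling node keeps its mass
                continue
            share = damping * rank[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        rank = nxt
    return rank

# Edges come from OpenIE triples (e.g. ("Stanford", "located in", "California"))
# plus synonymy edges (e.g. "Stanford" ~ "Stanford University").
adj = {
    "Stanford": ["California", "Stanford University"],
    "California": ["Stanford"],
    "Stanford University": ["Stanford"],
    "Cincinnati": [],  # a phrase not connected to the query phrases
}
reset = {"Stanford": 1.0}  # query entity phrase weighting (reset_prob)
rank = personalized_pagerank(adj, reset)
```

Mass only flows along stored edges, so "California" picks up score from "Stanford" while the unconnected "Cincinnati" gets nothing; that is what makes the result different from treating all nodes as connected.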

weicheng113 commented 1 month ago

@yhshu Thanks for your time and prompt reply. Sorry for my negligence; I did not realize that the connections are initialized in the constructor, which makes sense. Appreciate your help. Thanks.

weicheng113 commented 1 month ago

@yhshu Hello, one more question regarding the following implementation.

https://github.com/OSU-NLP-Group/HippoRAG/blob/5ddedf0f516c7bbed777ba54da680c8bb8fb8f84/src/create_graph.py#L287-L295

Is there a need to check whether sim_edge already exists in the original graph built from triples? Otherwise, the node connection weight will be overwritten. For example, we could do the following instead, or skip the edge if it already exists. I just came across this; I am not sure whether this case occurs and how often.

graph_plus[sim_edge] += similarity_max * score  # += instead of =
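To make the concern concrete, here is a small sketch (with made-up phrases and scores) of how the two update rules differ when a synonymy edge lands on a node pair that already has a relation edge:

```python
# Hypothetical: a relation edge from an OpenIE triple already carries weight 0.9
graph_plus = {("stanford", "california"): 0.9}
sim_edge = ("stanford", "california")   # synonymy edge hits the same node pair
similarity_max, score = 0.8, 0.5        # made-up similarity values

# Current behavior: plain assignment overwrites the relation weight
overwrite = dict(graph_plus)
overwrite[sim_edge] = similarity_max * score          # 0.9 -> 0.4

# Proposed behavior: accumulate, so the relation weight is preserved
accumulate = dict(graph_plus)
accumulate[sim_edge] = accumulate.get(sim_edge, 0.0) + similarity_max * score  # 0.9 -> 1.3
```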

Reading a bit more on the retrieval side: when doing retrieval, it looks like the edge weight is considered in the following implementation.

https://github.com/OSU-NLP-Group/HippoRAG/blob/5ddedf0f516c7bbed777ba54da680c8bb8fb8f84/src/hipporag.py#L459-L467

The edge weights are assigned to the graph self.g.es here:

self.g.es['weight'] = [self.graph_plus[(v1, v3)] for v1, v3 in edges]
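For illustration (hypothetical phrases and weights), the weight list is ordered to match the edge list, and each edge's weight then scales how much probability mass spreads along it during the walk:

```python
# Hypothetical phrase graph: weights keyed by node pair, as in graph_plus
graph_plus = {("apple", "fruit"): 2.0, ("apple", "company"): 1.0}
edges = list(graph_plus)
weights = [graph_plus[(v1, v3)] for v1, v3 in edges]  # ordered like the edge list

# One step of a weighted walk from "apple": mass splits in proportion to weight
out = [(v3, w) for (v1, v3), w in zip(edges, weights) if v1 == "apple"]
total = sum(w for _, w in out)
step = {v3: w / total for v3, w in out}
```

So with these made-up numbers, "fruit" receives twice the mass of "company", which is why an overwritten weight would change the retrieval scores.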

Thanks for your time and help. Cheng Wei

yhshu commented 1 month ago

I think @bernaljg can answer this better.

bernaljg commented 1 month ago

hi @weicheng113,

Thanks for the question! Yeah, you're right: similarity edge weights overwrite relation edge weights in our current implementation. This likely doesn't happen very often, and when it does, the difference between the existing edge weight and the score assigned via this similarity edge mechanism is small. I expect this change would have a negligible effect on the algorithm's output, but this would have to be verified empirically.

weicheng113 commented 1 month ago

Thanks a lot @yhshu and @bernaljg, for your help and detailed explanation. It is clear now.

weicheng113 commented 1 month ago

Hello @yhshu, I have a question regarding the dense retrieval model used. I wonder why you did not choose a more recent dense embedding model, such as the BAAI/bge models, instead of Contriever from Facebook? Is there a particular reason? Thanks.

yhshu commented 1 month ago

When we conducted the research, models such as BGE, released earlier this year, were contemporaneous work. Of course, users can choose their own embedding models, and we are also exploring this aspect.

weicheng113 commented 1 month ago

Thanks @yhshu for your prompt reply. HippoRAG is very attractive for real-world applications because of its speed and elegant idea. Currently, I am more interested in combining it with Contriever (or other alternative dense embedding models). For ColBERTv2, I have a couple of concerns. First, it is tightly coupled with faiss, which does not look very user-friendly. Second, incrementally adding a new document or updating an existing document in a document set seems complicated for ColBERTv2, requiring recalculating centroids and so on.

weicheng113 commented 3 days ago

@yhshu Hello, just letting you know that I wrote an email to your paper's contact address about a code improvement. Thanks.