Closed: weicheng113 closed this issue 2 days ago
Hello,
You're right about reset_prob and the phrase_to_id map. The personalized PageRank algorithm is a graph search algorithm, so node connections are considered. The graph self.g
has stored nodes and edges. If you're asking why we don't assign edges with weights, it's because, so far, we've found that this doesn't help the graph search well.
Nodes are connected in two forms: 1. extraction of OpenIE and 2. synonymous edges. Please check our paper for more details. Thus, it's not that all nodes are connected to each other.
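To make the mechanism above concrete, here is a minimal pure-Python sketch of personalized PageRank via power iteration (the repo itself uses igraph; the graph, phrases, and values below are illustrative, not from the actual KB). Edges come from the two sources named above, OpenIE triples and synonymy links, and the reset vector concentrates probability on the phrases linked to the query, so the walk stays in their neighborhood:

```python
# Minimal sketch, NOT the HippoRAG implementation: personalized PageRank by
# power iteration on a tiny unweighted phrase graph.

def personalized_pagerank(n_nodes, edges, reset, damping=0.85, iters=100):
    """edges: undirected (u, v) pairs; reset: personalization distribution."""
    neighbors = [[] for _ in range(n_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    rank = list(reset)
    for _ in range(iters):
        new = [(1.0 - damping) * reset[i] for i in range(n_nodes)]
        for u in range(n_nodes):
            deg = len(neighbors[u])
            if deg == 0:  # dangling node: redistribute its mass via reset
                for i in range(n_nodes):
                    new[i] += damping * rank[u] * reset[i]
            else:
                share = damping * rank[u] / deg
                for v in neighbors[u]:
                    new[v] += share
        rank = new
    return rank

# Phrase nodes 0-3; edges (0,1), (1,2) from OpenIE triples, (2,3) a synonym edge.
edges = [(0, 1), (1, 2), (2, 3)]
reset = [1.0, 0.0, 0.0, 0.0]   # the query linked only to phrase 0
scores = personalized_pagerank(4, edges, reset)
```

Scores decay with graph distance from the query-linked phrase, which is the sense in which node connections are "considered" even though the personalization vector alone mentions only the linked phrases.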
@yhshu Thanks for your time and prompt reply. Sorry for my negligence; I didn't realize that the connections are initialized in the constructor, which makes sense. Appreciate your help. Thanks.
@yhshu Hello, one more question regarding the following implementation.
Is there a need to consider whether sim_edge already exists in the original graph built from the triples? Otherwise, the node connection weight will be overwritten. For example, we could do the following instead, or skip the update if the edge already exists. I just came across this; I'm not sure whether this case occurs and, if so, how often.
graph_plus[sim_edge] += similarity_max * score # += instead of =
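A toy illustration of the difference between the two update policies (the names mirror this thread; the plain dict and the values stand in for the repo's graph_plus, they are not its actual contents). With `=`, a synonym edge that collides with an existing triple edge replaces its weight; with a get-or-default accumulation, both contributions survive:

```python
# Illustrative sketch, not the repo's logic.
graph_plus = {("einstein", "physicist"): 1.0}  # weight from an OpenIE triple

sim_edge = ("einstein", "physicist")           # synonym edge hits the same pair
similarity_max, score = 0.9, 0.5

overwritten = dict(graph_plus)
overwritten[sim_edge] = similarity_max * score                  # 1.0 is lost

accumulated = dict(graph_plus)
accumulated[sim_edge] = accumulated.get(sim_edge, 0.0) + similarity_max * score
```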
Reading a bit more on the retrieval side. When doing retrieval, it looks like the edge weights are considered in the following implementation, where they are assigned to the graph via self.g.es:
self.g.es['weight'] = [self.graph_plus[(v1, v3)] for v1, v3 in edges]
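A sketch of the lookup pattern in that line (toy node IDs and weights, not the real dictionary): the graph stores an ordered edge list, and the weight array is built by looking each (v1, v3) pair up in the weight dict, so edge order and weight order stay aligned. The weights then bias the random walk, since a walker leaving a node follows each incident edge in proportion to its weight:

```python
# Illustrative values only.
graph_plus = {(0, 1): 1.0, (1, 2): 0.45, (2, 3): 2.0}
edges = [(0, 1), (1, 2), (2, 3)]     # same order the graph stores them in

weights = [graph_plus[(v1, v3)] for v1, v3 in edges]

# At node 2, the heavier (2, 3) edge is favored over (1, 2):
incident = {e: w for e, w in zip(edges, weights) if 2 in e}
p_to_3 = incident[(2, 3)] / sum(incident.values())
```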
Thanks for your time and help. Cheng Wei
I think @bernaljg can answer this better.
hi @weicheng113,
Thanks for the question! Yes, you're right: similarity edge weights overwrite relation edge weights in our current implementation. This likely doesn't happen very often, and when it does, the difference between the existing edge weight and the score assigned via the similarity-edge mechanism is small. I expect this change would have a negligible effect on the algorithm's output, but this would have to be verified empirically.
Thanks a lot @yhshu and @bernaljg, for your help and detailed explanation. It is clear now.
Hello @yhshu, I have a question regarding the dense retrieval model used. I'm wondering why you didn't choose a more recent dense embedding model, such as the BAAI/bge models, instead of Contriever from Facebook. Is there a reason? Thanks.
When we conducted the research, models such as BGE, released earlier this year, were contemporaneous work. Of course, users can choose their own embedding models, and we are also exploring this aspect.
Thanks @yhshu for your prompt reply. HippoRAG is very attractive for real-world applications because of its speed and elegant idea. Currently, I am more interested in combining it with Contriever (or other dense embedding models). For ColBERTv2, I have a couple of concerns. First, it is tightly coupled with faiss, which doesn't look very user-friendly. Second, incrementally adding new documents to, or updating existing documents in, a document set seems complicated for ColBERTv2, since centroids need to be recalculated, and so on.
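Whichever encoder is chosen, the dense-retrieval interface is small: embed texts, score by cosine similarity, take the top k. A stdlib-only sketch with a stand-in hashed bag-of-words `embed` (everything here is illustrative; in practice `embed` would call Contriever, a BGE model, or any other embedding model, and only that function would need to change):

```python
import math

def embed(text, dim=64):
    # Stand-in embedding: hashed bag-of-words. A real system would call
    # Contriever, a BGE model, etc.; the rest of the pipeline is unchanged.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, top_k=1):
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

docs = ["the hippocampus consolidates memory", "pagerank ranks web pages"]
```

This is also why swapping encoders sidesteps the ColBERTv2 concerns above: a single-vector encoder needs no faiss centroid index, so adding a document is just embedding it.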
@yhshu Hello, just to let you know, I wrote an email to your paper contact about a code improvement. Thanks.
Dear authors,
Thanks for sharing this great paper and its implementation details. I have a question regarding the following implementation of PPR on KB nodes.
https://github.com/OSU-NLP-Group/HippoRAG/blob/5ddedf0f516c7bbed777ba54da680c8bb8fb8f84/src/hipporag.py#L524-L534
From my understanding, self.kb_node_phrase_to_id is a mapping from document entity phrases to IDs, and reset_prob holds the weights of the linked query entity phrases. My question is: when running personalized PageRank, why are the graph's node connections (one entity phrase connecting to another; it seems we have this information from the triples?) not considered? How are these nodes connected at the moment? Do we treat all nodes as connected to each other?
Thanks in advance for your time. Cheng
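For reference, a toy sketch of how the two structures in the question relate (illustrative phrases and scores, not the repo's actual construction): kb_node_phrase_to_id maps each KB phrase to a graph-node index, and reset_prob is a vector over those indices that is nonzero only at the phrases the query was linked to:

```python
# Illustrative values only.
kb_node_phrase_to_id = {"einstein": 0, "physicist": 1, "relativity": 2}

# Phrases the query linked to, with made-up linking scores.
linked_query_phrases = {"einstein": 0.9, "relativity": 0.6}

reset_prob = [0.0] * len(kb_node_phrase_to_id)
for phrase, score in linked_query_phrases.items():
    reset_prob[kb_node_phrase_to_id[phrase]] = score

total = sum(reset_prob)
reset_prob = [p / total for p in reset_prob]  # normalize to a distribution
```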