OSU-NLP-Group / HippoRAG

HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents.
https://arxiv.org/abs/2405.14831
MIT License

Corpus Setup for Lengthy Texts #5

Closed · daniyal214 closed this issue 6 days ago

daniyal214 commented 1 month ago

Hi, I've been exploring your remarkable framework and it has indeed yielded great results. I'm particularly interested in applying it to lengthy documents, such as books.

I'd like to seek your guidance on the recommended approach for such cases, because it looks quite challenging to set up the corpus following the same convention (title, text, idx) mentioned in the repo:

[
  {
    "title": "FIRST PASSAGE TITLE",
    "text": "FIRST PASSAGE TEXT",
    "idx": 0
  },
  {
    "title": "SECOND PASSAGE TITLE",
    "text": "SECOND PASSAGE TEXT",
    "idx": 1
  }
] 

I was thinking about creating chunks from the textbook, but I'm not sure how to generate a title for each chunk, or whether the title is mandatory at all.

So I'm curious whether it's advisable to convert the text file of a medical book into the same JSON format, strictly following the convention with multiple title, text, and idx entries. If so, what is the recommended approach for such long documents? Or is it feasible to use the entire text as a single passage with a single title and idx?

Thanks!

yhshu commented 1 month ago

Hello,

Thanks for your interest! The title is treated as part of the text. If you chunk by section, then the section title corresponding to that chunk would be an appropriate choice for the title. If no such title exists, I suggest assigning a logically sensible ID for your case (e.g., Section 1.1, Section 1.2) or just leaving it blank.

bernaljg commented 1 month ago

Just to expand on this answer for the case of long documents: we definitely noticed a drop in OpenIE quality as document length increased. I would strongly suggest splitting your text into chunks of 2-3 sentences each, or at most a paragraph. Also, adding the title to each chunk will likely give the LLM more context when extracting triples. A sketch of this kind of chunking follows below.
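
To make that concrete, here is a minimal sketch of chunking a plain-text book into 2-3 sentence passages in the corpus format shown above. The `chunk_book` helper, the regex sentence splitter, and the per-chunk "part N" titles are illustrative assumptions, not part of the HippoRAG API; swap in a proper sentence tokenizer (nltk, spacy) and real section titles where you have them.

import json
import re

def chunk_book(text, title, sentences_per_chunk=3):
    # Naive regex sentence splitter, for illustration only:
    # splits on whitespace that follows ., !, or ?.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    corpus = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk_id = i // sentences_per_chunk
        corpus.append({
            # Prepending a title gives the LLM extra context during OpenIE;
            # "part N" is a stand-in when no natural section title exists.
            "title": f"{title}, part {chunk_id + 1}",
            "text": " ".join(sentences[i:i + sentences_per_chunk]),
            "idx": chunk_id,
        })
    return corpus

with open("medical_book.txt") as f:  # hypothetical input file
    corpus = chunk_book(f.read(), title="Medical Book")

with open("corpus.json", "w") as f:
    json.dump(corpus, f, indent=2)
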

daniyal214 commented 1 month ago

@yhshu and @bernaljg

Thank you both for your prompt responses and valuable advice; your assistance is greatly appreciated.

I am seeking further guidance on using the node parameters when employing the hipporag.rank_docs(query, top_k=10) method, which returns ranks, scores, and logs. Specifically, I am interested in best practices for utilizing the node parameters to achieve optimal RAG output.

My question is whether the retrieved documents, ranked by their scores, should simply be presented to the LLM as context, with an accompanying prompt to generate the desired answer from that context (as we do in traditional RAG for final response generation). Or is there a recommended approach for generating the final response, perhaps by leveraging the top-ranked nodes? For example, in one of my cases I received these results:

# Ranks
[1, 0, 2]
# Scores
[1.0, 0.27169811895217233, 0.0]
# Logs
{'named_entities': ['white pigeons'],
 'linked_node_scores': [['white pigeons', 'white bird', 0.7852239608764648]],
 '1-hop_graph_for_linked_nodes': [['tree', 1.0], ['cinderella', 1.0], ['tree', 1.0, 'inv'], ['cinderella', 1.0, 'inv']],
 'top_ranked_nodes': ['white bird', 'cinderella', 'tree', 'step mother', 'step sisters', 'mother s grave', 'king s son', 'birds', 'festival', 'step daughters', 'branch on mother s grave', 'wedding', 'lentils', 'hazel twig on mother s grave', 'staircase', 'golden slippers from bird', 'church', 'beautiful dress from bird', 'back door', 'king'],
 'nodes_in_retrieved_doc': [['cinderella', 'dresses', 'father', 'handsome tree', 'hazel twig', 'hazel twig on mother s grave', 'jewels', 'mother s grave', 'pearls', 'step daughters', 'tree', 'white bird'],
                            ['beautiful dress from bird', 'birds', 'branch on mother s grave', 'church', 'cinderella', 'festival', 'golden slippers from bird', 'king s son', 'mother s grave', 'staircase', 'step mother', 'step sisters', 'tree', 'wedding'],
                            ['back door', 'birds', 'bride', 'cinderella', 'dish of lentils', 'festival', 'king', 'king s son', 'lentils', 'step mother', 'step sisters']]}

Do we already have an implementation of final response generation through an LLM in the codebase? I couldn't find such a method in the HippoRAG API.

I would greatly appreciate your expert advice on this matter.

yhshu commented 1 month ago

For generation after retrieval, you should use these retrieved nodes to obtain your original documents, and then use those documents for the final generation. For the generation part, you could check the QA reader and adapt it for your own purposes. This reader reads the top-ranked documents (not nodes) from an input file to do the generation.
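
For later readers, a minimal sketch of that flow, outside the QA reader. It assumes `hipporag` is an already-initialized HippoRAG instance indexed over corpus.json, that `ranks` holds corpus indices in descending score order (as the example output above suggests), and that an OpenAI chat model stands in for whatever LLM you actually use; the model name and prompt are illustrative.

import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Load the same corpus that was used to build the HippoRAG index.
with open("corpus.json") as f:
    corpus = json.load(f)

query = "Who helped Cinderella get to the festival?"

# rank_docs is the method referenced earlier in this thread;
# it returns document ranks, scores, and retrieval logs.
ranks, scores, logs = hipporag.rank_docs(query, top_k=3)

# Map the ranked indices back to the original passages and
# concatenate them into a single context string.
context = "\n\n".join(
    f"{corpus[i]['title']}\n{corpus[i]['text']}" for i in ranks
)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)

# Model choice is an assumption; swap in whatever your QA reader uses.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
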

daniyal214 commented 1 month ago

That's great, thanks! I will check that out and get back to you with the results.