BaranziniLab / KG_RAG

Empower Large Language Models (LLM) using Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) for knowledge intensive tasks
Apache License 2.0

Error in retrieving context for some diseases #28

Closed janjoy closed 5 months ago

janjoy commented 5 months ago

Hi @karthiksoman,

I am trying to run the true_false_generation notebook and came across this error where it's not able to retrieve context from SPOKE for some diseases.

for index, row in question_df.iterrows():
    question = row["text"]
    context = retrieve_context(row["text"], vectorstore, embedding_function_for_context_retrieval, node_context_df, CONTEXT_VOLUME, QUESTION_VS_CONTEXT_SIMILARITY_PERCENTILE_THRESHOLD, QUESTION_VS_CONTEXT_MINIMUM_SIMILARITY)
    # Print the first few lines of the retrieved context
    context_lines = context.split("\n")[:3]
    print(context_lines)

For example, for the question "Neurofibromatosis 2 is not associated with Gene NF2" (disease: Neurofibromatosis 2), it fails with the following error:


IndexError                                Traceback (most recent call last)
File ~/miniconda3/envs/kg_rag/lib/python3.10/site-packages/tenacity/__init__.py:382, in Retrying.__call__(self, fn, *args, **kwargs)
    381 try:
--> 382     result = fn(*args, **kwargs)
    383 except BaseException:  # noqa: B902

File ~/sulab_projects/KG_RAG/kg_rag/utility.py:125, in get_context_using_spoke_api(node_value)
    124 context = merge_2['context'].str.cat(sep=' ')
--> 125 context += node_value + " has a " + node_context[0]["data"]["properties"]["source"] + " identifier of " + node_context[0]["data"]["properties"]["identifier"] + " and Provenance of this association is " + node_context[0]["data"]["properties"]["source"] + "."
    126 return context

IndexError: list index out of range

The above exception was the direct cause of the following exception:

RetryError                                Traceback (most recent call last)
Cell In[132], line 3
      1 for index, row in question_df.iterrows():
      2     question = row["text"]
----> 3     context = retrieve_context(row["text"], vectorstore, embedding_function_for_context_retrieval, node_context_df, CONTEXT_VOLUME, QUESTION_VS_CONTEXT_SIMILARITY_PERCENTILE_THRESHOLD, QUESTION_VS_CONTEXT_MINIMUM_SIMILARITY)
      4     # find context first few lines and last few lines
      5     context_lines = context.split("\n")[:3]

Cell In[79], line 15
...
--> 326 raise retry_exc from fut.exception()
    328 if self.wait:
    329     sleep = self.wait(retry_state)

RetryError: RetryError[<Future at 0x7fa2361b66e0 state=finished raised IndexError>]
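Reading the traceback: the IndexError is raised at utility.py line 125, where node_context[0] is indexed while node_context is an empty list, and tenacity then re-raises it as a RetryError. A minimal sketch of a guard for that one line is shown here. The helper name provenance_sentence is hypothetical and not part of KG_RAG; only the node_context[0]["data"]["properties"] layout is taken from the traceback, and returning an empty string for a missing node is an assumption about the desired behaviour.

```python
def provenance_sentence(node_value, node_context):
    """Hypothetical helper (not part of KG_RAG).

    Builds the trailing provenance sentence that utility.py line 125 appends,
    but returns an empty string when no node was returned for this name
    instead of raising IndexError.
    """
    if not node_context:
        # No node came back for this name, so there is nothing to describe.
        return ""
    props = node_context[0]["data"]["properties"]
    return (
        node_value + " has a " + props["source"]
        + " identifier of " + props["identifier"]
        + " and Provenance of this association is " + props["source"] + "."
    )
```

With a guard like this the caller gets an empty context for an unknown disease name rather than a RetryError, which makes the missing node easier to spot and log.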

karthiksoman commented 5 months ago

@janjoy can you post the link to the notebook that you mentioned? I couldn't locate the notebook named 'true_false_generation' in the notebooks directory of KG-RAG.

janjoy commented 5 months ago

@karthiksoman I was trying to run this file, https://github.com/BaranziniLab/KG_RAG/blob/main/kg_rag/rag_based_generation/GPT/run_true_false_generation.py, and it was giving some errors. So I checked which questions (https://github.com/BaranziniLab/KG_RAG/blob/main/data/benchmark_data/true_false_questions.csv) it fails to retrieve context for. One failing example is the statement "Neurofibromatosis 2 is not associated with Gene NF2" in the CSV file. It raised an error because no context was retrieved from SPOKE. I hope this is clear. Please let me know if you have more questions. Thank you =)

janjoy commented 5 months ago

@karthiksoman I checked again and found that SPOKE is not able to retrieve context for these two diseases from the list: Neurofibromatosis 2 and Familial Mediterranean Fever
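One way to produce such a list is to run every question through retrieval and record the ones that raise an error or return nothing. The following is only a sketch: find_failing_questions is a hypothetical helper, and retrieve stands in for retrieve_context with its other arguments already bound (for example via functools.partial), which is an assumption about how it would be wired up.

```python
def find_failing_questions(questions, retrieve):
    """Hypothetical helper: return (question, reason) pairs for questions
    whose context retrieval raises an exception or yields an empty string.

    `retrieve` is any callable that takes a question string and returns the
    context string.
    """
    failures = []
    for question in questions:
        try:
            context = retrieve(question)
        except Exception as exc:  # includes tenacity's RetryError
            failures.append((question, type(exc).__name__))
            continue
        if not context.strip():
            failures.append((question, "empty context"))
    return failures
```

Passing question_df["text"] as questions would check the whole benchmark file in one pass.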

karthiksoman commented 5 months ago

@janjoy Apologies for the delay! I was on vacation :)

KG-RAG is not able to fetch context for these two diseases because SPOKE was updated. The names of these two diseases changed in the update and no longer match the names stored in the vector database, so no context is returned for them.

For example: when you ask 'Neurofibromatosis 2 is not associated with Gene NF2', KG-RAG extracts 'Neurofibromatosis 2' from the query. But 'Neurofibromatosis 2' is currently not part of the SPOKE graph (it was before the update). Since the graph no longer has that node, no context is returned.

This happened because the underlying Disease Ontology database (https://disease-ontology.org/) updated its data, and the change was reflected in SPOKE (SPOKE always synchronizes its data with the underlying parent database, in this case the Disease Ontology database). I presume this update affected only a handful of disease nodes. If you happen to encounter more such cases, please let me know, so that I can give you the file that contains the disease names based on the current version of SPOKE. You may then need to re-create the vectorDB so that it is in sync with the current version of SPOKE.
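The mismatch described above can be checked offline once both name lists are available. The sketch below assumes two plain lists of strings: the disease names stored in the vector database and the names from the current SPOKE version. The function find_stale_names is hypothetical, and the suggestions it returns are only string-similarity guesses from Python's difflib, not an authoritative old-to-new mapping.

```python
import difflib


def find_stale_names(stored_names, current_names, n_suggestions=3):
    """Hypothetical helper: map each stored name that is missing from the
    current name list to a list of similar-looking current names."""
    current = sorted(set(current_names))
    current_lookup = set(current)
    stale = {}
    for name in stored_names:
        if name not in current_lookup:
            # Similar spellings may hint at a rename; verify them manually.
            stale[name] = difflib.get_close_matches(name, current, n=n_suggestions)
    return stale
```

An empty result would mean the vector database is already consistent with the current names; any keys in the result are names that need the vectorDB to be re-created or remapped.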

karthiksoman commented 5 months ago

@janjoy I am closing this issue since it addressed the reason for your question. Feel free to re-open it if you have more follow-up questions on this.

janjoy commented 5 months ago

Thanks Karthik!

janjoy commented 3 months ago

Hi @karthiksoman, I would like to request the file that contains the disease names based on the current version of SPOKE. I am having trouble retrieving context for many such diseases, especially when running the MCQ test questions. Thank you!