Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0
17.49k stars 1.35k forks

[BUG] NanoGraphRag / KeyError: '7' #451

Open vipervs opened 3 weeks ago

vipervs commented 3 weeks ago

Description

I got the following error when doing a simple QA with NanoGraphRAG. Model: GPT-4o-mini

User-id: 1, can see public conversations: True
Session reasoning type None
Session LLM openai
Reasoning class <class 'ktem.reasoning.simple.FullQAPipeline'>
Reasoning state {'app': {'regen': False}, 'pipeline': {}}
Thinking ...
Retrievers [DocumentRetrievalPipeline(DS=<kotaemon.storages.docstores.lancedb.LanceDBDocumentStore object at 0x306b46c50>, FSPath=PosixPath('/Users/andi/kotaemon/ktem_app_data/user_data/files/index_1'), Index=<class 'ktem.index.file.index.IndexTable'>, Source=<class 'ktem.index.file.index.Source'>, VS=<kotaemon.storages.vectorstores.chroma.ChromaVectorStore object at 0x306b47a30>, get_extra_table=False, llm_scorer=LLMTrulensScoring(concurrent=True, normalize=10, prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x331cd0520>, system_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x331cd1300>, top_k=3, user_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x331f94dc0>), mmr=False, rerankers=[CohereReranking(cohere_api_key='WDIdNCKpcA7TlUc4y0IpjisPdNSPdZV8p7kXOrxI', model_name='rerank-multilingual-v2.0')], retrieval_mode='hybrid', top_k=10, userid=1), NanoGraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x102dce230>, FSPath=<theflow.base.unset_ object at 0x102dce230>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x102dce230>, VS=<theflow.base.unset_ object at 0x102dce230>, file_ids=['bac8649f-72af-44e6-b4c6-91f218d6d6a9'], userid=<theflow.base.unset_ object at 0x102dce230>)]
searching in doc_ids []
INFO:ktem.index.file.pipelines:Skip retrieval because of no selected files: DocumentRetrievalPipeline(
  (vector_retrieval): <function Function._prepare_child.<locals>.exec at 0x331c1dfc0>
  (embedding): <function Function._prepare_child.<locals>.exec at 0x331c1df30>
)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
GraphRAG embedding dim 3072
INFO:nano-graphrag:Load KV full_docs with 0 data
INFO:nano-graphrag:Load KV text_chunks with 0 data
INFO:nano-graphrag:Load KV llm_response_cache with 0 data
INFO:nano-graphrag:Load KV community_reports with 0 data
INFO:nano-graphrag:Loaded graph from /Users/andi/kotaemon/ktem_app_data/user_data/files/nano_graphrag/d897887f-bb79-42f5-aabd-d398b9a7f669/input/graph_chunk_entity_relation.graphml with 290 nodes, 188 edges
INFO:nano-vectordb:Load (276, 3072) data
INFO:nano-vectordb:Init {'embedding_dim': 3072, 'metric': 'cosine', 'storage_file': '/Users/andi/kotaemon/ktem_app_data/user_data/files/nano_graphrag/d897887f-bb79-42f5-aabd-d398b9a7f669/input/vdb_entities.json'} 276 data
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Traceback (most recent call last):
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/gradio/queueing.py", line 575, in process_events
    response = await route_utils.call_process_api(
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/gradio/route_utils.py", line 276, in call_process_api
    output = await app.get_blocks().process_api(
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1923, in process_api
    result = await self.call_function(
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1520, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/gradio/utils.py", line 663, in async_iteration
    return await iterator.__anext__()
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/gradio/utils.py", line 656, in __anext__
    return await anyio.to_thread.run_sync(
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2441, in run_sync_in_worker_thread
    return await future
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 943, in run
    result = context.run(func, args)
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/gradio/utils.py", line 639, in run_sync_iterator_async
    return next(iterator)
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/gradio/utils.py", line 801, in gen_wrapper
    response = next(iterator)
  File "/Users/andi/kotaemon/libs/ktem/ktem/pages/chat/__init__.py", line 899, in chat_fn
    for response in pipeline.stream(chat_input, conversation_id, chat_history):
  File "/Users/andi/kotaemon/libs/ktem/ktem/reasoning/simple.py", line 705, in stream
    docs, infos = self.retrieve(message, history)
  File "/Users/andi/kotaemon/libs/ktem/ktem/reasoning/simple.py", line 503, in retrieve
    retriever_docs = retriever_node(text=query)
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/theflow/base.py", line 1097, in __call__
    raise e from None
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/theflow/base.py", line 1088, in __call__
    output = self.fl.exec(func, args, kwargs)
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/theflow/backends/base.py", line 151, in exec
    return run(args, kwargs)
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/theflow/middleware.py", line 144, in __call__
    raise e from None
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/theflow/middleware.py", line 141, in __call__
    _output = self.next_call(*args, **kwargs)
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/theflow/middleware.py", line 117, in __call__
    return self.next_call(args, kwargs)
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/theflow/base.py", line 1017, in _runx
    return self.run(*args, **kwargs)
  File "/Users/andi/kotaemon/libs/ktem/ktem/index/file/graph/nano_pipelines.py", line 355, in run
    entities, relationships, reports, sources = asyncio.run(
  File "/opt/homebrew/Cellar/python@3.10/3.10.14_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/Users/andi/kotaemon/libs/ktem/ktem/index/file/graph/nano_pipelines.py", line 142, in nano_graph_rag_build_local_query_context
    use_communities = await _find_most_related_community_from_entities(
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/nano_graphrag/_op.py", line 698, in _find_most_related_community_from_entities
    related_community_keys = sorted(
  File "/Users/andi/kotaemon/venv/lib/python3.10/site-packages/nano_graphrag/_op.py", line 702, in <lambda>
    related_community_datas[k]["report_json"].get("rating", -1),
KeyError: '7'
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
User-id: 1, can see public conversations: True


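The main issue here is the KeyError: '7' raised inside the _find_most_related_community_from_entities function of the nano_graphrag module (_op.py, line 702). The sort key evaluates related_community_datas[k]["report_json"].get("rating", -1), and the related_community_datas dictionary has no entry for the key '7'. The log may point to the reason: it reports "Load KV community_reports with 0 data", meaning the community report store was empty when the query ran.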

Here’s what could be contributing to this problem:

1.  Missing Data in Dictionary: related_community_datas has no entry for the key '7', so related_community_datas[k] raises a KeyError before "report_json" is ever read. Because the reported key is '7' rather than 'report_json', this is the most likely direct cause.
2.  Incomplete or Incorrect Data Structure: If an entry exists but lacks the expected "report_json" field, the same expression would also fail, although the error would then name 'report_json' rather than '7'.
3.  Data Retrieval Logic: The lambda function assumes that every key it sorts has an entry with a "report_json" field. Only "rating" is read with a default value (.get("rating", -1)), so a missing entry or a missing "report_json" is not tolerated.
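The first cause can be reproduced in isolation. The snippet below is a minimal sketch, not the nano-graphrag source: the dictionary contents and the name related_community_keys_counts are made up for illustration, and only the rating expression is taken from the traceback.

```python
# Hypothetical stand-in data: community '7' is referenced by the retrieved
# entities, but no report is stored for it.
related_community_keys_counts = {"3": 2, "7": 1}
related_community_datas = {
    "3": {"report_json": {"rating": 8.5}},
}


def sort_key(k):
    # The second element mirrors the failing expression in the traceback;
    # the first element (the count) is an assumption made for this sketch.
    return (
        related_community_keys_counts[k],
        related_community_datas[k]["report_json"].get("rating", -1),
    )


try:
    sorted(related_community_keys_counts, key=sort_key, reverse=True)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('7')
```

The lookup fails on the outer dictionary access, which is why the error names the community key rather than "report_json" or "rating".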

How to Address This Issue:

•   Check Data Integrity: Verify the contents of the related_community_datas dictionary to ensure that all expected keys and fields exist. This might involve adding some debugging or logging to check which keys are present and how the data is structured.
•   Handle Missing Keys Gracefully: Modify the code to handle cases where a key or nested field is missing. For example, you could add a check before attempting to access related_community_datas[k]["report_json"] or use .get() methods with default values to avoid KeyError.
•   Review Data Loading: Ensure that the data being loaded into related_community_datas is complete and consistent with the requirements of the program. This might involve reviewing how data is generated or retrieved before it’s processed.
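The first two suggestions can be sketched as follows. This is an illustrative helper under stated assumptions, not a patch for nano-graphrag: rank_communities, key_counts, and community_datas are hypothetical names, and the (count, rating) sort order is assumed rather than taken from the library.

```python
import logging

logger = logging.getLogger("community-sort-sketch")


def rank_communities(key_counts, community_datas):
    """Sort community keys by (count, rating), skipping keys with no stored report."""
    # Log which keys are missing, so the data gap is visible instead of fatal.
    missing = [k for k in key_counts if k not in community_datas]
    if missing:
        logger.warning("No community report for keys: %s", missing)

    def rating_of(k):
        # Tolerate a missing or malformed "report_json" as well as a missing rating.
        report = community_datas[k].get("report_json")
        if not isinstance(report, dict):
            return -1
        return report.get("rating", -1)

    present = [k for k in key_counts if k in community_datas]
    return sorted(present, key=lambda k: (key_counts[k], rating_of(k)), reverse=True)


key_counts = {"3": 2, "7": 1, "5": 2}
community_datas = {
    "3": {"report_json": {"rating": 8.5}},
    "5": {"report_json": {}},  # report present, but without a rating
}
ranked = rank_communities(key_counts, community_datas)
print(ranked)  # ['3', '5']
```

Skipping missing keys keeps the query running, but it hides the underlying gap in the community reports, so the warning is worth keeping until the data loading itself has been reviewed.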

taprosoft commented 3 weeks ago

@vipervs this seems to be a nano-graphrag-specific issue. I sometimes observe that JSON community generation can be funky when not using a larger LLM (GPT-4o). Please also raise this, together with your request and model configuration, at https://github.com/gusye1234/nano-graphrag/.