HKUDS / LightRAG

"LightRAG: Simple and Fast Retrieval-Augmented Generation"
https://arxiv.org/abs/2410.05779
MIT License
9.63k stars, 1.19k forks

How to better control entity extraction and prevent hallucinating generic entities when attempting to categorize extracted information? #191

Closed Feed-dev closed 2 weeks ago

Feed-dev commented 3 weeks ago

What is going on when the indexing process triggers a summary of an unrelated topic? The summaries always involve the same keywords: "JOHN DOE", "ELON MUSK", "NEW YORK", "JOHN SMITH", "NASA", and more. These summary keywords are totally unrelated to the books I am ingesting into the index. It happens with many different PDF books on similar niche topics. The PDF book processed here is about out-of-body experiences. Could this be gpt-4o-mini hallucinating, or is it LightRAG-related?

Processing Journeys_Out_of_the_Body.pdf: 100%|██████████| 150/150 [00:24<00:00, 6.06it/s]
INFO:lightrag:Creating a new event loop in a sub-thread.
INFO:lightrag:[New Docs] inserting 32 docs
INFO:lightrag:[New Chunks] inserting 32 chunks
INFO:lightrag:Inserting 32 vectors to chunks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:[Entity Extraction]...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

⠹ Processed 32 chunks, 56 entities(duplicated), 19 relations(duplicated)
INFO:lightrag:Inserting 55 vectors to entities
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 19 vectors to relationships
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Writing graph with 2268 nodes, 1122 edges
INFO:lightrag:Creating a new event loop in a sub-thread.
INFO:lightrag:[New Docs] inserting 32 docs
INFO:lightrag:[New Chunks] inserting 32 chunks
INFO:lightrag:Inserting 32 vectors to chunks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:[Entity Extraction]...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

DEBUG:lightrag:Trigger summary: "JOHN DOE" <----
⠹ Processed 32 chunks, 54 entities(duplicated), 27 relations(duplicated)
DEBUG:lightrag:Trigger summary: "ELON MUSK" <----
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 53 vectors to entities
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 27 vectors to relationships
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Writing graph with 2295 nodes, 1141 edges
INFO:lightrag:Creating a new event loop in a sub-thread.
INFO:lightrag:[New Docs] inserting 32 docs
INFO:lightrag:[New Chunks] inserting 32 chunks
INFO:lightrag:Inserting 32 vectors to chunks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:[Entity Extraction]...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
⠹ Processed 32 chunks, 47 entities(duplicated), 6 relations(duplicated)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
DEBUG:lightrag:Trigger summary: "NEW YORK" <----
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 47 vectors to entities
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 6 vectors to relationships
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Writing graph with 2321 nodes, 1146 edges
INFO:lightrag:Creating a new event loop in a sub-thread.
INFO:lightrag:[New Docs] inserting 32 docs
INFO:lightrag:[New Chunks] inserting 32 chunks
INFO:lightrag:Inserting 32 vectors to chunks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:[Entity Extraction]...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
DEBUG:lightrag:Trigger summary: "JOHN SMITH" <----
⠹ Processed 32 chunks, 59 entities(duplicated), 33 relations(duplicated)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 58 vectors to entities
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 33 vectors to relationships
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Writing graph with 2358 nodes, 1172 edges
INFO:lightrag:Creating a new event loop in a sub-thread.
INFO:lightrag:[New Docs] inserting 32 docs
INFO:lightrag:[New Chunks] inserting 32 chunks
INFO:lightrag:Inserting 32 vectors to chunks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:[Entity Extraction]...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
⠹ Processed 32 chunks, 46 entities(duplicated), 20 relations(duplicated)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
DEBUG:lightrag:Trigger summary: "NASA" <----
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 44 vectors to entities
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 20 vectors to relationships
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Writing graph with 2384 nodes, 1184 edges
INFO:lightrag:Creating a new event loop in a sub-thread.
INFO:lightrag:[New Docs] inserting 32 docs
INFO:lightrag:[New Chunks] inserting 32 chunks
INFO:lightrag:Inserting 32 vectors to chunks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:[Entity Extraction]...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
⠹ Processed 32 chunks, 81 entities(duplicated), 44 relations(duplicated)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
DEBUG:lightrag:Trigger summary: "JOHN DOE" <----
DEBUG:lightrag:Trigger summary: "NEW YORK CITY" <----
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 72 vectors to entities
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 44 vectors to relationships
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Writing graph with 2488 nodes, 1226 edges
INFO:lightrag:Creating a new event loop in a sub-thread.
INFO:lightrag:[New Docs] inserting 25 docs
INFO:lightrag:[New Chunks] inserting 25 chunks
INFO:lightrag:Inserting 25 vectors to chunks
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:[Entity Extraction]...
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 41 vectors to entities
⠴ Processed 25 chunks, 41 entities(duplicated), 13 relations(duplicated)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Inserting 13 vectors to relationships
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:lightrag:Writing graph with 2512 nodes, 1238 edges
Processing PDFs: 50%|█████ | 8/16 [21:12<24:04, 180.62s/it]
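
To see how often each of these names comes up across a run, the DEBUG lines can be tallied from a saved copy of the log. This is only a rough sketch: it assumes the output was captured as text and that the lines keep the `DEBUG:lightrag:Trigger summary: "..."` shape shown above. The function name `count_trigger_summaries` and the sample string are made up for illustration and are not part of LightRAG.

```python
import re
from collections import Counter

# Matches the DEBUG lines from the log above, e.g.
#   DEBUG:lightrag:Trigger summary: "JOHN DOE"
# The pattern is an assumption based on that output, not a documented format.
TRIGGER_RE = re.compile(r'DEBUG:lightrag:Trigger summary: "([^"]+)"')


def count_trigger_summaries(log_text: str) -> Counter:
    """Count how often each name appears in 'Trigger summary' debug lines."""
    return Counter(TRIGGER_RE.findall(log_text))


# Hypothetical excerpt; works whether entries are one per line or run together.
sample_log = (
    'DEBUG:lightrag:Trigger summary: "JOHN DOE" '
    "INFO:lightrag:Inserting 53 vectors to entities "
    'DEBUG:lightrag:Trigger summary: "ELON MUSK" '
    'DEBUG:lightrag:Trigger summary: "JOHN DOE"'
)

counts = count_trigger_summaries(sample_log)
for name, n in counts.most_common():
    print(f"{n:3d}  {name}")
```

Names that show up repeatedly across unrelated books are the ones worth checking against the source text.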

Feed-dev commented 3 weeks ago

Most likely gpt-4o-mini is hallucinating generic entities when attempting to categorize extracted information. How can I better control entity extraction and prevent this hallucination?

LarFii commented 2 weeks ago

Entity extraction is indeed a challenge, closely tied to the capabilities of LLMs. We are currently exploring improved extraction methods as well.
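
One possible stopgap, independent of the LLM used, is a grounding check: keep an extracted entity only if its name actually occurs in the chunk it was extracted from. The sketch below is not LightRAG's API. The helpers `normalize` and `split_grounded` and the sample chunk are hypothetical, and it assumes you can get at the extracted entity names together with the source chunk text before they are written to the graph.

```python
import re


def normalize(text: str) -> str:
    """Uppercase and collapse whitespace so matching ignores case and line breaks."""
    return re.sub(r"\s+", " ", text).strip().upper()


def split_grounded(entity_names, chunk_text):
    """Split names into those found verbatim in the chunk and those that are not."""
    haystack = normalize(chunk_text)
    grounded, ungrounded = [], []
    for name in entity_names:
        key = normalize(name)
        # An empty name can never be grounded; everything else is a substring test.
        if key and key in haystack:
            grounded.append(name)
        else:
            ungrounded.append(name)
    return grounded, ungrounded


# Made-up chunk and extraction result, for illustration only.
chunk = "Robert Monroe describes his first out-of-body experience in Virginia."
extracted = ["ROBERT MONROE", "VIRGINIA", "ELON MUSK", "NASA"]

kept, dropped = split_grounded(extracted, chunk)
print("kept:", kept)
print("dropped:", dropped)
```

A verbatim substring test is deliberately strict: it will also reject legitimate entities that the model paraphrased or abstracted (for example a concept that is described but never named in the chunk), so it may be safer to log the rejected names for review than to discard them silently.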