RAG0017: Clean Knowledge Graph Data

tenzin3 commented 1 month ago

Description

Knowledge graph triples are generated by providing prompts to LLMs. Due to constraints like context length and the need for better output quality, the unstructured text is processed in smaller chunks rather than all at once. As a result, a large amount of fragmented graph data is produced. In this scenario, the processes of collating, deduplicating, and eliminating similar relations and entities become crucial to ensure accuracy and efficiency.

Expected Output

A deduplicated and consolidated knowledge graph with unique entities and relations, ensuring clarity and eliminating redundancy.

Implementation Plan

[ ] combine all graphs
[ ] filter with string similarity
[ ] filter with embedding similarity checking
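
The first two steps of the plan could look roughly like the sketch below. It is only an illustration, not the toolkit's implementation: it assumes each chunk's graph is a list of `(head, relation, tail)` string triples, and the names `normalize`, `canonical_map`, `merge_graphs` and the `0.9` threshold are all hypothetical. String similarity is computed with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Assumption: a graph is a list of (head, relation, tail) string triples.
Triple = tuple[str, str, str]


def normalize(name: str) -> str:
    """Lowercase, treat '_' and '-' as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def is_similar(a: str, b: str, threshold: float) -> bool:
    """True when the normalized strings have a similarity ratio >= threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def canonical_map(names: list[str], threshold: float) -> dict[str, str]:
    """Map every name to the first previously seen name it is similar to."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for name in names:
        if name in mapping:
            continue
        match = next((c for c in canonical if is_similar(name, c, threshold)), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


def merge_graphs(chunk_graphs: list[list[Triple]], threshold: float = 0.9) -> list[Triple]:
    """Combine per-chunk graphs and drop triples that become duplicates."""
    triples = [t for graph in chunk_graphs for t in graph]
    entity_map = canonical_map([e for h, _, t in triples for e in (h, t)], threshold)
    relation_map = canonical_map([r for _, r, _ in triples], threshold)

    merged: list[Triple] = []
    seen: set[Triple] = set()
    for head, relation, tail in triples:
        triple = (entity_map[head], relation_map[relation], entity_map[tail])
        if triple not in seen:
            seen.add(triple)
            merged.append(triple)
    return merged


# Made-up example data, for illustration only.
chunk_graphs = [
    [("Lhasa", "located_in", "Tibet")],
    [("lhasa ", "located in", "Tibet"), ("Potala Palace", "located_in", "Lhasa")],
]
merged = merge_graphs(chunk_graphs)
```

The pairwise comparison is quadratic in the number of distinct names, so a large graph would probably need blocking or indexing before this step. The threshold is a guess and would have to be tuned on real data.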

tenzin3 commented 1 month ago

Methods to clean the knowledge graph

Perform a string similarity check when collating nodes and relations into one combined knowledge graph.
Convert node names (name: string) into embeddings using the fine-tuned embedding model, then run a cosine similarity check to find similar nodes.
Check for overlapping relations and properties.
Keep a human in the loop for the final quality review.
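
A minimal sketch of the embedding similarity step, under stated assumptions: the fine-tuned embedding model is not shown here, so the vectors below are made-up toy values standing in for its output, and `cosine_similarity`, `similar_node_pairs` and the `0.9` threshold are hypothetical names and settings. The function only returns candidate pairs rather than merging them, so they can go to the human reviewer.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_node_pairs(
    names: list[str],
    embeddings: list[list[float]],
    threshold: float = 0.9,
) -> list[tuple[str, str, float]]:
    """Return (name_a, name_b, similarity) for every pair at or above threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = cosine_similarity(embeddings[i], embeddings[j])
            if score >= threshold:
                pairs.append((names[i], names[j], score))
    return pairs


# Toy vectors standing in for the fine-tuned model's embeddings (assumption).
names = ["Lhasa", "Lhasa city", "Mount Kailash"]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
candidates = similar_node_pairs(names, embeddings)
```

With real embeddings the all-pairs loop would likely be replaced by a vectorized or approximate nearest-neighbour search, but the comparison itself stays the same.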

tenzin3 commented 1 month ago

Graph Schema (uncleaned):

Observation:

The schema has only 15 entity types in total, which I believe is simple and good.
Some entity types (Structure and Deity) have few nodes associated with them, while others (Location) have a huge number of nodes.
The relations need a lot of filtering and cleaning.

OpenPecha / toolkit-v2