RAG0016: Populate Knowledge Graph on Graph Database

tenzin3 commented 2 months ago

Description

This project populates a knowledge graph in a graph database. The aim is to store triples and structured data, which represent entities and their relationships, in the graph database.

Expected Output

The ability to store a knowledge graph in a JSON file.
The creation of a schema within the graph database that accurately represents the knowledge graph structure.
The capability to efficiently retrieve information from the knowledge graph based on user queries.

Implementation Plan

[x] Choose a graph database.
[ ] Create a schema within the graph database.
[x] Insert data into the graph database.
[ ] Retrieve data based on user requests.
[ ] Update the schema as needed when new data is added.
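
For the schema step, one possible starting point is to derive index and uniqueness-constraint statements from the node labels in use. The sketch below only builds the Cypher strings and does not run them. The `CREATE INDEX ON :Label(prop)` and `CREATE CONSTRAINT ON (n:Label) ASSERT n.prop IS UNIQUE` forms are assumed to follow Memgraph's syntax (Neo4j uses a different one), so they should be checked against the documentation of whichever database is chosen. The helper name `schema_statements` and the `name` key property are hypothetical.

```python
import re

# Labels are interpolated into Cypher text, so restrict them to plain identifiers.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def schema_statements(labels, key="name"):
    """Build index and uniqueness-constraint statements for each node label.

    Assumption: Memgraph-style syntax; adjust for other databases.
    """
    statements = []
    for label in sorted(set(labels)):
        if not _IDENTIFIER.fullmatch(label):
            raise ValueError(f"Unsafe label: {label!r}")
        statements.append(f"CREATE INDEX ON :{label}({key});")
        statements.append(
            f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{key} IS UNIQUE;"
        )
    return statements


for statement in schema_statements(["Entity"]):
    print(statement)
```

When new data introduces a label that has not been seen before, the same helper can be called again with the new label to produce the extra statements.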

tenzin3 commented 2 months ago

TerminusDB was our first choice because it is a graph database that supports data versioning. However, its team has shifted focus to other projects, and given the small community, we decided to move away from TerminusDB.

tenzin3 commented 2 months ago

There are many graph database options available, but very few offer a free community edition. One that does, and that has one of the largest user communities among graph databases, is Neo4j.

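Another interesting option is Memgraph, which has the following features:

Compatible with Neo4j (it accepts Cypher over the Bolt protocol, so the neo4j Python driver can connect to it)
Uses less memory than Neo4j, according to Memgraph's own benchmarks
Can run up to 120 times faster than Neo4j, according to Memgraph's own benchmarks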

tenzin3 commented 2 months ago

Useful Cypher queries for Memgraph Lab

Show all entities: MATCH (n) RETURN n;
Show all entities with relations: MATCH (n)-[r]->(m) RETURN n, r, m;
Delete all data: MATCH (n) DETACH DELETE n;

tenzin3 commented 2 months ago

Graph Visualization from the Memgraph Lab

Insert knowledge graph triplets

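import re

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"
AUTH = ("", "")  # empty credentials: no authentication configured

# Relationship types cannot be passed as query parameters, so they are
# interpolated into the query text and must be validated first.
RELATION_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def insert_triplets(triplets):
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            for head, relation, tail in triplets:
                if not RELATION_PATTERN.fullmatch(relation):
                    raise ValueError(f"Invalid relation name: {relation!r}")
                # MERGE keeps nodes and edges unique across repeated runs.
                session.run(
                    "MERGE (h:Entity {name: $head}) "
                    "MERGE (t:Entity {name: $tail}) "
                    f"MERGE (h)-[:{relation}]->(t)",
                    head=head, tail=tail
                )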

triplets = [
    ("DalaiLama", "WasBornIn", "Taktser"),
    ("Taktser", "isLocatedIn", "Dokham"),
    ("Dokham", "isPartOf", "Tibet"),
    ("Khampa", "LivesIn", "Dokham"),
    ("Dokham", "DescendsTo", "China"),
    ("DalaiLama", "WasBornIn", "WoodHogYear"),
    ("AmiChiri", "IsSouthOf", "Taktser"),
]

insert_triplets(triplets)

Fetch knowledge graph triplets

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"
AUTH = ("", "")

def fetch_data():
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
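            # Alias the returned columns so the record keys are short and stable.
            result = session.run(
                "MATCH (h)-[r]->(t) "
                "RETURN h.name AS head, type(r) AS relation, t.name AS tail"
            )
            for record in result:
                print(record["head"], record["relation"], record["tail"])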

fetch_data()
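
For the "retrieve data based on user requests" item, the query can be narrowed to a single entity by passing its name as a parameter. This is a minimal sketch, assuming nodes carry a `name` property as in the insert script above. `fetch_relations` is a hypothetical helper; it accepts any object with a `run(query, **params)` method, so it can be tried with the stand-in `FakeSession` below without a running database.

```python
QUERY = (
    "MATCH (h {name: $name})-[r]->(t) "
    "RETURN h.name AS head, type(r) AS relation, t.name AS tail"
)


def fetch_relations(session, name):
    """Return (head, relation, tail) tuples for edges leaving the named node."""
    result = session.run(QUERY, name=name)
    return [(rec["head"], rec["relation"], rec["tail"]) for rec in result]


class FakeSession:
    """Stand-in for a driver session, used only to show the call shape."""

    def __init__(self, triplets):
        self.triplets = triplets

    def run(self, query, **params):
        # Mimic the query: keep only edges whose head matches the given name.
        return [
            {"head": h, "relation": r, "tail": t}
            for h, r, t in self.triplets
            if h == params["name"]
        ]


session = FakeSession([
    ("DalaiLama", "WasBornIn", "Taktser"),
    ("Taktser", "isLocatedIn", "Dokham"),
])
print(fetch_relations(session, "Taktser"))
```

With the real driver, the `session` obtained from `driver.session()` would be passed in place of `FakeSession`.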

tenzin3 commented 2 months ago

Graph Visualization from the Memgraph Lab

Graph Data schema

Insert Knowledge Graph triplets with Properties

Data is from here

import json

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"
AUTH = ("", "")

def insert_triplets(data):
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            # Insert nodes: the node type becomes the label, and the
            # attributes plus the label text become the node properties.
            for node in data['nodes']:
                entity_type = node['type']
                # Copy so the loaded data is not modified in place.
                properties = dict(node.get('attributes', {}))
                properties['name'] = node['label']
                # Labels cannot be parameterized, so the type is interpolated
                # into the query; it must come from trusted data.
                session.run(f"CREATE (n:{entity_type} $props)", {'props': properties})

            # Insert edges between nodes matched by their name property.
            for edge in data['edges']:
                source = edge['source']
                target = edge['target']
                relation = edge['relation']
                session.run(
                    "MATCH (a {name: $source}), (b {name: $target}) "
                    f"CREATE (a)-[:{relation}]->(b)",
                    {'source': source, 'target': target}
                )

with open('kg_data.json', 'r') as file:
    data = json.load(file)

insert_triplets(data)
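
The loader above expects `kg_data.json` to contain a `nodes` list (each entry with `type`, `label`, and optional `attributes`) and an `edges` list (each entry with `source`, `target`, and `relation`). Since the data file itself is not shown here, the following is only a sketch of how plain triplets could be converted into that shape. The `triplets_to_kg` helper and the default `Entity` type are assumptions for illustration.

```python
import json


def triplets_to_kg(triplets, default_type="Entity"):
    """Convert (head, relation, tail) triplets into a nodes/edges dictionary."""
    labels = []
    for head, _, tail in triplets:
        for label in (head, tail):
            if label not in labels:  # keep first-seen order, drop duplicates
                labels.append(label)
    return {
        "nodes": [
            {"type": default_type, "label": label, "attributes": {}}
            for label in labels
        ],
        "edges": [
            {"source": head, "target": tail, "relation": relation}
            for head, relation, tail in triplets
        ],
    }


kg = triplets_to_kg([
    ("DalaiLama", "WasBornIn", "Taktser"),
    ("Taktser", "isLocatedIn", "Dokham"),
])
print(json.dumps(kg, indent=2))
```

Writing the result with `json.dump(kg, file)` would produce a file that the loader can read back.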

tenzin3 commented 2 months ago

@teny19's suggestions: methods to clean the knowledge graph

Perform string similarity checks when collating nodes and relations into one large knowledge graph.
Convert nodes (name: string) into embeddings using the fine-tuned embedding model, then run a cosine similarity check to find similar nodes.
Check for overlapping relations and properties.
Keep a human in the loop for the final quality review.
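
The first two suggestions could be prototyped roughly as follows. This is a sketch rather than the agreed method: it uses `difflib.SequenceMatcher` from the standard library for string similarity and a plain cosine similarity over precomputed vectors. The fine-tuned embedding model is not specified in this thread, so the vectors, thresholds, and helper names below are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from math import sqrt


def similar_names(names, threshold=0.85):
    """Return pairs of node names whose string similarity meets the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similar_embeddings(embeddings, threshold=0.9):
    """Return pairs of node names whose embedding vectors are close."""
    return [
        (a, b)
        for a, b in combinations(embeddings, 2)
        if cosine_similarity(embeddings[a], embeddings[b]) >= threshold
    ]


names = ["DalaiLama", "Dalai Lama", "Taktser", "Dokham"]
print(similar_names(names))

# Placeholder vectors; real ones would come from the embedding model.
vectors = {"DalaiLama": [1.0, 0.0], "Dalai Lama": [0.9, 0.1], "Tibet": [0.0, 1.0]}
print(similar_embeddings(vectors))
```

Pairs flagged by either check would only be merge candidates; the final decision stays with the human reviewer.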

Test the methods on 3-5 pages first; if the results are satisfactory, move on to one chapter and then to the whole book.

OpenPecha / rag_prep_tool