OpenPecha / rag_prep_tool

MIT License

RAG0015: Knowledge Graph creation #15

Open tenzin3 opened 1 month ago

tenzin3 commented 1 month ago

Description

Currently, for our RAG chatbot, we use a vector database to store the data (Dalai Lama books) in chunks. This task is to implement a feature that can extract a knowledge graph from a small piece of text.

Expected Output

Graph data (a JSON file) extracted from a small piece of text (the first page of My Land and My People).

Implementation Plan

Image

Work Items

tenzin3 commented 1 month ago

Installation Link

https://snapcraft.io/terminusdb

On Ubuntu, just run: sudo snap install terminusdb

How to set up a TerminusDB database

On Linux:

git clone https://github.com/terminusdb/terminusdb
cd terminusdb
make install-tus
make
make install-dashboard

terminusdb store init --key "my_password_here"
terminusdb serve

The server starts on http://127.0.0.1:6363. The default user name is admin.

More details are here

tenzin3 commented 1 month ago

TerminusDB tutorial

YouTube link: step-by-step tutorial

tenzin3 commented 1 month ago

Prompt to generate triplets

prompt = f"""
Extract all possible RDF triples from the following text:

1. Capture all key entities, relationships, and properties mentioned.
2. Ensure no relevant information is missed, including indirect or less obvious relationships.
3. Extract at least 1 triple from each sentence.
4. If a relationship can be expressed in multiple ways, generate multiple triples to capture all variations.
5. Give me the output as RDF format.

Text: {text}

Return only the RDF triples in RDF format, without any additional explanations or text.
"""

My Land and My People, Chapter One: Output

@prefix ex: <http://example.org/entities/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:DalaiLama14 a dbo:Person ;
    dbo:birthDate "1935-01-01"^^xsd:date ;
    dbo:birthPlace ex:Taktser ;
    dbo:title "Dalai Lama" .

ex:Taktser a dbo:Place ;
    dbo:location ex:Dokham ;
    dbo:altitude "9000"^^xsd:integer .

ex:Dokham a dbo:Region ;
    dbo:partOf dbr:Tibet ;
    dbo:description "Part of Tibet where mountains begin to descend to the plains of the east, towards China" .

ex:AmiChiri a dbo:Mountain ;
    dbo:location ex:Taktser ;
    rdfs:label "The Mountain which Pierces the Sky"@en ;
    dbo:higherThan ex:Taktser .

ex:KarmaSharTsongRidro a dbo:Monastery ;
    dbo:location ex:Taktser ;
    dbo:foundedBy ex:KarmaRolpaiDorje ;
    dbo:historicalSignificance "Place where Tsongkhapa was initiated as a monk" .

ex:KarmaRolpaiDorje a dbo:Person ;
    dbo:title "Fourth reincarnation of Karmapa" .

ex:AmdoIhakyung a dbo:Monastery ;
    dbo:location ex:Taktser .

ex:Kumbum a dbo:City ;
    dbo:near ex:Taktser .

ex:Sining a dbo:City ;
    dbo:near ex:Taktser .

ex:ThuptenGyatso a dbo:Person ;
    dbo:title "Thirteenth Dalai Lama" ;
    dbo:deathDate "1933-01-01"^^xsd:date .

ex:PotalaPalace a dbo:Building ;
    dbo:location dbr:Lhasa ;
    dbo:contains ex:GoldenMausoleum .

ex:GoldenMausoleum a dbo:Building ;
    dbo:description "Mausoleum built for Thupten Gyatso" .

ex:LhamoiLatso a dbo:Lake ;
    dbo:location dbr:Tibet ;
    dbo:significance "Lake where visions of future events appear" .

ex:Chokhorgyal a dbo:Place ;
    dbo:near ex:LhamoiLatso .

ex:KewtsangRinpoche a dbo:Person ;
    dbo:role "Leader in search for Dalai Lama's reincarnation" .

ex:LosangTsewang a dbo:Person ;
    dbo:role "Junior monastic official in search for Dalai Lama's reincarnation" .

ex:Norbulinka a dbo:Building ;
    dbo:location dbr:Lhasa ;
    dbo:description "Summer residence of the Dalai Lama" .

ex:RegentTibet a dbo:Position ;
    dbo:responsibility "Governing Tibet until new Dalai Lama matures" .

ex:NationalAssemblyTibet a dbo:Organization ;
    dbo:responsibility "Appointing the Regent of Tibet" .

ex:Thutopchu a dbo:River ;
    dbo:location dbr:Tibet .

ex:TraTsangLa a dbo:MountainPass ;
    dbo:location dbr:Tibet .

ex:Bumchen a dbo:Town ;
    dbo:location dbr:Tibet .

ex:PhuntsokDoeKhyel a dbo:Place ;
    dbo:location dbr:Lhasa .

ex:SiShiPhuntsok a dbo:Building ;
    dbo:location ex:PotalaPalace ;
    dbo:purpose "Enthronement ceremony of the Dalai Lama" .

ex:IronDragonYear a dbo:TimePeriod ;
    dbo:startDate "1940-01-01"^^xsd:date .

ex:PhuntsokDoeKhyel a dbo:Building ;
    dbo:location dbr:Lhasa ;
    dbo:purpose "First symbolic act of sovereignty by Fourteenth Dalai Lama"

tenzin3 commented 1 month ago

After meeting with @teny19

tenzin3 commented 1 month ago

@teny19

Using the prompt from here

Prompt

{"role": "system", "content": """# Knowledge Graph Instructions for GPT-4
                        ## 1. Overview
                        You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
                        - **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
                        - The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
                        ## 2. Labeling Nodes
                        - **Consistency**: Ensure you use basic or elementary types for node labels.
                        - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
                        - **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
                        ## 3. Handling Numerical Data and Dates
                        - Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.
                        - **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.
                        - **Property Format**: Properties must be in a key-value format.
                        - **Quotation Marks**: Never use escaped single or double quotes within property values.
                        - **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.
                        ## 4. Coreference Resolution
                        - **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
                        If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"), 
                        always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.  
                        Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial. 
                        ## 5. Strict Compliance
                        Adhere to the rules strictly. Non-compliance will result in termination."""},
            {"role":"user", "content": f"Use the given format to extract information from the following input: [INPUT START] {input} [INPUT END]"},
            {"role":"user","content":"Tip: Make sure to answer in the correct format"}],

Output JSON from Chapter One of My Land and My People

{
  "nodes": [
    {
      "id": "Dalai Lama",
      "type": "person",
      "birthDate": "1935",
      "birthPlace": "Taktser, Dokham, Tibet",
      "attributes": {
        "title": "Fourteenth Dalai Lama",
        "religion": "Buddhism"
      }
    },
    {
      "id": "Taktser",
      "type": "location",
      "attributes": {
        "description": "A small village in the northeast of Tibet, about 9,000 feet above sea level, surrounded by fertile fields and mountains."
      }
    },
    {
      "id": "Dokham",
      "type": "region",
      "attributes": {
        "description": "A district in the eastern part of Tibet, characterized by its lower valley areas merging into plains towards China."
      }
    },
    {
      "id": "Ami-chiri",
      "type": "mountain",
      "attributes": {
        "description": "Known locally as 'The Mountain which Pierces the Sky', regarded as the abode of the guardian deity of Taktser."
      }
    },
    {
      "id": "Karma Shar Tsong Ridro",
      "type": "monastery",
      "attributes": {
        "description": "A famous monastery in the religious history of Tibet, associated with Tsongkhapa and Karma Rolpai Dorje."
      }
    },
    {
      "id": "Amdo Ihakyung",
      "type": "monastery",
      "attributes": {
        "description": "A monastery located lower down from Karma Shar Tsong Ridro, notable for its gilded roofs and dharma chakra."
      }
    },
    {
      "id": "Thupten Gyatso",
      "type": "person",
      "birthDate": "1876",
      "deathDate": "1933",
      "attributes": {
        "title": "Thirteenth Dalai Lama",
        "contributions": "Defined Tibet's status as an independent nation and improved the welfare of its people."
      }
    },
    {
      "id": "Kumbum",
      "type": "monastery",
      "attributes": {
        "description": "One of the largest monasteries in the neighborhood of Taktser, involved in the search for the new Dalai Lama."
      }
    },
    {
      "id": "Lhamoi Latso",
      "type": "lake",
      "attributes": {
        "description": "A sacred lake in Tibet where visions of the future are believed to appear."
      }
    },
    {
      "id": "Potala Palace",
      "type": "palace",
      "attributes": {
        "location": "Lhasa, Tibet",
        "description": "The capital palace of Tibet, where the golden mausoleum for Thupten Gyatso was erected."
      }
    },
    {
      "id": "Norbulingka",
      "type": "residence",
      "attributes": {
        "description": "The summer residence of the Dalai Lama in Lhasa."
      }
    }
  ],
  "edges": [
    {
      "source": "Dalai Lama",
      "target": "Taktser",
      "relationship": "bornIn"
    },
    {
      "source": "Taktser",
      "target": "Dokham",
      "relationship": "locatedIn"
    },
    {
      "source": "Ami-chiri",
      "target": "Taktser",
      "relationship": "near"
    },
    {
      "source": "Karma Shar Tsong Ridro",
      "target": "Dokham",
      "relationship": "locatedIn"
    },
    {
      "source": "Amdo Ihakyung",
      "target": "Dokham",
      "relationship": "locatedIn"
    },
    {
      "source": "Thupten Gyatso",
      "target": "Potala Palace",
      "relationship": "buriedIn"
    },
    {
      "source": "Dalai Lama",
      "target": "Norbulingka",
      "relationship": "residedIn"
    }
  ]
}
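
A quick sanity check on output like the JSON above is to confirm that every edge points at a declared node. The sketch below assumes the `nodes`/`edges` layout with `id`, `source`, and `target` keys shown above; the function name `find_dangling_edges` and the small sample are hypothetical, not part of the tool.

```python
import json


def find_dangling_edges(graph):
    """Return edges whose source or target is not a declared node id."""
    node_ids = {node["id"] for node in graph.get("nodes", [])}
    return [
        edge
        for edge in graph.get("edges", [])
        if edge["source"] not in node_ids or edge["target"] not in node_ids
    ]


# Small sample in the same shape as the output above (illustrative only).
sample = json.loads("""
{
  "nodes": [{"id": "Dalai Lama", "type": "person"},
            {"id": "Taktser", "type": "location"}],
  "edges": [{"source": "Dalai Lama", "target": "Taktser", "relationship": "bornIn"},
            {"source": "Taktser", "target": "Dokham", "relationship": "locatedIn"}]
}
""")

# "Dokham" is referenced by an edge but missing from the node list.
dangling = find_dangling_edges(sample)
print(dangling)
```

An edge reported here usually means the model mentioned an entity in a relationship without emitting a node for it, which is worth catching before loading the graph into a database.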

tenzin3 commented 1 month ago

Problem faced

Previously, when we sent the content of a whole chapter to the prompt at once, the LLM (ChatGPT) did not generate enough triplets.

Solution (splitting method): Replace pronouns such as I, he, she, they, and it with their proper entity names, split the text into chunks, and send each chunk in a separate request to generate triplets.

{"role": "system", "content": """# Text Chunking Instructions for GPT-4

                        ## 1.**Modification**: Replace all pronouns such as "he," "she," "they," "it," "our," etc., with the corresponding entities to maintain clarity in the text without changing the context. For example, convert "David is a good guy.He was a teacher during his 20s." to "David is a good guy.David was a teacher during his 20s"
                        ## 2.**Reasonable Length**: Split the text into the chunks to a length of approximately 100 to 150 words. 
                        ## 3. Output Format
                        - **New Line Separation**: Provide each chunk as a separate line in the output.
                        """},
            {"role": "user", "content": f"Use the given format to split the following input: [INPUT START] {input} [INPUT END]"},
            {"role": "user", "content": "Tip: Ensure that each chunk is returned on a new line. And no pronouns are used."},],

Conclusion: For some chunks (Chunk 1) this produced very good results, but for others (Chunk 2) there was too little context, so it did not produce factual triplets.
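
For comparison, the length-based part of the splitting step can also be done without an LLM call. The sketch below groups whole sentences into chunks of at most `max_words` words using a naive regex sentence splitter. It does not do the pronoun replacement, and `split_into_chunks` and its parameter are hypothetical names, not part of the tool.

```python
import re


def split_into_chunks(text, max_words=150):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "David is a good guy. David was a teacher during David's 20s. "
    "David later moved to another city."
)
chunks = split_into_chunks(text, max_words=12)
for chunk in chunks:
    print(chunk)
```

Splitting on sentence boundaries keeps each chunk readable on its own, but it does not solve the low-context problem seen in Chunk 2.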

Chunk 1 triplets

{
  "nodes": [
    {
      "id": "Dalai Lama",
      "label": "person",
      "properties": {
        "birthDate": "1935",
        "birthPlace": "Taktser"
      }
    },
    {
      "id": "Taktser",
      "label": "location",
      "properties": {
        "description": "a small village in the northeast of Tibet",
        "altitude": "9000 feet above the sea"
      }
    },
    {
      "id": "Dokham",
      "label": "location",
      "properties": {
        "description": "a district in Tibet where the mountains begin to descend to the plains of the east, towards China",
        "meaning": "the lower part of a valley that merges into the plains (Do) and the eastern part of Tibet (Kham)"
      }
    },
    {
      "id": "Khampa",
      "label": "ethnic group",
      "properties": {
        "location": "Kham, eastern part of Tibet"
      }
    }
  ]
}

Chunk 2 triplets

{
  "nodes": [
    {
      "id": "village",
      "label": "village",
      "properties": {
        "location": "on a little plateau"
      }
    },
    {
      "id": "Ami-chiri",
      "label": "mountain",
      "properties": {
        "alternativeName": "The Mountain which Pierces the Sky",
        "description": "regarded as the abode of the guardian deity of the place",
        "features": "lower slopes covered by forests, above them a rich growth of grass, higher still bare rock, summit with a patch of snow which never melted"
      }
    },
    {
      "id": "plateau",
      "label": "plateau",
      "properties": {
        "surroundings": "encircled by fertile fields of wheat and barley",
        "location": "surrounded by ranges of hills covered by grass thick and vividly green"
      }
    }
  ],
  "relationships": [
    {
      "source": "village",
      "target": "plateau",
      "type": "locatedOn"
    },
    {
      "source": "Ami-chiri",
      "target": "village",
      "type": "southOf"
    }
  ]
}

tenzin3 commented 1 month ago

Finding

Though GPT-4o accepts up to 128k tokens, when asked to perform chunking it processed only the beginning of the input. Given an input file, it applied the instructions only to roughly the first 17-18 KB (16,534-17,396 characters) of the file.
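
One way to catch this kind of silent truncation is to compare the size of the chunked output with the size of the input. The sketch below is only a heuristic: `coverage_ratio`, `looks_truncated`, and the 0.8 threshold are hypothetical, and because pronoun replacement changes the word count slightly, the ratio is approximate.

```python
def coverage_ratio(original_text, chunks):
    """Approximate share of the input that reached the chunked output, by word count."""
    original_words = len(original_text.split())
    if original_words == 0:
        return 1.0
    chunk_words = sum(len(chunk.split()) for chunk in chunks)
    return chunk_words / original_words


def looks_truncated(original_text, chunks, threshold=0.8):
    """Flag outputs that contain noticeably fewer words than the input."""
    return coverage_ratio(original_text, chunks) < threshold


original = "one two three four five six seven eight nine ten"
full_output = ["one two three four five", "six seven eight nine ten"]
partial_output = ["one two three four"]

print(looks_truncated(original, full_output))     # False
print(looks_truncated(original, partial_output))  # True
```

If the check fires, the input can be split into smaller pieces before the chunking request is sent.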

tenzin3 commented 1 month ago

@teny19

The paper TEXTBOOK TO TRIPLES describes an approach in which, along with the text, the LLM is also given a glossary of terms from which it is supposed to create triplets. The authors also use techniques such as verifying with spaCy whether each relation/predicate is a verb.

From the paper, the following three points might help us generate triplets:

tenzin3 commented 1 month ago

Extracting entities from the first page of My Land and My People

result

['Taktser: Location', 'Ami-chiri: Location', 'China: Location', 'Dalai Lama: Person', 'Dokham: Location', 'The Mountain which Pierces the Sky: Location', 'Tibet: Location']
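
Each item in the result above is a "Name: Type" string, which still has to be turned into structured data before it can seed a graph. The sketch below shows one way to do that; `parse_entities` is a hypothetical helper, and the sample lines are taken from the result above.

```python
def parse_entities(lines):
    """Parse 'Entity Name: Entity Type' strings into a {name: type} dict."""
    entities = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and filler such as lone '.' lines.
        if ":" not in line:
            continue
        # Split on the last colon so names that contain ':' stay intact.
        name, entity_type = line.rsplit(":", 1)
        name, entity_type = name.strip(), entity_type.strip()
        if name and entity_type:
            entities[name] = entity_type
    return entities


result = [
    "Taktser: Location",
    "Ami-chiri: Location",
    "Dalai Lama: Person",
    "The Mountain which Pierces the Sky: Location",
]
entities = parse_entities(result)
print(entities)
```

For raw LLM text in the prompt's output format, the same function can be called on `response.splitlines()`.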

image

Image

prompt

prompt = f"""
            ## Objective:
            You are a top-tier algorithm designed for extracting all entities from the input text to build a knowledge graph.

            ## Instructions:
            -Extract key entities from the following text that are capable of having properties, and identify their types (e.g., Person, Organization, Location, Event, etc.). 
            -Exclude non-entity elements like dates or simple attributes. 
            -Other than the entities, don't include any other information.
            -The text content is book by Dalai Lama.So the pronoun 'I' refers to Dalai Lama.

            ## Output format:
            Entity Name: Entity Type
            Entity Name: Entity Type
            .
            .

            [INPUT TEXT START]
            {text}
            [INPUT TEXT END]

    """

tenzin3 commented 4 weeks ago

@ta4tsering and @10zinten, the way we installed TerminusDB through snap may not have been ideal.

Image

tenzin3 commented 2 weeks ago

Coreference resolution (CR) is the task of finding all linguistic expressions (called mentions) in a given text that refer to the same real-world entity. Put more simply, it means replacing each pronoun (it, he, our, she, her) with the entity it refers to.

Image

This article compared maverick-coref and spaCy coreferee, and found that coreferee performed better.

In our own comparison of spaCy coreferee with an LLM (ChatGPT), ChatGPT clearly performed better.

Spacy response

Image

LLM response

Image

Comparison

Spacy coreferee

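import coreferee  # imported so the 'coreferee' component is available to add_pipe
import spacy

# Load the pipeline once; reloading the model on every call is slow.
coref_nlp = spacy.load('en_core_web_md')
coref_nlp.add_pipe('coreferee')


def coref_text(text):
    """Return text with each coreferring mention replaced by what it refers to."""
    coref_doc = coref_nlp(text)
    resolved_tokens = []

    for token in coref_doc:
        # resolve() returns the tokens this mention refers to, or None.
        repres = coref_doc._.coref_chains.resolve(token)
        if repres:
            # Use the full named-entity span when the referent is part of one.
            resolved_tokens.append(" and ".join(
                t.text
                if t.ent_type_ == ""
                else [e.text for e in coref_doc.ents if t in e][0]
                for t in repres
            ))
        else:
            resolved_tokens.append(token.text)

    return " ".join(resolved_tokens)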

Important note: Due to a version conflict with coreferee, spaCy language models larger than en_core_web_md could not be used.

tenzin3 commented 1 week ago

Knowledge Graph Visualization

Input text: My Land and My People (first page)

Image

{
    "nodes": [
        {
            "label": "Taktser",
            "type": "Location",
            "properties": {
                "altitude": "9000 feet above the sea"
            }
        },
        {
            "label": "Tibet",
            "type": "Location"
        },
        {
            "label": "WoodHogYear",
            "type": "Event",
            "properties": {
                "gregorianCalendarYear": "1935"
            }
        },
        {
            "label": "Dokham",
            "type": "Location",
            "properties": {
                "description": "the lower part of a valley that merges into the plains"
            }
        },
        {
            "label": "Khampa",
            "type": "EthnicGroup"
        },
        {
            "label": "China",
            "type": "Location"
        },
        {
            "label": "AmiChiri",
            "type": "Location"
        },
        {
            "label": "TheMountainWhichPiercesTheSky",
            "type": "Location"
        },
        {
            "label": "DalaiLama",
            "type": "Person",
            "properties": {
                "birthDate": "fifth day of the fifth month of the Wood Hog Year"
            }
        }
    ],
    "relationships": [
        {
            "source": "Taktser",
            "target": "Tibet",
            "type": "IsIn"
        },
        {
            "source": "Dokham",
            "target": "Tibet",
            "type": "IsThePartOf"
        },
        {
            "source": "Dokham",
            "target": "China",
            "type": "BeginToDescend"
        },
        {
            "source": "AmiChiri",
            "target": "TheMountainWhichPiercesTheSky",
            "type": "IsCalled"
        },
        {
            "source": "TheMountainWhichPiercesTheSky",
            "target": "Taktser",
            "type": "IsSurroundedBy"
        },
        {
            "source": "DalaiLama",
            "target": "Taktser",
            "type": "WasBornIn"
        },
        {
            "source": "DalaiLama",
            "target": "WoodHogYear",
            "type": "WasOn"
        }
    ]
}
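
The rendering code is not shown in this thread. As one possible approach, the sketch below converts a graph in the `nodes`/`relationships` layout above into Graphviz DOT text using only the standard library. `quote` and `to_dot` are hypothetical helpers, and rendering the DOT text needs Graphviz or a similar tool.

```python
def quote(value):
    """Wrap a value in double quotes, escaping characters that would break a DOT string."""
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(graph):
    """Convert a {'nodes': [...], 'relationships': [...]} graph into DOT text."""
    lines = ["digraph KnowledgeGraph {"]
    for node in graph.get("nodes", []):
        name = node["label"]
        # Show the entity type next to its name in the rendered node.
        caption = f"{name} ({node['type']})"
        lines.append(f"  {quote(name)} [label={quote(caption)}];")
    for rel in graph.get("relationships", []):
        lines.append(
            f"  {quote(rel['source'])} -> {quote(rel['target'])} [label={quote(rel['type'])}];"
        )
    lines.append("}")
    return "\n".join(lines)


# Two nodes and one relationship taken from the JSON above.
graph = {
    "nodes": [
        {"label": "Taktser", "type": "Location"},
        {"label": "Tibet", "type": "Location"},
    ],
    "relationships": [
        {"source": "Taktser", "target": "Tibet", "type": "IsIn"},
    ],
}
dot = to_dot(graph)
print(dot)
```

Writing the graph as DOT text keeps the visualization step separate from extraction, so the same JSON can be rendered with different layouts.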

tenzin3 commented 1 week ago

More modifications:
- Adjusted the spacing in the graph visualization.
- Rather than extracting possible relations first and then choosing from them when building the triplets, gave the LLM more flexibility to choose its own relations based on the given entities. The result is much better.

Image

{
    "nodes": [
        {
            "label": "AmiChiri",
            "type": "Location",
            "attributes": {
                "description": "The Mountain which Pierces the Sky, regarded as the abode of the guardian deity of the place",
                "elevation": "9000 feet above the sea",
                "vegetation": "forests, rich growth of grass, bare rock, snow on the summit",
                "wildlife": "junipers, poplars, peaches, plums, walnuts, berries, scented flowers, deer, wild asses, monkeys, leopards, bears, foxes"
            }
        },
        {
            "label": "China",
            "type": "Location"
        },
        {
            "label": "DalaiLama",
            "type": "Person",
            "attributes": {
                "birthDate": "1935",
                "birthPlace": "Taktser"
            }
        },
        {
            "label": "Dokham",
            "type": "Location",
            "attributes": {
                "description": "The lower part of a valley that merges into the plains, eastern part of Tibet"
            }
        },
        {
            "label": "Khampa",
            "type": "EthnicGroup",
            "attributes": {
                "location": "Dokham"
            }
        },
        {
            "label": "Taktser",
            "type": "Location",
            "attributes": {
                "description": "A small village in the northeast of Tibet",
                "elevation": "9000 feet above the sea",
                "environment": "surrounded by fertile fields of wheat and barley, encircled by ranges of hills"
            }
        },
        {
            "label": "Tibet",
            "type": "Location"
        },
        {
            "label": "WoodHogYear",
            "type": "Event",
            "attributes": {
                "date": "1935"
            }
        }
    ],
    "edges": [
        {
            "source": "DalaiLama",
            "target": "Taktser",
            "relation": "WasBornIn"
        },
        {
            "source": "Taktser",
            "target": "Dokham",
            "relation": "IsLocatedIn"
        },
        {
            "source": "Dokham",
            "target": "Tibet",
            "relation": "IsPartOf"
        },
        {
            "source": "Khampa",
            "target": "Dokham",
            "relation": "LivesIn"
        },
        {
            "source": "Dokham",
            "target": "China",
            "relation": "DescendsTo"
        },
        {
            "source": "DalaiLama",
            "target": "WoodHogYear",
            "relation": "WasBornIn"
        },
        {
            "source": "AmiChiri",
            "target": "Taktser",
            "relation": "IsSouthOf"
        }
    ]
}
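
The JSON outputs in this thread do not share one schema: the edge list is named `edges` or `relationships`, and the predicate is stored under `relationship`, `type`, or `relation`. A small normalizer makes the outputs easier to compare. The sketch below is one possible way to do it; `to_triples` is a hypothetical helper, and the two samples are single edges copied from the outputs above.

```python
def to_triples(graph):
    """Flatten a graph dict into (subject, predicate, object) tuples.

    Handles the key variants seen in this thread: edge lists named 'edges'
    or 'relationships', and predicates stored under 'relation',
    'relationship', or 'type'.
    """
    edges = graph.get("edges") or graph.get("relationships") or []
    triples = []
    for edge in edges:
        predicate = edge.get("relation") or edge.get("relationship") or edge.get("type")
        triples.append((edge["source"], predicate, edge["target"]))
    return triples


newer_output = {"edges": [{"source": "Dokham", "target": "Tibet", "relation": "IsPartOf"}]}
older_output = {"relationships": [{"source": "Taktser", "target": "Tibet", "type": "IsIn"}]}

print(to_triples(newer_output))
print(to_triples(older_output))
```

Fixing a single schema in the prompt would remove the need for this step, but a normalizer is useful while the prompts are still changing.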