AuvaLab / itext2kg

Incremental Knowledge Graphs Constructor Using Large Language Models
GNU Lesser General Public License v2.1
546 stars 52 forks

Good stuff improve the relationships #3

Closed vtempest closed 1 month ago

vtempest commented 1 month ago

https://airesearch.wiki: check out the topic model, which extracts phrases from wiki.

How are you extracting relationship summaries and updating the edges?

Check out this overview of the graph networks paper (Battaglia et al., 2018): https://omar-hussein.medium.com/relational-inductive-biases-deep-learning-and-graph-networks-by-battaglia-et-al-2018-overview-8c4fc89395bb

lairgiyassir commented 1 month ago

Hello,

Thank you for these additional references. I refer you to Section 3 of our article.

The Incremental Relations Extractor (iRelations Matcher) is a module designed to extract relationships between entities dynamically and iteratively.

The iRelations Matcher takes the Local (or Global) Document Entities (a set of entities extracted from (all) semantic blocks) as context and combines them with each new Semantic Block.

Two Different Contexts for Relation Extraction: The iRelations Matcher can use two types of context for extracting relationships:

a. Global Entities as Context: When the global entities (all entities collected across all documents) are used as context, the LLM is prompted to identify relationships that are directly stated in the Semantic Block as well as those that are only implied.

Advantages: This method allows the LLM to identify not only the obvious relationships but also those that are suggested or implied, potentially enriching the knowledge graph with new information.

Disadvantages: It can lead to an increased number of irrelevant or spurious relations, since the model might infer relationships that are not directly supported by the text in the Semantic Block.

b. Local Entities as Context: When locally matched entities (entities matched with the global set from the current Semantic Block only) are used as context, the LLM is more constrained and extracts only relationships that are directly stated within the current Semantic Block.

Advantages: This approach reduces the number of irrelevant relationships, making the extracted relationships more precise and contextually accurate.

Disadvantages: It may limit the richness and comprehensiveness of the knowledge graph because fewer implied relationships are extracted.

Iterative Extraction Process: After processing, the extracted relationships are added to the global set of relationships, continuously enriching and updating the knowledge graph.
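
As a rough illustration of the two context modes above, here is a minimal sketch. It is not the iText2KG implementation: the function names, the prompt wording, and the `fake_llm` stand-in are all assumptions made for the example.

```python
from typing import Callable, Iterable

Triple = tuple[str, str, str]  # (head entity, relation label, tail entity)


def build_prompt(block: str, context_entities: Iterable[str]) -> str:
    """Assemble a relation-extraction prompt; the wording is illustrative only."""
    names = ", ".join(sorted(context_entities))
    return f"Entities: {names}\nText: {block}\nList the relationships between these entities."


def extract_relations(
    block: str,
    local_entities: Iterable[str],
    global_entities: Iterable[str],
    llm: Callable[[str], list[Triple]],
    use_global_context: bool = False,
) -> set[Triple]:
    """Extract relations for one semantic block under the chosen context."""
    context = set(global_entities) if use_global_context else set(local_entities)
    triples = llm(build_prompt(block, context))
    # Guard: drop any triple whose endpoints fall outside the chosen context.
    return {t for t in triples if t[0] in context and t[2] in context}


def fake_llm(prompt: str) -> list[Triple]:
    """Hypothetical stand-in for a real model: one stated and one implied relation."""
    return [("Alice", "works_at", "Acme"), ("Alice", "colleague_of", "Bob")]


block = "Alice joined Acme in 2020."
local_entities = {"Alice", "Acme"}          # entities matched in this block
global_entities = {"Alice", "Acme", "Bob"}  # entities from all blocks so far

local_result = extract_relations(block, local_entities, global_entities, fake_llm)
global_result = extract_relations(
    block, local_entities, global_entities, fake_llm, use_global_context=True
)
```

With the local context only the stated `works_at` relation survives. With the global context the implied `colleague_of` relation is kept too, which is exactly the kind of relation that may enrich the graph or turn out to be spurious.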

vtempest commented 1 month ago

Thanks for explaining 🙂 You've really thought a lot about graph theory, and I would be excited to share and collaborate more on the seektopic algorithm while growing my qwksearch.com startup in SF with startup accelerator funding. I'm looking to integrate infranodus, cosmograph, asknews, and this itext2kg for a UMAP-based graph.

https://youtu.be/9WFUF13zItw?si=z8hz7oYXrZ6og1Yp

https://arxiv.org/abs/2409.03155 DoG addresses two significant challenges in existing methods: (1) long reasoning paths that distract from answer generation, and (2) false-positive relations that hinder path refinement. To overcome these challenges, DoG uses a subgraph-focusing mechanism that allows LLMs to perform answer attempts after each reasoning step, reducing the impact of lengthy reasoning paths. Additionally, DoG employs a multi-role debate team to simplify complex questions, reducing the influence of false-positive relations.

====

To summarize, your answer is "we let the LLM decide the connection label and hope for the best." Getting the connection links is the hard part. We can't trust LLMs to make up a label connecting 2 nodes, since everything has to be grounded in a supporting reference citation. Your answer raises the question of how you decide what is relevant context to let the LLM decide that label. What if the text is super long? Say you have 500 nodes: do you know how many times you'd have to ask the LLM for each connection? How does this scale to millions of nodes in user text content? How will it know when to update that connection when new data comes in?

MY SOLUTION: That's the whole point of the seektopic extraction algorithm, which combines TextRank weighted graph theory and LDA topic term pairings to extract sentences and weight how much they centralize key phrases, which are the emergent topic labels and node entities. This is better than NNs like GLiNER or Transformers.js, which do not give weights and sentences, just the proper-noun entities.

My big impact breakthrough: I am using Wikipedia pages for the titles of the possible entities, and the linked pages mentioned on that entity's page as the second order of possible connections. Then I use those as candidates and find the weighted sentences that most centralize those 2 topics.
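
A toy sketch of the candidate-pair idea above, not the seektopic algorithm itself. The `links` dictionary is hypothetical hand-written data standing in for Wikipedia page links, and the sentence weight is a simple "both titles mentioned, shorter is better" heuristic swapped in for the TextRank and LDA weighting.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_pairs(links: dict[str, list[str]], text: str) -> set[tuple[str, str]]:
    """Pairs (title, linked title) where both titles are mentioned in the text."""
    lowered = text.lower()
    pairs = set()
    for title, linked_titles in links.items():
        if title.lower() not in lowered:
            continue
        for linked in linked_titles:
            if linked.lower() in lowered:
                pairs.add(tuple(sorted((title, linked))))
    return pairs


def best_sentence(pair: tuple[str, str], sentences: list[str]):
    """Return the highest-weighted sentence mentioning both titles, or None."""
    a, b = (p.lower() for p in pair)
    scored = []
    for sentence in sentences:
        low = sentence.lower()
        if a in low and b in low:
            # Toy weight (assumption): shorter co-mentioning sentences rank higher.
            scored.append((1.0 / len(low.split()), sentence))
    return max(scored)[1] if scored else None


# Hypothetical link data: page title -> titles linked from that page.
links = {"Marie Curie": ["Radium", "Sorbonne"], "Radium": ["Marie Curie"]}
text = (
    "Marie Curie was born in Warsaw. "
    "Marie Curie discovered Radium in 1898. "
    "Radium is a radioactive element."
)

pairs = candidate_pairs(links, text)
evidence = {pair: best_sentence(pair, split_sentences(text)) for pair in pairs}
```

Each candidate edge ends up with a supporting sentence from the text, which is the grounding requirement raised earlier in this comment.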

lairgiyassir commented 1 month ago

Thank you for your suggestion. I will dive deeper into the SeekTopic algorithm. iText2KG is still in its early stages of knowledge graph construction. We haven't implemented reasoning on graphs yet, but it is part of our research!

"We let the LLM decide the connection label and hope for the best." The short answer is yes, but the LLM extracts nodes and relationships based on "distilled semantic blocks." These blocks are derived from a blueprint, which means the LLM is "guided" and "biased" towards specific important aspects during construction. As a result, the user effectively consents to the nodes and relations being extracted based on that blueprint. This is the main objective of the first module, the "Document Distiller."

"How do you determine the relevant context for the LLM to decide on the label?" Again, this is the core function of the Document Distiller. The user customizes the relevant context by selecting the sections they deem important for the LLM to process for KG construction. This flexibility in blueprint selection allows the algorithm to handle various use cases. In a Medium article, Anthony Alcaraz (https://medium.com/codex/universal-continuous-knowledge-graph-builder-a-new-paradigm-for-information-structuring-2559773ea0cf) suggested enhancing the Document Distiller by using "propositions." He cited the "Dense X Retrieval" paper (https://arxiv.org/pdf/2312.06648), which defines propositions as "atomic expressions within the text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format." In other words, a proposition is the smallest unit of information conveying a complete piece of knowledge. I agree with him: if we could distill the input documents by selecting the most relevant parts and extracting these atomic expressions, KG construction would be simpler, as the information would be direct, concise, and less complex.

"What if the text is very long? If you have 500 nodes, how many times would you need to ask the LLM for each connection? How does this scale to millions of nodes? How does it know when to update connections when new data comes in?" For long texts, we should not pass everything to the LLM at once unless it is already distilled with clear separation between semantic blocks (e.g., a distilled CV). Instead, we pass the semantic blocks iteratively to the LLM for KG construction. It’s unlikely that a single semantic block would be very long, so using iText2KG, the graph is constructed incrementally, step by step, rather than in one go. This incremental approach allows the LLM to process each node and relation carefully without being overwhelmed by excessive information.

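To put numbers on the 500-node question, here is a small worked comparison. It assumes one relation-extraction call per semantic block, which is a simplification for illustration and not a measured property of iText2KG. The block count of 40 is also hypothetical.

```python
import math


def pairwise_queries(num_nodes: int) -> int:
    """LLM calls needed if every unordered pair of nodes were queried separately."""
    return math.comb(num_nodes, 2)


def per_block_queries(num_blocks: int, calls_per_block: int = 1) -> int:
    """LLM calls needed if relations are extracted once per semantic block.

    calls_per_block is an assumption; retries or extra passes would raise it.
    """
    return num_blocks * calls_per_block


pairwise = pairwise_queries(500)   # 500 * 499 / 2 = 124750
per_block = per_block_queries(40)  # 40 hypothetical semantic blocks -> 40 calls
print(pairwise, per_block)
```

Under these assumptions the cost grows with the number of semantic blocks rather than with the square of the number of nodes, although each call still has to carry its entity context.
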
Graph Update Methodology: In the latest release (https://github.com/AuvaLab/itext2kg/releases/tag/v0.0.4), I added the capability to construct a graph based on pre-existing embedded nodes and relations. You can also check this use-case example: https://github.com/AuvaLab/itext2kg/blob/main/examples/different_llm_models.ipynb.

Let’s say we have an existing graph (nodes and relationships) and want to add new semantic blocks to it. First, we extract nodes and relationships incrementally from the new blocks (as described in the article). Once this new graph is constructed, the goal is to update the existing one with it. We match entities between the two graphs, replace the matched entities in the relations list, and then match relations, only updating the relation label (without replacing all relations).
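
The update steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: entities are matched by cosine similarity of toy 2-D embeddings with a hypothetical threshold of 0.8, relations are keyed by their endpoints, and a matching relation takes the newer label.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_entities(new_entities, existing_entities, threshold=0.8):
    """Map each new entity name to its best existing match, or to itself."""
    mapping = {}
    for name, vec in new_entities.items():
        best_name, best_score = name, threshold
        for ex_name, ex_vec in existing_entities.items():
            score = cosine(vec, ex_vec)
            if score >= best_score:
                best_name, best_score = ex_name, score
        mapping[name] = best_name
    return mapping


def merge_graphs(existing_entities, existing_relations, new_entities, new_relations,
                 threshold=0.8):
    """Merge a new graph into an existing one; relations map (head, tail) -> label."""
    mapping = match_entities(new_entities, existing_entities, threshold)
    merged_entities = dict(existing_entities)
    for name, vec in new_entities.items():
        merged_entities.setdefault(mapping[name], vec)  # add only unmatched entities
    merged_relations = dict(existing_relations)
    for (head, tail), label in new_relations.items():
        # Replace matched entity names, then update the label for that edge only.
        key = (mapping.get(head, head), mapping.get(tail, tail))
        merged_relations[key] = label
    return merged_entities, merged_relations


# Toy data (hypothetical embeddings chosen by hand).
existing_entities = {"Marie Curie": [1.0, 0.0], "Sorbonne": [0.0, 1.0]}
existing_relations = {("Marie Curie", "Sorbonne"): "studied_at"}
new_entities = {
    "M. Curie": [0.98, 0.05],
    "Sorbonne University": [0.05, 0.98],
    "Nobel Prize": [0.7, 0.7],
}
new_relations = {
    ("M. Curie", "Sorbonne University"): "taught_at",
    ("M. Curie", "Nobel Prize"): "won",
}

merged_entities, merged_relations = merge_graphs(
    existing_entities, existing_relations, new_entities, new_relations
)
```

Here "M. Curie" and "Sorbonne University" resolve to existing nodes, so the existing edge only gets a new label, while "Nobel Prize" falls below the threshold and is added as a new node with a new edge.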

This process can certainly be further improved! Regarding the final part of your question, scaling and speed are indeed challenges for iText2KG, but we're actively working on solutions.

"I'm using Wikipedia pages for the titles of possible entities and the linked pages on each entity’s page as a second order of possible connections." Yes, that’s an intelligent method for enhancing the connections already identified by the LLM. I aim to keep it incremental to avoid excessive post-processing techniques.