UW-xDD / text2graph_llm

An experimental API endpoint to convert text to knowledge graph triplets.
MIT License

Solve entity alignment problem #14

Closed JasonLo closed 3 months ago

JasonLo commented 4 months ago

To align entities extracted from a large language model (LLM) with those in your database when exact matching and lemmatization have not been effective, consider the following approach:

  1. Full Database Extraction: Extract entities from the entire database to ensure a comprehensive set of data for alignment.
  2. Embedding Selection: Choose suitable embeddings that capture the semantic similarity of entities. For geographical entities, geo-embeddings might be useful, whereas general-purpose embeddings (like word2vec, GloVe, or BERT) could work well for a broader range of entities.
  3. Clustering and Semi-Manual Alignment: Utilize clustering algorithms to group similar entities based on their embeddings. This step helps identify potential matches or alignments between the extracted entities and those in your database. After clustering, perform a semi-manual review to finalize the entity alignment, making adjustments as necessary to ensure accuracy.
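As a rough illustration of how the review step could be fed, here is a minimal sketch that ranks database entities by similarity for each extracted entity, so a person can confirm or reject the top candidates. It makes several assumptions: `embed` is a stand-in that uses character trigram counts instead of a real embedding model (word2vec, GloVe, BERT, or a geo-embedding would replace it), a nearest-neighbour ranking is used in place of a full clustering algorithm, and the names `propose_alignments`, `database`, and `extracted` plus the sample strings are hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Stand-in embedding (assumption): character n-gram counts, not a semantic model."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propose_alignments(extracted, database, top_k=2):
    """Rank database entities for each extracted entity, for manual review."""
    db_vectors = {name: embed(name) for name in database}
    proposals = {}
    for entity in extracted:
        vec = embed(entity)
        ranked = sorted(
            ((cosine(vec, db_vec), name) for name, db_vec in db_vectors.items()),
            reverse=True,
        )
        proposals[entity] = [(name, round(score, 2)) for score, name in ranked[:top_k]]
    return proposals


# Hypothetical sample data, for illustration only.
database = ["Madison, Wisconsin", "Milwaukee, Wisconsin", "Lake Michigan"]
extracted = ["madison wisconsin", "Lake Michigan."]

proposals = propose_alignments(extracted, database)
for entity, candidates in proposals.items():
    print(entity, "->", candidates)
```

Returning the top few candidates with their scores, rather than a single match, keeps the final decision with the reviewer, which fits the semi-manual step described above.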

JasonLo commented 3 months ago

In https://github.com/UW-xDD/text2graph_llm/tree/entity_alignment/experiments/entitiy_alignment/v0

I did some testing to see whether embedding-based entity alignment helps clean up the entities extracted from the LLM pipeline.

Problem Statement

Entities extracted from Large Language Models (LLMs) often lack consistency, leading to messy data. Our goal is to standardize these entities into a canonical form to ensure they reference the same concept.

Procedure

  1. Extract Entities: Identify and isolate entities from the data provided by the LLM.
  2. Project to Semantic Space: Map these entities onto a semantic space where they can be analyzed based on meaning.
  3. Define Canonical Form: Determine the canonical form of entities by measuring the semantic distance between them. This involves setting a similarity threshold manually to decide when entities are considered equivalent.
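The three steps above could be sketched roughly as follows. This is not the code from the linked experiment; it is a minimal illustration under stated assumptions: `embed` uses character trigram counts as a stand-in for a real semantic embedding, the grouping is a simple greedy pass rather than a particular clustering algorithm, the most frequent spelling in each group is taken as the canonical form, and the `threshold` value of 0.6 and the sample mentions are hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Stand-in embedding (assumption): character n-gram counts, not a semantic model."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def canonicalize(mentions, threshold=0.6):
    """Map each mention to a canonical form using a manually chosen threshold."""
    groups = []  # each item: (canonical form, its vector, member mentions)
    # Visit frequent mentions first so the most common spelling becomes canonical.
    for mention, _ in Counter(mentions).most_common():
        vec = embed(mention)
        for canonical, canon_vec, members in groups:
            if cosine(vec, canon_vec) >= threshold:
                members.append(mention)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append((mention, vec, [mention]))
    return {m: canonical for canonical, _, members in groups for m in members}


# Hypothetical mentions, as they might come out of an LLM extraction step.
mentions = ["sandstone", "sandstone", "Sandstones", "sand stone", "limestone"]
mapping = canonicalize(mentions)
for mention, canonical in mapping.items():
    print(f"{mention!r} -> {canonical!r}")
```

The threshold controls the trade-off: set it too low and distinct entities merge (here "limestone" and "sandstone" share a suffix), set it too high and spelling variants stay separate, which is why the value is tuned by hand.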

Summary of findings

To-do

Raw details

JasonLo commented 3 months ago

Reviewed on Apr 1, follow up on