Solve entity alignment problem

JasonLo commented 4 months ago

To align entities extracted from a large language model (LLM) with those in your database when exact matching and lemmatization have not been effective, consider the following approach:

Full Database Extraction: Extract entities from the entire database to ensure a comprehensive set of data for alignment.
Embedding Selection: Choose suitable embeddings that capture the semantic similarity of entities. For geographical entities, geo-embeddings might be useful, whereas general-purpose embeddings (like word2vec, GloVe, or BERT) could work well for a broader range of entities.
Clustering and Semi-Manual Alignment: Use clustering algorithms to group similar entities based on their embeddings. This step helps identify potential matches between the extracted entities and those in your database. After clustering, perform a semi-manual review to finalize the entity alignment, making adjustments as necessary to ensure accuracy.
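A minimal sketch of the clustering step, to make the idea concrete. It assumes cosine similarity and single-linkage grouping, neither of which is fixed by the approach above. The `embed` function is a toy stand-in (character trigram counts) so the example runs without a model; a real embedding model would replace it. The entity names and the `cluster` helper are illustrative only.

```python
import math
from collections import Counter, defaultdict


def embed(name: str, n: int = 3) -> Counter:
    """Toy stand-in for a real embedding model: character n-gram counts."""
    text = f"  {name.lower().strip()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(names: list[str], threshold: float) -> list[list[str]]:
    """Single-linkage clustering: link every pair with similarity >= threshold."""
    vectors = [embed(name) for name in names]
    parent = list(range(len(names)))  # union-find forest over name indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i, name in enumerate(names):
        groups[find(i)].append(name)
    return list(groups.values())


# Illustrative mix of database names and LLM-extracted names.
names = [
    "Wolfcamp Formation",      # database
    "wolfcamp formation",      # extracted
    "Bone Spring Formation",   # database
    "Bone Spring formation",   # extracted
    "Wolfcamp Fm.",            # extracted abbreviation
]
for group in cluster(names, threshold=0.9):
    print(group)
```

Each printed group is a candidate alignment for the semi-manual review. With this toy embedding the abbreviation "Wolfcamp Fm." ends up in its own group at a 0.9 threshold, which is the kind of case a reviewer would resolve by hand.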

JasonLo commented 3 months ago

In https://github.com/UW-xDD/text2graph_llm/tree/entity_alignment/experiments/entitiy_alignment/v0, I ran some tests to see whether embedding-based entity alignment helps clean up the entities extracted by the LLM pipeline.

Problem Statement

Entities extracted by large language models (LLMs) are often inconsistent, which leads to messy data. Our goal is to standardize these entities into a canonical form so that variants of the same name reference the same concept.

Procedure

Extract Entities: Identify and isolate entities from the data provided by the LLM.
Project to Semantic Space: Map these entities onto a semantic space where they can be analyzed based on meaning.
Define Canonical Form: Determine the canonical form of entities by measuring the semantic distance between them. This involves setting a similarity threshold manually to decide when entities are considered equivalent.

Summary of findings

The performance of embedding alignment methods seems similar to basic word parsers, as embeddings primarily encode information based on the concrete words (e.g., "granite", "sandstone", "limestone", "volcanics", " -member", " -formation") themselves without much additional context from the "name" parts.
Setting the similarity threshold at 0.9 is recommended to lower the risk of mistakenly associating new terms with known entities.
If creating new objects in Macrostrat is a goal, it might be necessary to have a human expert review a list of entities considered high risk.

To-do

Implement dynamic prompting for exact match scenarios.
Implement alignment based on known entity embeddings with a 0.9 similarity threshold to capture more known entities.
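A rough sketch of what the second to-do item could look like, assuming cosine similarity over precomputed embedding vectors. The `align_to_known` helper, the entity names, and the 3-dimensional vectors are all made up for illustration; real vectors would come from whichever embedding model is chosen. Names that do not reach the threshold are collected for human review instead of being matched.

```python
import math

SIMILARITY_THRESHOLD = 0.9  # threshold named in the to-do item above


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align_to_known(extracted, known, threshold=SIMILARITY_THRESHOLD):
    """Return (matches, needs_review).

    matches maps an extracted name to its closest known name when the
    similarity reaches the threshold; all other names go to needs_review.
    """
    matches, needs_review = {}, []
    for name, vector in extracted.items():
        best_name, best_score = max(
            ((k, cosine(vector, kv)) for k, kv in known.items()),
            key=lambda pair: pair[1],
            default=(None, 0.0),  # no known entities at all
        )
        if best_name is not None and best_score >= threshold:
            matches[name] = best_name
        else:
            needs_review.append(name)
    return matches, needs_review


# Hypothetical vectors, invented for this example only.
known = {
    "Wolfcamp Formation": [0.9, 0.1, 0.0],
    "Bone Spring Formation": [0.1, 0.9, 0.1],
}
extracted = {
    "Wolfcamp Fm": [0.88, 0.15, 0.02],
    "Unnamed sandstone": [0.5, 0.5, 0.5],
}
matches, needs_review = align_to_known(extracted, known)
print(matches)
print(needs_review)
```

The threshold is a parameter, so a stricter value can be passed in without changing the function.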


JasonLo commented 3 months ago

Reviewed on Apr 1. Follow-up:

[x] Use a higher threshold to start, approximately 0.95
