Closed JasonLo closed 3 months ago
In https://github.com/UW-xDD/text2graph_llm/tree/entity_alignment/experiments/entitiy_alignment/v0 I ran some tests to see whether embedding-based entity alignment helps clean up the entities extracted by the LLM pipeline.
Entities extracted by Large Language Models (LLMs) are often inconsistent, leading to messy data. Our goal is to standardize these entities into a canonical form so that mentions of the same concept map to a single entity.
Reviewed on Apr 1, follow up on
To align entities extracted from a large language model (LLM) with those in the database when exact matching and lemmatization have not been effective, consider the following approach:
1. **Full database extraction:** Extract entities from the entire database to ensure a comprehensive set of data for alignment.
2. **Embedding selection:** Choose embeddings that capture the semantic similarity of entities. For geographical entities, geo-embeddings might be useful, whereas general-purpose embeddings (e.g., word2vec, GloVe, or BERT) could work well for a broader range of entities.
3. **Clustering and semi-manual alignment:** Use a clustering algorithm to group similar entities based on their embeddings. This step surfaces candidate matches between the extracted entities and those in the database. After clustering, perform a semi-manual review to finalize the alignment, adjusting as necessary to ensure accuracy.
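A minimal sketch of the alignment step, using character n-gram vectors and cosine similarity as a lightweight stand-in for the real embeddings (word2vec/GloVe/BERT or geo-embeddings would slot in where `char_ngrams` is). The entity names, `align` function, and threshold are illustrative, not from the pipeline:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Toy 'embedding': bag of character trigrams (stand-in for a real model)."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[g] * b[g] for g in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def align(extracted: list[str], canonical: list[str], threshold: float = 0.5) -> dict:
    """Map each extracted entity to its closest canonical entity, or None
    when nothing clears the similarity threshold (flag those for manual review)."""
    canon_vecs = {c: char_ngrams(c) for c in canonical}
    mapping = {}
    for e in extracted:
        v = char_ngrams(e)
        best, score = None, 0.0
        for c, cv in canon_vecs.items():
            s = cosine(v, cv)
            if s > score:
                best, score = c, s
        mapping[e] = best if score >= threshold else None
    return mapping

# Hypothetical entity names for illustration only.
mapping = align(["sandstones", "Sandstone", "granite intrusion"],
                ["sandstone", "granite", "shale"])
# Variant surface forms collapse onto one canonical entity; anything mapped
# to None would go into the semi-manual review queue.
```

With real embeddings, the same nearest-neighbor loop would be replaced by a vector index (or a clustering pass over all entities at once) for scale, but the thresholding-plus-review structure stays the same.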