catalyst-cooperative / ccai-entity-matching

An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.
MIT License

Create KNN Cosine Similarity Function #48

Closed katie-lamb closed 1 year ago

katie-lamb commented 1 year ago

After the tuples for each record are embedded into vectors, we run a similarity function to decide which record pairs are the best candidates. This set of candidate record pairs is then fed into the matching model.

KNN cosine similarity is a widely used approach for this step: for each "left side" tuple, choose the K best "right side" candidate match tuples based on the cosine similarity of the tuple embeddings. To start, we'll use a similarity threshold.
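As a rough illustration of the idea, here is a minimal brute-force sketch in plain Python. The names `cosine_similarity` and `knn_candidates`, the `k` and `threshold` parameters, and the toy vectors are all hypothetical and not taken from this repo. It compares every left vector to every right vector, so it is only meant to show the logic, not to scale.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_candidates(left, right, k=5, threshold=None):
    """Hypothetical helper: map each left index to its top-k right candidates.

    Each value is a list of (right_index, similarity) pairs, best first.
    If threshold is given, pairs scoring below it are dropped, so a left
    record may end up with fewer than k candidates.
    """
    candidates = {}
    for i, u in enumerate(left):
        # Score every right-side embedding against this left-side embedding.
        scored = ((j, cosine_similarity(u, v)) for j, v in enumerate(right))
        if threshold is not None:
            scored = ((j, s) for j, s in scored if s >= threshold)
        candidates[i] = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    return candidates


# Toy embeddings (made up for illustration only).
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
result = knn_candidates(left, right, k=2, threshold=0.5)
```

In this sketch the threshold and K are applied together: the threshold filters out weak pairs and K caps how many candidates survive per left record. Whether to use one, the other, or both is a design choice left open here.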

Packages like faiss will be helpful for creating this functionality.