Create KNN Cosine Similarity Function

After the tuples for each record are embedded into vectors, we run a similarity function to decide which record pairs are the best candidates. This set of candidate pairs is then fed into the matching model.

KNN search over cosine similarity is a widely used approach for this step. For each "left side" tuple, it selects the K "right side" candidate tuples whose embeddings have the highest cosine similarity to it. To start, we'll use a similarity threshold.
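As a rough illustration of the selection described above, here is a minimal brute-force sketch in NumPy. The function name `knn_cosine_candidates`, its parameters, and the way the threshold is combined with K (keep the top K, then drop anything below the threshold) are assumptions made for this example, not decisions taken in this issue.

```python
import numpy as np


def knn_cosine_candidates(left, right, k=5, threshold=None):
    """Return (left_idx, right_idx, similarity) candidate pairs.

    Hypothetical helper: `left` and `right` are 2-D arrays of tuple
    embeddings, one row per record.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)

    # Scale each row to unit length so a dot product equals cosine similarity.
    # The small floor avoids division by zero for all-zero embeddings.
    left_unit = left / np.maximum(np.linalg.norm(left, axis=1, keepdims=True), 1e-12)
    right_unit = right / np.maximum(np.linalg.norm(right, axis=1, keepdims=True), 1e-12)

    # Full similarity matrix: one row per left record, one column per right record.
    sims = left_unit @ right_unit.T

    # Indices of the k most similar right records for each left record.
    k = min(k, right.shape[0])
    top_idx = np.argsort(-sims, axis=1)[:, :k]

    pairs = []
    for i, row in enumerate(top_idx):
        for j in row:
            score = float(sims[i, j])
            # Assumption: the threshold is an extra filter applied after top-k.
            if threshold is None or score >= threshold:
                pairs.append((i, int(j), score))
    return pairs
```

Calling `knn_cosine_candidates(left_embeddings, right_embeddings, k=5, threshold=0.8)` would return at most five candidates per left record, all with similarity of at least 0.8. The full similarity matrix makes this quadratic in memory, so it is only meant for small inputs.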

Packages like faiss should be helpful for implementing this.
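One possible shape for a faiss-based version is sketched below. It assumes the `faiss-cpu` (or `faiss-gpu`) package is installed and uses an exact inner-product index over L2-normalized vectors, which gives cosine similarity. The function name `faiss_knn_cosine` and the choice of `IndexFlatIP` are assumptions for illustration; an approximate index may be a better fit for large datasets.

```python
import numpy as np


def faiss_knn_cosine(left, right, k=5):
    """Hypothetical helper: top-k cosine neighbours in `right` for each row of `left`."""
    import faiss  # assumption: faiss-cpu or faiss-gpu is installed

    # faiss expects contiguous float32 arrays; np.array copies, so the
    # in-place normalization below does not modify the caller's data.
    left = np.array(left, dtype=np.float32, order="C")
    right = np.array(right, dtype=np.float32, order="C")

    # After L2 normalization, inner product equals cosine similarity.
    faiss.normalize_L2(left)
    faiss.normalize_L2(right)

    # Exact (brute-force) inner-product index over the right-side embeddings.
    index = faiss.IndexFlatIP(right.shape[1])
    index.add(right)

    k = min(k, right.shape[0])
    # scores[i, n] is the similarity of the n-th best match for left row i;
    # indices[i, n] is that match's row number in `right`.
    scores, indices = index.search(left, k)
    return scores, indices
```

A similarity threshold could then be applied by masking `scores`, for example `scores >= 0.8`, before building the candidate pairs.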
