MaartenGr / PolyFuzz

Fuzzy string matching, grouping, and evaluation.
https://maartengr.github.io/PolyFuzz/
MIT License

Question about the BERT-based approach #16

Closed. NicolasMontes closed this issue 3 years ago.

NicolasMontes commented 3 years ago

I have a question about the BERT-based approach. When you compute the similarity of "apple" and "app", each string must first be converted into a dense vector (an embedding). But to get a contextual embedding for "apple" (and likewise for "app"), the token normally has to appear inside a full sentence. How are these embeddings obtained when only a single word is provided (for example, the token "apple")?

MaartenGr commented 3 years ago

Sorry for the late response! Typically, I would actually not suggest using contextual embeddings for string-similarity tasks. This is, in part, due to exactly what you describe: the context is missing. Without context, it becomes more difficult for contextualized models such as BERT to map strings to each other.

Having said that, pre-trained BERT models do have knowledge of many words and can produce word embeddings without needing context. Context is merely preferred when using these types of embeddings.
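To make this concrete, here is a minimal sketch (not PolyFuzz's exact internals) of embedding a single word with BERT via Hugging Face `transformers`: the word is simply fed in as a one-token "sentence", so the model still produces a vector, just without real surrounding context. The model name and mean-pooling choice are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(word: str) -> torch.Tensor:
    # Tokenize the lone word; BERT may split it into sub-word pieces,
    # which we average into a single vector below.
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden state over all (sub)tokens.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

apple, app = embed("apple"), embed("app")
similarity = torch.cosine_similarity(apple, app, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```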

In practice, I am a big fan of pre-trained FastText embeddings, and I would personally suggest using those if you are looking for a semantic mapping.
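For completeness, here is a hedged sketch of plugging FastText vectors into PolyFuzz through flair's `WordEmbeddings`, following the `Embeddings` pattern from the PolyFuzz documentation. The `"en-crawl"` identifier (flair's FastText vectors trained on Common Crawl) is an assumption; swap in another embedding name if it is unavailable.

```python
from flair.embeddings import WordEmbeddings
from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings

# Wrap flair's FastText embeddings in a PolyFuzz matcher.
fasttext = Embeddings(WordEmbeddings("en-crawl"), min_similarity=0, model_id="FastText")
model = PolyFuzz(fasttext)

from_list = ["apple", "apples", "appl"]
to_list = ["apple", "app"]
model.match(from_list, to_list)
print(model.get_matches())
```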