similarity between n-gram vectors (Metric)

SforAiDl / decepticonlp

Python Library for Robustness Monitoring and Adversarial Debugging of NLP models

MIT License

similarity between n-gram vectors (Metric) #30

Closed parantak closed 4 years ago

parantak commented 4 years ago

I believe one of us could implement the following:

Extract n-grams from both sentences.
Construct a vector for each sentence, where a '1' or a '0' indicates the presence or absence of an n-gram.
Calculate cosine similarity.
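
A minimal sketch of the three steps above, assuming whitespace tokenization and word-level n-grams. The function names `ngrams` and `binary_ngram_cosine` are hypothetical and not part of decepticonlp. Because the vectors are binary, the dot product is just the number of shared n-grams and each norm is the square root of the number of distinct n-grams in that sentence, so the vectors never need to be built explicitly.

```python
import math


def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def binary_ngram_cosine(a, b, n=2):
    """Cosine similarity between binary n-gram presence vectors.

    Assumption: sentences are tokenized by splitting on whitespace.
    """
    grams_a = ngrams(a.split(), n)
    grams_b = ngrams(b.split(), n)
    if not grams_a or not grams_b:
        # A sentence shorter than n tokens has an all-zero vector.
        return 0.0
    # Dot product of two 0/1 vectors = size of the intersection.
    shared = len(grams_a & grams_b)
    # Norm of a 0/1 vector = sqrt(number of ones).
    return shared / math.sqrt(len(grams_a) * len(grams_b))
```

For example, `binary_ngram_cosine("the cat sat on the mat", "the cat lay on the mat", n=2)` compares the two sets of five bigrams, three of which are shared.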

There is literature where people have used this metric, but I can't find it right now; I will update this later. Note: we could also use all n-grams for n = 1, 2, ..., k to capture more context, at the cost of more computation time.
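
The multi-order variant in the note could look roughly like this. It is only a sketch, again assuming whitespace tokenization; `ngrams_up_to` and `multi_order_cosine` are hypothetical names. The number of n-grams per sentence grows roughly in proportion to k, which is where the extra computation comes from.

```python
import math


def ngrams_up_to(tokens, k):
    """Return the set of all distinct n-grams for n = 1..k."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, k + 1)
        for i in range(len(tokens) - n + 1)
    }


def multi_order_cosine(a, b, k=3):
    """Binary cosine similarity over the union of 1..k-gram vocabularies.

    Assumption: sentences are tokenized by splitting on whitespace.
    """
    grams_a = ngrams_up_to(a.split(), k)
    grams_b = ngrams_up_to(b.split(), k)
    if not grams_a or not grams_b:
        return 0.0
    shared = len(grams_a & grams_b)
    return shared / math.sqrt(len(grams_a) * len(grams_b))
```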

someshsingh22 commented 4 years ago

Yeah, please update it. This is also very similar to Jaccard distance, which works on unigrams, so maybe we can pass a window size there. Unigrams only work for words and not for strings anyway. Nice suggestion!
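
One way the window-size idea might look, as a sketch only: a Jaccard similarity over n-gram sets, where `n` is the window size and `char_level=True` switches from words to characters so that it also applies to plain strings. The function name and both parameters are hypothetical and not taken from decepticonlp; the corresponding Jaccard distance is one minus the returned value.

```python
def jaccard_ngram_similarity(a, b, n=1, char_level=False):
    """Jaccard similarity between the n-gram sets of two texts.

    n          -- window size (n=1 gives the usual unigram Jaccard)
    char_level -- if True, build n-grams over characters instead of words
    """
    seq_a = list(a) if char_level else a.split()
    seq_b = list(b) if char_level else b.split()
    grams_a = {tuple(seq_a[i:i + n]) for i in range(len(seq_a) - n + 1)}
    grams_b = {tuple(seq_b[i:i + n]) for i in range(len(seq_b) - n + 1)}
    union = grams_a | grams_b
    if not union:
        # Neither text has any n-gram; return 0.0 by convention here.
        return 0.0
    return len(grams_a & grams_b) / len(union)
```

Unlike the cosine version, this divides by the size of the union rather than by the geometric mean of the two set sizes, so the two scores differ whenever the sets are not identical or disjoint.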