SforAiDl / decepticonlp

Python Library for Robustness Monitoring and Adversarial Debugging of NLP models
MIT License

similarity between n-gram vectors (Metric) #30

Closed parantak closed 4 years ago

parantak commented 4 years ago

I believe one of us could implement the following:

  1. Extract n-grams from both sentences.
  2. Construct a binary vector for each sentence over the combined n-gram vocabulary, where a '1' or a '0' indicates the presence or absence of an n-gram.
  3. Calculate cosine similarity.
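A rough sketch of these three steps, assuming whitespace tokenization and word-level n-grams. The names `ngrams` and `ngram_cosine_similarity` are made up for illustration and are not part of decepticonlp. Because the vectors are binary, they never need to be built explicitly: the dot product is the number of shared n-grams and each norm is the square root of the number of distinct n-grams in that sentence.

```python
import math


def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_cosine_similarity(sentence_a, sentence_b, n=2):
    """Cosine similarity between binary n-gram presence vectors.

    Hypothetical helper: tokenization is a plain str.split(), which is an
    assumption and not something the issue specifies.
    """
    grams_a = ngrams(sentence_a.split(), n)
    grams_b = ngrams(sentence_b.split(), n)
    if not grams_a or not grams_b:
        # A sentence shorter than n tokens has no n-grams; treat as no overlap.
        return 0.0
    # Binary vectors: dot product = |A & B|, norms = sqrt(|A|) and sqrt(|B|).
    shared = len(grams_a & grams_b)
    return shared / math.sqrt(len(grams_a) * len(grams_b))


if __name__ == "__main__":
    print(ngram_cosine_similarity("the cat sat on the mat",
                                  "the cat sat on the rug", n=2))
```

The two example sentences each have five distinct bigrams and share four of them, so the score is 4 / sqrt(5 * 5).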

There's literature where people have used this metric, but I can't find it right now; I will update this later. Note: we could also use all (1, 2, ..., k)-grams to capture more context, at the cost of more computation time.
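The (1, 2, ..., k)-gram variant could look something like this, again assuming whitespace tokens. `ngrams_up_to` and `multi_order_cosine` are hypothetical names. The extra cost shows up directly: each sentence of length m contributes roughly k * m n-grams instead of m.

```python
import math


def ngrams_up_to(tokens, k):
    """Collect every n-gram of order 1..k as a set of tuples."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, k + 1)
        for i in range(len(tokens) - n + 1)
    }


def multi_order_cosine(sentence_a, sentence_b, k=3):
    """Cosine similarity over binary presence vectors of all 1..k-grams."""
    grams_a = ngrams_up_to(sentence_a.split(), k)
    grams_b = ngrams_up_to(sentence_b.split(), k)
    if not grams_a or not grams_b:
        return 0.0  # at least one sentence is empty
    shared = len(grams_a & grams_b)
    return shared / math.sqrt(len(grams_a) * len(grams_b))


if __name__ == "__main__":
    # Each sentence has 3 unigrams + 2 bigrams; they share "a", "b", "a b".
    print(multi_order_cosine("a b c", "a b d", k=2))
```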

someshsingh22 commented 4 years ago

Yeah, please update it. This is also very similar to Jaccard distance; Jaccard works on unigrams, so maybe we can pass a window size. Unigrams only work for words anyway, not strings. Nice suggestion!
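For comparison, a Jaccard distance with a window size might be sketched as below. This is only an illustration, not the existing decepticonlp implementation; `sequence_ngrams`, `jaccard_distance`, and the parameter `n` are assumed names. It accepts either a token list or a raw string, so with a string it operates on character n-grams.

```python
def sequence_ngrams(seq, n):
    """Set of n-grams over any sliceable sequence (token list or string)."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard_distance(a, b, n=1):
    """1 - |A & B| / |A | B| over n-gram sets; n=1 is the unigram case."""
    grams_a = sequence_ngrams(a, n)
    grams_b = sequence_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Neither input has any n-grams; treat them as identical.
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    # Word-level unigrams on token lists.
    print(jaccard_distance("the cat sat".split(), "the cat ran".split(), n=1))
    # Character-level bigrams on raw strings.
    print(jaccard_distance("night", "nacht", n=2))
```

The main difference from the cosine version is the denominator: Jaccard divides the overlap by the size of the union, while cosine on binary vectors divides by the geometric mean of the two set sizes.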