I believe one of us could implement the following:
Extract n-grams from both sentences.
Construct a binary vector for each sentence over the combined set of n-grams, where a '1' or a '0' indicates the presence or absence of an n-gram.
Calculate cosine similarity.
There's literature where people have used this metric, but I can't seem to find it right now. I will update this later.
Note: we could also use all (1, 2, ..., k)-grams to capture more context, which comes at the cost of more computation time.
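For reference, the steps above could be sketched roughly as follows. This is only a minimal sketch, not a proposed final API: the names `ngrams` and `ngram_cosine`, the `all_orders` flag, and the whitespace/lowercase tokenization are all assumptions made for illustration. It relies on the fact that for binary vectors the dot product is the number of shared n-grams and each norm is the square root of the number of distinct n-grams in that sentence, so the vectors never need to be built explicitly.

```python
import math


def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_cosine(sent_a, sent_b, n=2, all_orders=False):
    """Cosine similarity between binary n-gram presence vectors.

    Hypothetical helper: tokenization is a plain lowercase whitespace
    split (an assumption). With all_orders=True, all 1..n-grams are used.
    """
    tokens_a = sent_a.lower().split()
    tokens_b = sent_b.lower().split()
    orders = range(1, n + 1) if all_orders else [n]

    grams_a = set().union(*(ngrams(tokens_a, k) for k in orders))
    grams_b = set().union(*(ngrams(tokens_b, k) for k in orders))

    # A sentence with no n-grams has a zero vector; define similarity as 0.
    if not grams_a or not grams_b:
        return 0.0

    # Binary vectors: dot product = |A & B|, norms = sqrt(|A|) and sqrt(|B|).
    shared = len(grams_a & grams_b)
    return shared / math.sqrt(len(grams_a) * len(grams_b))
```

Something like `ngram_cosine("the cat sat on the mat", "the cat lay on the mat", n=2)` compares the two sentences on bigrams only, while `all_orders=True` would fold in unigrams as well. On sets this differs from Jaccard only in the denominator (geometric mean of the set sizes instead of the size of the union).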
Yeah, please update it. This is also very similar to Jaccard distance; Jaccard consists of unigrams anyway, so maybe we can pass a window size. Unigrams only work for words and not strings anyway. Nice suggestion!