SforAiDl / decepticonlp

Python Library for Robustness Monitoring and Adversarial Debugging of NLP models
MIT License

Similarity metrics (New) #53

Open parantak opened 4 years ago

parantak commented 4 years ago
  1. Needleman-Wunsch Algorithm
  2. Smith-Waterman Algorithm

These algorithms were originally developed for DNA sequencing, but I read on SO that they are sometimes used as string similarity metrics as well, since they account for mismatches and gaps (spaces). Moreover, we can penalize gaps and mismatches according to a value of the user's choice. Should we implement this, @rajaswa and @someshsingh22? If yes, I'll do it in some time.
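A minimal character-level sketch of what this could look like (not the library's API; the function name and default scores are just for illustration): Needleman-Wunsch global alignment with user-configurable match, mismatch, and gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between strings a and b (illustrative sketch)."""
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


if __name__ == "__main__":
    print(needleman_wunsch("kitten", "sitting"))           # default penalties
    print(needleman_wunsch("kitten", "sitting", gap=-2))   # penalize gaps more heavily
```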
rajaswa commented 4 years ago

It'd be great if we had a source for this. It doesn't make sense to put effort into something that won't end up being used.

parantak commented 4 years ago

https://math.mit.edu/classes/18.417/Slides/alignment.pdf
http://sepwww.stanford.edu/public/docs/sep112/bob2/paper_html/node3.html
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#needleman

someshsingh22 commented 4 years ago

@parantak As far as I remember, the Smith-Waterman Algorithm in particular isn't relevant for strings. In the DNA case it matches a particular subsequence, so if the shorter sentence is just a subset of another string it would give 100% similarity: "My name" and "My name is Somesh Singh" would come out 100% the same (Smith-Waterman allows exclusion, while Needleman-Wunsch looks for inclusion). So personally I don't think it would be that useful. Check what we used for BIOF110 here and you will notice the difference.
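For illustration only (a rough sketch, not code from the repo): a Smith-Waterman-style local alignment score gives the shorter sentence the same perfect score against the longer one as against itself, because unaligned ends are free, which is exactly the behaviour being described.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (illustrative sketch)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero, so unmatched ends are free.
            dp[i][j] = max(0, diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            best = max(best, dp[i][j])
    return best


if __name__ == "__main__":
    short = "My name"
    full = "My name is Somesh Singh"
    print(smith_waterman(short, short))  # 7 -> perfect self-match
    print(smith_waterman(short, full))   # 7 -> same score despite the missing tail
```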

parantak commented 4 years ago

@someshsingh22 Right, sorry. I vaguely remembered them both. I searched a bit, and I believe Smith-Waterman is a local alignment algorithm whereas Needleman-Wunsch is a global alignment algorithm. So, I guess Smith-Waterman might not be as relevant. However, I am fairly sure Needleman-Wunsch should be a good metric because, unlike Levenshtein's static penalties, it allows different penalties for matches, mismatches, and gaps. As for Smith-Waterman, I'll look into it soon to gain a better understanding and to make sure we aren't missing out on anything.
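To make that contrast concrete (hypothetical helper names, not a proposed API): plain Levenshtein charges every edit the same unit cost, whereas a Needleman-Wunsch-style scheme lets gaps and mismatches carry separate, user-chosen costs.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions all cost 1."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[n][m]


def weighted_distance(a, b, mismatch=1, gap=2):
    """Same recurrence, but gaps and mismatches have separate, configurable costs."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i * gap
    for j in range(m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = min(sub, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


if __name__ == "__main__":
    print(levenshtein("kitten", "kit ten"))         # 1 -> gap costs the same as any edit
    print(weighted_distance("kitten", "kit ten"))   # 2 -> gaps can be penalized more
```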

someshsingh22 commented 4 years ago

Yes, that was my point, I don't remember Needleman-Wunsch well either. Do look for some supporting literature in similar domains of NLP before you move on though.

parantak commented 4 years ago

@someshsingh22 Yeah, of course.