Open MeoMix opened 10 years ago
Dice's similarity coefficient:
var sSimilarity = function(sa1, sa2){
// Compare two strings to see how similar they are.
// Answer is returned as a value from 0 - 1
// 1 indicates a perfect similarity (100%) while 0 indicates no similarity (0%)
// Algorithm is set up to closely mimic the mathematical formula from
// the article describing the algorithm, for clarity.
// Algorithm source site: http://www.catalysoft.com/articles/StrikeAMatch.html
// (Most specifically the slightly cryptic variable names were written as such
// to mirror the mathematical implementation on the source site)
//
// 2014-04-03
// Found out that the algorithm is an implementation of the Sørensen–Dice coefficient [1]
// [1] http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
//
// The algorithm is an n-gram comparison of bigrams of characters in a string
// for my purposes, comparison should not check case or whitespace
var s1 = sa1.replace(/\s/g, "").toLowerCase();
var s2 = sa2.replace(/\s/g, "").toLowerCase();
function intersect(arr1, arr2) {
// I didn't write this. I'd like to come back sometime
// and write my own intersection algorithm. This one seems
// clean and fast, though. Going to try to find out where
// I got it for attribution. Not sure right now.
var r = [], o = {}, l = arr2.length, i, v;
for (i = 0; i < l; i++) {
o[arr2[i]] = true;
}
l = arr1.length;
for (i = 0; i < l; i++) {
v = arr1[i];
if (v in o) {
r.push(v);
}
}
return r;
}
var pairs = function(s){
// Get an array of all pairs of adjacent letters in a string
var pairs = [];
for(var i = 0; i < s.length - 1; i++){
pairs[i] = s.slice(i, i+2);
}
return pairs;
}
var similarity_num = 2 * intersect(pairs(s1), pairs(s2)).length;
var similarity_den = pairs(s1).length + pairs(s2).length;
var similarity = similarity_num / similarity_den;
return similarity;
};
An interesting edge case to think about with this approach will be when two video titles correspond to different songs, but are still highly similar. This might happen with video titles including long artist names and/or similar song names. As an extreme example, comparing "My Morning Jacket 'Touch Me I'm Going To Scream Part II'" with "My Morning Jacket - Touch Me I'm Going To Scream, Part 1" gives a similarity score of ~0.89 using the above implementation.
It's definitely worth considering, but I think it's much preferred to not repeat an existing song than to miss a different song when using radio mode. You'll never know what songs you miss, but you do know if you've heard it before :)
On 5/29/15, brianlyn notifications@github.com wrote:
An interesting edge case to think about with this approach will be when two video titles correspond to different songs, but are still highly similar. This might happen with video titles including long artist names and/or similar song names. As an extreme example, comparing "My Morning Jacket 'Touch Me I'm Going To Scream Part II'" with "[My Morning Jacket
- Touch Me I'm Going To Scream, Part 1](https://www.youtube.com/watch?v=y3UlY8PKUB8)" gives a similarity score of ~0.89 using the above implementation.
Reply to this email directly or view it on GitHub: https://github.com/MeoMix/StreamusChromeExtension/issues/173#issuecomment-106905735
Consider the following:
Gramatik - Just Jammin': https://www.youtube.com/watch?v=lYKRPzOi1zI Gramatik - Just Jammin: https://www.youtube.com/watch?v=NzsRGifbHYw (HQ) Gramatik - Just Jammin' [Street Bangerz Vol. 2]: https://www.youtube.com/watch?v=_7qhdcaX8Q0
These three videos all represent the same song. When running Streamus in Radio Mode, Streamus needs to be able to avoid songs which are already in the stream. Currently, it does so by checking for exact video ID and comparing 'cleaned' titles:
https://github.com/MeoMix/StreamusChromeExtension/blob/master/src/js/background/collection/streamItems.js#L345
Here's the logic for cleaning a title:
https://github.com/MeoMix/StreamusChromeExtension/blob/master/src/js/common/model/utility.js#L52
This is a good start, but I think we also need to calculate the Levenshtein Distance, http://en.wikipedia.org/wiki/Levenshtein_distance, between the two names while also taking into account the length of the name to determine a good # for the distance.
Maybe that isn't the right algorithm to use. Open to other ideas, but it would be nice if someone could look into this!