MeoMix / StreamusChromeExtension

A YouTube video player as a Google Chrome extension
http://streamus.com
Apache License 2.0
1.08k stars 216 forks source link

Algorithm for song title comparison #173

Open MeoMix opened 10 years ago

MeoMix commented 10 years ago

Consider the following:

Gramatik - Just Jammin': https://www.youtube.com/watch?v=lYKRPzOi1zI Gramatik - Just Jammin: https://www.youtube.com/watch?v=NzsRGifbHYw (HQ) Gramatik - Just Jammin' [Street Bangerz Vol. 2]: https://www.youtube.com/watch?v=_7qhdcaX8Q0

These three videos all represent the same song. When running Streamus in Radio Mode, Streamus needs to be able to avoid songs which are already in the stream. Currently, it does so by checking for exact video ID and comparing 'cleaned' titles:

https://github.com/MeoMix/StreamusChromeExtension/blob/master/src/js/background/collection/streamItems.js#L345

Here's the logic for cleaning a title:

https://github.com/MeoMix/StreamusChromeExtension/blob/master/src/js/common/model/utility.js#L52

This is a good start, but I think we also need to calculate the Levenshtein Distance, http://en.wikipedia.org/wiki/Levenshtein_distance, between the two names while also taking into account the length of the name to determine a good # for the distance.

Maybe that isn't the right algorithm to use. Open to other ideas, but it would be nice if someone could look into this!

MeoMix commented 9 years ago

Dice's similarity coefficient:

var sSimilarity = function(sa1, sa2){
    // Compare two strings to see how similar they are.
    // Answer is returned as a value from 0 - 1
    // 1 indicates a perfect similarity (100%) while 0 indicates no similarity (0%)
    // Algorithm is set up to closely mimic the mathematical formula from
    // the article describing the algorithm, for clarity. 
    // Algorithm source site: http://www.catalysoft.com/articles/StrikeAMatch.html
    // (Most specifically the slightly cryptic variable names were written as such
    // to mirror the mathematical implementation on the source site)
    //
    // 2014-04-03
    // Found out that the algorithm is an implementation of the Sørensen–Dice coefficient [1]
    // [1] http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
    //
    // The algorithm is an n-gram comparison of bigrams of characters in a string

    // for my purposes, comparison should not check case or whitespace
    var s1 = sa1.replace(/\s/g, "").toLowerCase();
    var s2 = sa2.replace(/\s/g, "").toLowerCase();

    function intersect(arr1, arr2) {
        // I didn't write this.  I'd like to come back sometime
        // and write my own intersection algorithm.  This one seems
        // clean and fast, though.  Going to try to find out where
        // I got it for attribution.  Not sure right now.
        var r = [], o = {}, l = arr2.length, i, v;
        for (i = 0; i < l; i++) {
            o[arr2[i]] = true;
        }
        l = arr1.length;
        for (i = 0; i < l; i++) {
            v = arr1[i];
            if (v in o) {
                r.push(v);
            }
        }
        return r;
    }

    var pairs = function(s){
        // Get an array of all pairs of adjacent letters in a string
        var pairs = [];
        for(var i = 0; i < s.length - 1; i++){
            pairs[i] = s.slice(i, i+2);
        }
        return pairs;
    }    

    var similarity_num = 2 * intersect(pairs(s1), pairs(s2)).length;
    var similarity_den = pairs(s1).length + pairs(s2).length;
    var similarity = similarity_num / similarity_den;
    return similarity;
};
brianlyn commented 9 years ago

An interesting edge case to think about with this approach will be when two video titles correspond to different songs, but are still highly similar. This might happen with video titles including long artist names and/or similar song names. As an extreme example, comparing "My Morning Jacket 'Touch Me I'm Going To Scream Part II'" with "My Morning Jacket - Touch Me I'm Going To Scream, Part 1" gives a similarity score of ~0.89 using the above implementation.

MeoMix commented 9 years ago

It's definitely worth considering, but I think it's much preferred to not repeat an existing song than to miss a different song when using radio mode. You'll never know what songs you miss, but you do know if you've heard it before :)

On 5/29/15, brianlyn notifications@github.com wrote:

An interesting edge case to think about with this approach will be when two video titles correspond to different songs, but are still highly similar. This might happen with video titles including long artist names and/or similar song names. As an extreme example, comparing "My Morning Jacket 'Touch Me I'm Going To Scream Part II'" with "[My Morning Jacket


Reply to this email directly or view it on GitHub: https://github.com/MeoMix/StreamusChromeExtension/issues/173#issuecomment-106905735