Programmatically filtering "false" hits

caguillen214 commented 10 years ago

Saw this project on reddit and thought it was really awesome. Keep up the good work, its a great a idea. Anyway, I saw that you said you were manually rejecting "false" hits which included identical tweets or transpositions of few letters and I thought to myself that could be an easy fix.

I wrote a simple abstracted implementation to remove some of the "false" hits (in javascript) of this mainly because I was too lazy to look how to implement it in python or how to integrate it with your project and submit a PR haha.

Hopefully it helps. Maybe at least it keeps you from having to think of a solution haha. Awesome project!

Carlos

//tweets would have to be stripped of things you exclude mentioned in your readme like @'s
function tooSimilar(tweet1, tweet2) {
    var LD_THRESHOLD = 4, //max changes between the tweets (insertions, deletions, transpositions)
        TWEET_WORD_SIMILARITY = 3, //max number of words in common
        matchCount = 0;

    tweet1.split(' ').map(function(word) {
        //ignore words like pronouns, articles, etc.
        if(IGNORED_WORDS_HASH[word] === undefined && tweet2.indexOf(word) > -1) { 
            matchCount++;
        }
    });

   //lev_Dist can be expensive so we want to kick out early if possible
    if(matchCount > TWEET_WORD_SIMILARITY) {
        return true;
    }

    var lev_Dist = levenshtein(tweet1, tweet2);    
    return (lev_Dist < LD_THRESHOLD);
}   

//credit: http://stackoverflow.com/questions/11919065/sort-an-array-by-the-levenshtein-distance-with-best-performance-in-javascript
//info: http://en.wikipedia.org/wiki/Levenshtein_distance
function levenshtein(s, t) {
      var d = []; //2d matrix

    // Step 1
    var n = s.length;
    var m = t.length;

    if (n == 0) return m;
    if (m == 0) return n;

    //Create an array of arrays in javascript (a descending loop is quicker)
    for (var i = n; i >= 0; i--) d[i] = [];

    // Step 2
    for (var i = n; i >= 0; i--) d[i][0] = i;
    for (var j = m; j >= 0; j--) d[0][j] = j;

    // Step 3
    for (var i = 1; i <= n; i++) {
        var s_i = s.charAt(i - 1);

        // Step 4
        for (var j = 1; j <= m; j++) {

            //Check the jagged ld total so far
            if (i == j && d[i][j] > 4) return n;

            var t_j = t.charAt(j - 1);
            var cost = (s_i == t_j) ? 0 : 1; // Step 5

            //Calculate the minimum
            var mi = d[i - 1][j] + 1;
            var b = d[i][j - 1] + 1;
            var c = d[i - 1][j - 1] + cost;

            if (b < mi) mi = b;
            if (c < mi) mi = c;

            d[i][j] = mi; // Step 6

            //Damerau transposition (checks for transposition of letters e.g. haet and hate have a Damerau-Lev distance
            // of 1 instead of Lev distance of 2). Can be removed for optimization
            if (i > 1 && j > 1 && s_i == t.charAt(j - 2) && s.charAt(i - 2) == t_j) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
            }
        }
    }

    // Step 7
    return d[n][m];
}

cmyr commented 10 years ago

hey sorry I've been super busy, but thanks for posting this. I didn't know about levenshtein distance when I first wrote this, and I ended up writing a bunch of shitty comparison filters. I'll eventually implement this change, but it'll have to wait a little bit. Thanks again for the note though!

JoshKeegan commented 9 years ago

This would be a really nice addition!

I don't think there's any need to consider addition or deletion of characters though, since for two messages to have gotten this far they must be equal in length. So you could use the Hamming distance instead of the Levenshtein distance which would make this trivial to implement.

The only thing worth nothing is that the implementation of the Levenshtein distance in the previous post is actually the Damerau–Levenshtein distance, which counts character transposition as a single operation. Since all operations required to convert between the two anagram strings are transpositions, the distance returned from using just the Levenshtein or Hamming distance will be half that of the Damerau–Levenshtein distance, so the threshold to decide whether or not the strings are too similar will need to be doubled.

caguillen214 commented 9 years ago

That's a great point, and thanks for the info! I hadn't heard of the hamming distance, I'll definitely keep that in my rolodex of nifty tricks

cmyr commented 9 years ago

Looks like I’m going to have to actually get around to doing this now, huh?

caguillen214 commented 9 years ago

Yeah I believe two random people commenting about it means it must be done. Internet law stuff

bdrupieski commented 6 years ago

Sorry for digging up an old issue but I wanted to share my own experience with using Damerau–Levenshtein distance on tweets in case it helps anyone here.

Like @caguillen214 I saw this project and thought Damerau–Levenshtein distance would be a great way to score anagrammed tweets. I went ahead and implemented it, including Damerau–Levenshtein distance as one metric in a normalized score of how interesting a candidate anagram match is. After processing the tweet text to normalize Unicode characters, strip spaces out spaces, strip out weird characters, and convert everything to the same case, this normalized score equally combines:

Total length of the tweets in the match
Length of the longest common substring between the two tweets
Total number of English words across the two tweets
Total number of words across the two tweets
Damerau–Levenshtein distance between the two tweets

I review matches and post some to my own bot account. I've now posted and rejected quite a few anagram matches.

I ran some queries to see how the above 5 metrics affect which anagrams I'm likely to post to the account (meaning, how different/funny/witty/interesting the two tweets in the match seem to me). I've found that the above 5 metrics individually contribute to the likelihood of me posting a match ranked in the order shown above. In other words, length alone is the best indicator, and Damerau–Levenshtein distance is the worst part of my score. If you want to see some numbers and graphs, I included some in the "Scoring Metric Efficiency" section of this blog post. I would probably have a better score using only the first four metrics by completely removing Damerau–Levenshtein distance.

However, the judgment that Damerau–Levenshtein distance is not a good way to measure how interesting two anagrammed tweets are is biased by my own idea of what I should or should not post to my bot account. I've posted a bunch of crappy matches in the past that probably would not have met anagramatron's bar, so take this with a grain of salt.

cmyr commented 6 years ago

@bdrupieski I'm on the same page, and in general I think that while things like edit distance can be a useful screening step, in general the human conception of "what is interesting" is hard to model.

An approach I think would be interesting would be to use a neural network; using word2vec to encode each tweet, and then training a linear classifier on pairs of tweets, using manually approved/rejected pairs as the training/test data. I'm not sure exactly how many manually approved pairs I have, but it's quite a few.

This approach would require a reasonable outlay in engineering time, and I haven't been thinking about this stuff much lately (I was doing something similar to this for work a while ago, though, so I at least have a glimmer that this approach could work...) in any case, it's firmly in the 'maybe one day' pile. :)

cmyr / anagramatron

Programmatically filtering "false" hits #2