
Fuzzy Prefix Searching with a Trie #4


Exercises in Text Prediction

The goal was to create a word suggestion tool (with some fuzzing) using a trie -- implemented in Go. The final code.

End Result

Success Criteria

Setup

First, I need some words. The system dictionary "/usr/share/dict/words" is pretty good, but it has 236k words whereas this English word frequency dataset from Kaggle has more words as well as their frequencies.

This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.

The data is comma-separated ("word,freq"), but I'm going to convert it to JSON for quick loading (unmarshaling) into memory as a Go map. Also, I'm going to replace the frequency count with the word's commonality rank because I will be using a min-heap, and ranks give more context -- knowing how many times a word occurred in the corpus doesn't explicitly tell you how common it is relative to the other 333k words.

# Get list of words for building trie
awk -F , '{print $1}' original-dataset.csv > words.txt

# The words are in order by frequency -> the line number is the word's rank
awk -F , '
  BEGIN {print "{"}
  {printf "  \"%s\": %d,\n", $1, NR}
  END {print "}"}' words.txt > ranks.json

# trim the last trailing comma -- I could've used jq
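
For illustration, the first few entries of ranks.json should look something like this (the exact words depend on the dataset's ordering):

{
  "the": 1,
  "of": 2,
  "and": 3,
  ...
}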

L0: Prefix Searching

How can we find all words that start with "a"? What about words that start with "pre"?

grep "^a.*" words.txt

grep "^pre.*" words.txt

L1: Trie-ing Harder

A trie is a prefix tree where each node stores a map of child nodes keyed by character (string keys in my implementation). Full string values are not stored in the nodes; instead, a node is marked when it ends a word. For level 1, I will construct a trie from the dictionary (words.txt) and use that trie as my search index.

[example trie diagram]

type Node struct {
  Children map[string]*Node
  IsWord   bool
}

Trie Construction

This method is straightforward. I iterate over the words and, for each word, over its characters. I walk the trie character by character, creating nodes when necessary. The last node reached corresponds to the last character of the word, so it is marked with IsWord = true.

for word in words.txt
  for character in word
    // traverse trie and set to current node
    if (node does not exist for this character)
      // create node and mark as current
  // mark current node as a word
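
Here is a minimal Go sketch of that construction loop (not the post's exact code), built against the Node type above:

// Insert adds a word to the trie, creating child nodes as needed.
func (n *Node) Insert(word string) {
  node := n
  for _, c := range word {
    if node.Children == nil {
      node.Children = make(map[string]*Node)
    }
    key := string(c)
    child, ok := node.Children[key]
    if !ok {
      child = &Node{Children: make(map[string]*Node)}
      node.Children[key] = child
    }
    node = child
  }
  node.IsWord = true // the last node corresponds to the last character
}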

Prefix Search

Search is similar to construction. Given a prefix, I walk the trie character-by-character. Once I have walked to the end of the prefix, I perform an unbounded search on the sub-trie and accumulate characters to form partial words. If I find a node marked as a word, I add the current value of the accumulator to the results list.
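
A sketch of both phases in Go (again, not the post's exact code): walk to the prefix's node, then recursively collect every word beneath it.

// WordsWithPrefix returns every word in the trie that starts with prefix.
func (n *Node) WordsWithPrefix(prefix string) []string {
  // Phase 1: walk the trie to the node for the prefix
  node := n
  for _, c := range prefix {
    child, ok := node.Children[string(c)]
    if !ok {
      return nil // no words share this prefix
    }
    node = child
  }
  // Phase 2: unbounded search of the sub-trie, accumulating characters
  var results []string
  var collect func(*Node, string)
  collect = func(node *Node, acc string) {
    if node.IsWord {
      results = append(results, acc)
    }
    for key, child := range node.Children {
      collect(child, acc+key)
    }
  }
  collect(node, prefix)
  return results
}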

L2: The Great Filter

I can now return all of the words which share a common prefix, but I'd rather have fewer suggestions. I will filter for the most common words.

Word Frequency

Transforming the Kaggle dataset into JSON makes loading it easy with json.Unmarshal.

Unmarshal

func GetWordRanks(ranksFilePath string) map[string]int {
    // Load the word -> rank map from the JSON file (errors ignored for brevity)
    ranks := make(map[string]int)
    bytes, _ := ioutil.ReadFile(ranksFilePath)
    _ = json.Unmarshal(bytes, &ranks)
    return ranks
}

Min Heap

Heaps are great for quickly returning n values in order. Using a heap makes returning the most common words simple.

func CommonWords(ranks map[string]int, words []string, n int) []string {

    h := &wordHeap{}
    heap.Init(h)

    // Push all words onto the heap; each push is O(log n)
    for _, w := range words {
        heap.Push(h, &word{
            text: w,
            rank: ranks[w],
        })
    }

    heapLen := h.Len()
    if heapLen < n {
        n = heapLen
    }

    // Pop the top n words off the heap, most common (lowest rank) first
    results := make([]string, n)
    for i := 0; i < n; i++ {
        results[i] = heap.Pop(h).(*word).text
    }
    return results
}
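
The word and wordHeap types aren't shown above. One plausible implementation, assumed here, is a min-heap of *word values ordered by rank, so the most common word (lowest rank) sits on top:

type word struct {
    text string
    rank int
}

// wordHeap implements container/heap's interface as a min-heap by rank.
type wordHeap []*word

func (h wordHeap) Len() int           { return len(h) }
func (h wordHeap) Less(i, j int) bool { return h[i].rank < h[j].rank }
func (h wordHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *wordHeap) Push(x interface{}) { *h = append(*h, x.(*word)) }

func (h *wordHeap) Pop() interface{} {
    old := *h
    n := len(old)
    item := old[n-1]
    *h = old[:n-1]
    return item
}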

L3: Fuzzy Finding For Fat-Fingered Friends

I will attempt to improve the search by fuzzing prefixes. I can do this by defining a set of boundary characters for each key on the keyboard and searching the trie with those boundary characters. For example, the boundary characters of "a" on a qwerty keyboard are {"q", "w", "s", "z"}. These are the characters that are likely to be accidentally typed instead of "a".
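
A sketch of how those boundary sets might be encoded (assumed here, with only a few QWERTY keys shown):

// boundaries maps each key to the neighboring keys most likely to be
// typed by accident (QWERTY layout; only a few entries shown).
var boundaries = map[string][]string{
  "a": {"q", "w", "s", "z"},
  "s": {"a", "w", "e", "d", "x", "z"},
  "k": {"j", "i", "o", "l", "m"},
}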

The fuzzy trie search is done, for now, in two steps.

Setup the DFS

I search the trie for each permutation of the given prefix. I permute each character with its boundary characters. Once I reach a permutation that exists in the trie, I add it to the stack for the depth first search.
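
A sketch of the permutation step, assuming the boundaries map above; every permuted prefix that actually walks to a node in the trie becomes a starting point on the DFS stack:

// permutations returns every variant of prefix in which each character is
// either itself or one of its boundary characters. A sketch; the real code
// may generate and prune these while walking the trie.
func permutations(prefix string) []string {
  variants := []string{""}
  for _, c := range prefix {
    options := append([]string{string(c)}, boundaries[string(c)]...)
    var next []string
    for _, v := range variants {
      for _, o := range options {
        next = append(next, v+o)
      }
    }
    variants = next
  }
  return variants
}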

DFS

The DFS is the exact same as the standard prefix search from before. I search through the trie and add each word I encounter to the results list.

Why isn't this good?

Efficiency aside, this isn't good because permuting the prefix triggers searches for prefixes whose words are often more common than the words for the given prefix.

Example: "cat". This should return categories, catalog, category, ... in that order. However, "date" is the top result because "dat" is a prefix permutation of "cat" and "date" is more common than all of the results with the prefix "cat".

This fuzzing method assumes that typing accuracy is evenly distributed amongst the intended character and its boundary characters. In practice, people's typing accuracy is much higher than ~20% (I sure hope it is). I need to account for a permutation's edit distance to the prefix in the results ranking.

L4: Considering Edit Distances

In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. source

In this context, I am only concerned with the number of differing characters -- more precisely, the Hamming distance.
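
Since a permuted prefix always has the same length as the input prefix, this is just a count of mismatched positions:

// hammingDistance counts the positions at which two equal-length strings
// differ. Assumes ASCII input, so byte indexing is safe.
func hammingDistance(a, b string) int {
  d := 0
  for i := 0; i < len(a) && i < len(b); i++ {
    if a[i] != b[i] {
      d++
    }
  }
  return d
}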

Fuzzy Searching with Edit Distance and Commonality Ranking Filters

I've added the edit distance filter to the "Setup the DFS" step of the fuzzy trie search. Doing this also improves performance because I will not search down trie paths for words that would later be filtered out.

The rule is simple, yet effective. I only search for words whose permuted prefix has an edit distance less than or equal to 20% of the input prefix's length (rounded up). For example, if your prefix is 6 characters long, the search will include results whose prefix differs by up to 2 characters. In code:

limit := int(math.Ceil(0.2 * float64(len(inputPrefix))))
if editDistance(inputPrefix, permutedPrefix) <= limit {
  // add to stack for the DFS
}

This assumes ~80% typing accuracy which seems low for physical keyboards but fair for touch keyboards. Also, keeping the assumed typing accuracy low increases the fuzziness of the search which also helps with spelling mistakes.

L5: Ordered Set as a Finite State Automaton

Read the Burntsushi FSA/FST post! It's magic.

Implementing a finite state automaton backed ordered set for prefix searching is a bit beyond the scope of this exercise. However, they are absolutely worth discussing! I've been using a trie, which does not combine suffixes. An FSA combines suffixes, so the overall data structure would be much smaller. We would also be able to perform better fuzzy searches. Here is the post's excerpt about fuzzy searching an FSA using Levenshtein automata. Note that "Levenshtein distance" and "edit distance" are the same thing, but I only considered the Hamming distance in my implementation.

Given a collection of strings, one really useful operation is fuzzy searching. There are many different types of fuzzy searching, but we will only cover one here: fuzzy search by Levenshtein or “edit” distance.

For our purposes, the question we’d like to answer is: does this key match any of the keys in the set up to an Levenshtein distance of n? Certainly, we could implement an algorithm to compute Levenshtein distance between two strings, iterate over the keys in one of our FST based ordered sets, and run the algorithm for each key. If the distance between the query and the key is <= n, then we emit it as a match. Otherwise, we skip the key and move on to the next.

The problem with this approach is that it is incredibly slow. It requires running an effectively quadratic algorithm over every key in the set. Not good.

It turns out that for our specific use case, we can do a lot better than that. Namely, in our case, one of the strings in every distance computation for a single search is fixed; the query remains the same. Given these conditions, and a known distance threshold, we can build an automaton that recognizes all strings that match our query.

Why is that useful? Well, our FST based ordered set is an automaton! That means answering the question with our ordered set is no different than taking the intersection of two automatons. This is really fast.