cinchapi / concourse

Distributed database warehouse for transactions, search and analytics across time.
http://concoursedb.com
Apache License 2.0
314 stars 235 forks source link

Use Finite State Transducers for more efficient memory representation of search strings #519

Open jtnelson opened 2 months ago

jtnelson commented 2 months ago

An FST may drastically improve memory efficiency at the slight cost of search performance (going from O(1) to O(m)), but the reduction in garbage collection may balance it out .

An FST can be used to map a String to a value.

jtnelson commented 2 months ago

import java.util.HashMap;
import java.util.Map;

/**
 * A node in the Finite State Transducer (FST) that supports value mapping.
 */
class FSTNode<T> {
    Map<Character, FSTNode<T>> edges = new HashMap<>();
    T value = null;

    /**
     * Adds a character transition to this node.
     * 
     * @param ch The character to add.
     * @return The node corresponding to the added character.
     */
    FSTNode<T> addEdge(char ch) {
        return edges.computeIfAbsent(ch, c -> new FSTNode<>());
    }
}

/**
 * A Finite State Transducer (FST) for storing and mapping strings to values.
 */
public class FST<T> {
    private final FSTNode<T> root;

    /**
     * Constructs an empty FST.
     */
    public FST() {
        root = new FSTNode<>();
    }

    /**
     * Inserts a word and its associated value into the FST.
     * 
     * @param word The word to insert.
     * @param value The value to associate with the word.
     */
    public void insert(String word, T value) {
        FSTNode<T> current = root;
        for (char ch : word.toCharArray()) {
            current = current.addEdge(ch);
        }
        current.value = value;
    }

    /**
     * Searches for a word in the FST and returns its associated value.
     * 
     * @param word The word to search for.
     * @return The value associated with the word, or null if the word is not found.
     */
    public T search(String word) {
        FSTNode<T> current = root;
        for (char ch : word.toCharArray()) {
            current = current.edges.get(ch);
            if (current == null) {
                return null;
            }
        }
        return current.value;
    }

    /**
     * Searches for a prefix in the FST.
     * 
     * @param prefix The prefix to search for.
     * @return True if the prefix is found, false otherwise.
     */
    public boolean startsWith(String prefix) {
        FSTNode<T> current = root;
        for (char ch : prefix.toCharArray()) {
            current = current.edges.get(ch);
            if (current == null) {
                return false;
            }
        }
        return true;
    }

    /**
     * A main method to demonstrate the usage of the FST.
     */
    public static void main(String[] args) {
        FST<Integer> fst = new FST<>();

        // Insert words and their associated values into the FST
        fst.insert("apple", 1);
        fst.insert("app", 2);
        fst.insert("apply", 3);
        fst.insert("apt", 4);

        // Search for words in the FST and print their associated values
        System.out.println(fst.search("apple")); // 1
        System.out.println(fst.search("app")); // 2
        System.out.println(fst.search("apply")); // 3
        System.out.println(fst.search("apt")); // 4
        System.out.println(fst.search("ape")); // null

        // Check for prefixes in the FST
        System.out.println(fst.startsWith("ap")); // true
        System.out.println(fst.startsWith("apl")); // true
        System.out.println(fst.startsWith("ape")); // false
    }
}
jtnelson commented 2 months ago

Even if all the substrings are already broken down and added to the FST one by one, the FST will still use shared nodes. This is one of the fundamental advantages of the FST data structure: it inherently shares common prefixes (and potentially suffixes) among different terms, which allows for significant memory savings.

How Shared Nodes Work in FST

When you add substrings to an FST, the following happens:

1.  Shared Prefixes: If a new substring shares a prefix with an existing substring, the FST reuses the existing nodes for the shared prefix.
2.  New Nodes: Only the non-overlapping part of the substring (i.e., the unique suffix) requires new nodes to be created.

Example: Adding Substrings One by One

Consider the term “apple”:

•   All substrings of “apple”: “a”, “ap”, “app”, “appl”, “apple”, “p”, “pp”, “ppl”, “pple”, “l”, “le”, “e”.

Adding Substrings to FST

1.  Add “a”:
•   Create nodes: a
2.  Add “ap”:
•   Reuse node ‘a’.
•   Create new node: p
3.  Add “app”:
•   Reuse nodes: a -> p.
•   Create new node: p
4.  Add “appl”:
•   Reuse nodes: a -> p -> p.
•   Create new node: l
5.  Add “apple”:
•   Reuse nodes: a -> p -> p -> l.
•   Create new node: e
6.  Add “p”:
•   Create new node: p
7.  Add “pp”:
•   Reuse node: p.
•   Create new node: p
8.  Add “ppl”:
•   Reuse nodes: p -> p.
•   Create new node: l
9.  Add “pple”:
•   Reuse nodes: p -> p -> l.
•   Create new node: e
10. Add “l”:
•   Create new node: l
11. Add “le”:
•   Reuse node: l.
•   Create new node: e
12. Add “e”:
•   Create new node: e

Final FST Structure

The resulting FST will look like this:

a
|
p
|
p
/ \ l p e l
  e
p   l
 \ / \
  p   e
   \ / \
    l   e
     \
      e

Memory Usage Comparison

•   HashMap: Stores each substring independently, resulting in high memory usage due to redundant storage and overhead for each entry.
•   FST: Reuses nodes for shared prefixes, significantly reducing the number of nodes and edges required.

Summary

Even when substrings are added one by one, the FST still benefits from shared nodes due to its structure. This results in substantial memory savings compared to a HashMap, where each substring must be stored independently. The FST’s ability to share nodes for common prefixes makes it highly efficient for storing large sets of related strings, such as all substrings of a given term.

jtnelson commented 2 months ago

https://github.com/yintaoxue/read-open-source-code/blob/master/lucene-4.7.2/src/org/apache/lucene/util/fst/FST.java

As used in Lucene