BurntSushi / fst

Represent large sets and maps compactly with finite state transducers.
The Unlicense
1.77k stars 125 forks source link

Fst made of fixed length byte strings: bounded hamming distance query #51

Closed pombredanne closed 4 years ago

pombredanne commented 6 years ago

I have this problem: 3B k/v where the key are fixed length bytes strings either 32, 64 or 128 bits (in a given fst all keys have the same length though).

I need to find all values with a key that is within a bounded (small) bit hamming distance of existing keys.

I was thinking of a simple hamming distance query approach based on "Detecting Near-Duplicates for Web Crawling" at http://gurmeet.net/computer-science/research-papers/

So say I want a hamming distance of 3, and my keys are 32 bits. At query time, I could build a query levensthein fst that would allow any key with up to 3 byte to be changed (not added/removed) to still match, e.g. this would match all the keys that that are within hamming distance 3 or less.

This should work, but is this a proper use of the levensthein automaton? e.g. would there be a better construction for hamming distance than levensthein?

Since this would be raw bytes, this might not work well with the unicode support of levensthein?

Would the fact that the first three bytes can be anything with the last byte still match force a full sequential scan of the fst?

@fulmicoton ping too as you seem to grok levensthein automatons quite well

fulmicoton commented 6 years ago

There might exist a data size/distribution where using an fst and an automaton makes sense.

But you should not use a levenshtein automaton. Hamming is also a regular language and building it's DFA is much simpler. It will have O(Max distance * byte Len)

pombredanne commented 6 years ago

@fulmicoton Thank you. Would you have any pointer to hamming automaton constructions?

There might exist a data size/distribution where using an fst and an automaton makes sense.

I am still pondering this. The bytes are mostly random (these are LSHs), and in the case of a 32 bits range, most of the ~4 Billion values will be present. So if this so, a bitmap index and might be simpler, smaller and faster? In the case of 64 bits or longer keys with a sparse, mostly random distribution, then the fst could likely work though two or more bitmaps could be used too. sigh

BurntSushi commented 6 years ago

@pombredanne Thanks for filing this ticket! Unfortunately, I don't have any answers off the top of my head. (Everything you and @fulmicoton have said seems reasonable to me.) I think my advice to you would be to just go out and experiment with your ideas and see what works well. It would be great if you reported back on what you find. :-)

fulmicoton commented 6 years ago

I didn't test it properly but the automaton would look something like this:

extern crate fst;
use fst::Automaton;

const MAX_DISTANCE: usize = 4; // in fact max + 1

fn state_id(offset: usize, distance: usize) -> usize {
    offset * MAX_DISTANCE + distance
}

struct HammingDFA {
    transitions: Vec<[usize; 256]>,
    first_accepting_state: usize,
    sink_state_id: usize,
}

impl HammingDFA {
    fn build(target: &[u8]) -> HammingDFA {
        let sink_state_id = state_id(target.len() + 1, 0);
        let mut transitions = vec![[sink_state_id; 256]; sink_state_id + 1];
        for (offset, target_byte) in target.iter().cloned().enumerate() {
            for input_byte in 0..256 {
                let distance = (target_byte ^ (input_byte as u8)).count_ones() as usize;
                for prev_distance in 0..3 {
                    let previous_state = state_id(offset, prev_distance);
                    let new_distance: usize = prev_distance + distance;
                    let dest_state =
                        if new_distance >= MAX_DISTANCE {
                            sink_state_id
                        } else {
                            state_id(offset + 1, new_distance)
                        };
                    transitions[previous_state][input_byte] = dest_state;
                }
            }
        }
        HammingDFA {
            transitions,
            first_accepting_state: state_id(target.len(), 0),
            sink_state_id,
        }
    }
}

impl Automaton for HammingDFA {
    type State = usize;

    fn start(&self) -> Self::State {
        0
    }

    fn is_match(&self, state: &Self::State) -> bool {
        (*state >= self.first_accepting_state) && (*state < self.first_accepting_state + MAX_DISTANCE)
    }

    fn accept(&self, state: &Self::State, byte: u8) -> Self::State {
        self.transitions[*state][byte as usize]
    }

    fn can_match(&self, state: &Self::State) -> bool {
        *state != self.sink_state_id
    }
}

If I didn't mess it up to much, you can intersect it with your fst. Construction should be super fast. Intersection should be much faster than with a levenshtein automaton, as it will cut branches as soon as as you reach a hamming distance > 3.

pombredanne commented 6 years ago

@BurntSushi I will for sure report here and this is all for an open source project of mine... :) If I am falling in love with automata, that's all your fault in the first place so I will keep you posted.

@fulmicoton Thank you very much! this should give me enough explosive fuel to start my explorations

fulmicoton commented 6 years ago

@pombredanne Sure. Let me know if you need info about the code, or it is buggy and you need assistance in debugging it. To be clear, this automaton will be way faster then using levenshtein automaton and you do not have to do any kind of postprocessing. It outputs stuff at a hamming distance less or equal to 3. (You can also add a method to output the actual hamming distance if you want.)

pombredanne commented 6 years ago

@fulmicoton I will simmer through this over the holidays. I am a pythonista, not yet a rustacean ;)

BurntSushi commented 4 years ago

Closing this as stale. @pombredanne I hope you were successful in your endeavor!