anusarati / strprox

Apache License 2.0
3 stars 2 forks

Terrible worst case performance, almost halting #1

Open planetoryd opened 8 months ago

planetoryd commented 8 months ago

I tested on a dataset with 1442910 words.

#[test]
fn test_worse_case() -> Result<()> {
    let case = "bring more land under cultivation";
    let conf = crate::config::get_config();
    let db_path = PathBuf::from(conf.data_path.clone());
    let db = offdict::<Strprox>::open_db(db_path)?;
    println!("testing");
    db.search(case, 3, false)?;
    Ok(())
}

I ran it with an input this long and it wouldn't finish.

https://github.com/planetoryd/offdict/tree/strprox

I made unrelated changes to your code. https://github.com/planetoryd/strprox/commits/master/

planetoryd commented 8 months ago

Also you'd be interested in https://github.com/Validark/DynSDT

planetoryd commented 8 months ago

I fixed it with https://github.com/planetoryd/strprox/commit/3a0321bf5ad9d8120d3aeda03310e1da0ed79971

not graceful tho

anusarati commented 8 months ago

Hi, thanks for letting me know. I'm a little busy right now but I'll check it out later.

anusarati commented 8 months ago

@planetoryd I'm checking to see if there's something wrong with my implementation compared to the paper. However, the paper does not have experimental results for queries with more than 10 characters, and it suggests segmenting queries into words in combination with techniques from other papers.

In the meantime, I'd like to merge your commits upstream if you don't mind.

planetoryd commented 8 months ago

Is it possible to implement this algorithm with https://crates.io/crates/fst? IDK.

anusarati commented 8 months ago

Looks interesting. I'm looking into it.

planetoryd commented 8 months ago

image

image

Is this a feature, or an inherent problem of this algorithm? I stop the search when the threshold gets too big, and that's the result, which is embarrassing.

Perhaps I am not using it correctly. I should probably also introduce other algorithms to nullify its downsides, like a pruning radix trie for exact matches before fuzzy matching is needed.
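
A minimal sketch of that "exact first, fuzzy only when needed" idea (purely illustrative, nothing like it exists in the crate; fuzzy_topk is a made-up stand-in for whatever fuzzy search gets plugged in):

fn exact_then_fuzzy(
    sorted_words: &[String],
    query: &str,
    k: usize,
    fuzzy_topk: impl Fn(&str, usize) -> Vec<String>,
) -> Vec<String> {
    // All exact prefix matches form a contiguous range in a sorted list.
    let start = sorted_words.partition_point(|w| w.as_str() < query);
    let mut out: Vec<String> = sorted_words[start..]
        .iter()
        .take_while(|w| w.starts_with(query))
        .take(k)
        .cloned()
        .collect();
    if out.len() < k {
        // Not enough exact matches: fall back to the fuzzy search and append
        // anything not already collected.
        for w in fuzzy_topk(query, k) {
            if !out.contains(&w) {
                out.push(w);
                if out.len() == k {
                    break;
                }
            }
        }
    }
    out
}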

planetoryd commented 8 months ago

image

image

with more lenient caps.

planetoryd commented 8 months ago

With that kind of performance it's hard to say what the point is. I can get threshold-filtered results with FST with a Levenshtein distance of one (which is effectively the same as top-k because it's top one).

anusarati commented 8 months ago

The algorithm will always return the requested number of strings if there are enough strings in the dataset, whereas using a threshold of one may return fewer than that. In this case, it seems that it had to increase the threshold to get more strings, and if you used a threshold of 0, it would only return 1 string rather than 4.

A simple way to implement a top-k algorithm using FST is to try getting strings using Levenshtein automata, beginning with an edit distance threshold of 0 and incrementally increasing the threshold by 1 if k strings weren't found. However, reconstructing the Levenshtein automata and using the intersection operation incurs the overhead of reiterating through states. I haven't measured the performance of that approach or how it compares to my current implementation. I think the reiteration can be avoided by using a Levenshtein automaton with a threshold equal to the query length, and buffering states whose edit distances are higher than the current dynamic threshold to check later if fewer than k strings were found, but this may require a different implementation than the one currently in fst.
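
For reference, a minimal sketch of that widening loop on top of the fst crate (assuming its optional levenshtein feature is enabled; topk_by_widening is a made-up name and I haven't benchmarked this):

use fst::automaton::Levenshtein;
use fst::{Automaton, IntoStreamer, Set, Streamer};

// Return up to `k` stored strings whose prefix is within some edit distance of
// `query`, starting at distance 0 and widening until enough strings are found.
fn topk_by_widening(set: &Set<Vec<u8>>, query: &str, k: usize, max_dist: u32) -> Vec<String> {
    let mut results = Vec::new();
    for dist in 0..=max_dist {
        // `starts_with` makes the automaton accept any key that begins with a
        // string the Levenshtein automaton accepts, i.e. PED(query, key) <= dist.
        let aut = Levenshtein::new(query, dist)
            .expect("automaton too large")
            .starts_with();
        let mut stream = set.search(aut).into_stream();
        results.clear();
        while let Some(key) = stream.next() {
            results.push(String::from_utf8_lossy(key).into_owned());
            if results.len() >= k {
                return results;
            }
        }
        // Fewer than k matches at this distance: widen the threshold and rescan,
        // which is the re-iteration overhead mentioned above.
    }
    results
}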

I have been working on an algorithm using FST but without Levenshtein automata.

planetoryd commented 8 months ago

Oh, that's true. I figured that out but forgot about it.

Do you have an idea for the implementation of the cache? Some kind of LRU should suffice, but it's a little different because of the structure.

anusarati commented 8 months ago

For my current implementation of META, the matching set, threshold, and the associated query prefix can be cached so that searching can resume through autocomplete_step when the user continues typing, as long as the requested number of strings is the same. This can be done for each prefix of a query. If the user deletes part of the query and continues typing, the cache for the unchanged prefix of the query can be reused, and the cache for the prefixes of the previous query longer than that can be deleted or updated. The cache can be implemented as a vector where each index corresponds to a prefix length and each element holds a cache entry (or entries).
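
A rough shape for that cache (hypothetical, not in the crate; M stands in for the matching-set type and the names are made up):

struct PrefixCacheEntry<M> {
    matching_set: M,  // b-matching set after processing this prefix
    threshold: usize, // max top-k PED reached for this prefix
}

// One slot per query prefix length: index i holds the state after the first
// i characters of the query.
struct PrefixCache<M> {
    entries: Vec<PrefixCacheEntry<M>>,
}

impl<M> PrefixCache<M> {
    fn new() -> Self {
        Self { entries: Vec::new() }
    }

    // When the query changes, keep only the entries for the prefix shared by
    // the old and new query; everything longer is stale.
    fn truncate_to_common_prefix(&mut self, old_query: &str, new_query: &str) {
        let common = old_query
            .chars()
            .zip(new_query.chars())
            .take_while(|(a, b)| a == b)
            .count();
        self.entries.truncate(common);
    }

    // The prefix length from which searching would have to resume.
    fn resume_point(&self) -> usize {
        self.entries.len()
    }
}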

planetoryd commented 8 months ago

I would like it to return when there are results that are close and cheap to find, rather than in the top-k manner, which is expensive in practice. I'm not sure how to do it well.

// Extend the matching set one query character at a time.
while query_prefix_len <= query_chars.len() {
    debug_println!(
        "{}/{}, active matching set, {}, threshold, {}. results {}/{}",
        query_prefix_len,
        query_chars.len(),
        active_matching_set.matchings.len(),
        threshold,
        result.len(),
        requested
    );
    result.clear();
    let mut set_delta = Default::default();
    let proposed_threshold = self.autocomplete_step(
        &mut active_matching_set,
        &query_chars[..query_prefix_len],
        query_prefix_len,
        threshold,
        requested,
        &mut result,
        &mut set_delta,
    );
    // Bail out of an exploding search: if the matching set would grow too much
    // or the threshold gets too large, restart from scratch asking for only one result.
    if set_delta.matchings.len() > MAX_SET_DELTA || proposed_threshold > MAX_DISTANCE {
        if requested == 1 {
            // Already down to a single requested result; give up.
            return vec![];
        }
        query_prefix_len = 1;
        requested = 1;
        threshold = 1;
        active_matching_set = MatchingSet::<'stored, UUU, SSS>::new(&self.trie);
        continue;
    }
    threshold = proposed_threshold;
    active_matching_set.matchings.extend(set_delta.matchings);
    query_prefix_len += 1;
}

The code above is naive

planetoryd commented 8 months ago

There is usually one result within a distance of one, and in worse cases the remainder of the top-k requires searching through distances of 3 or 4.

anusarati commented 8 months ago

I made some changes in the worst-case branch. It exposes a threshold_topk method that requires the top-k results to have a PED within a certain threshold, which also lets it return fewer than k results.

I also added an algorithm using FST. It basically scans the FST, but attempts early termination of PED calculations based on the threshold, and shares computations between characters that do not match anything in the query.

The FST algo has better performance than my implementation of META when the top-k includes high PEDs (it can handle your long query test in less than 1 second).

FST, long query

fst long query

Current META implementation, long query

meta long query

However, the META implementation is faster for reasonable thresholds using threshold_topk, or when the top-k's PEDs are low (such as in the bounded peds test); when the PEDs are really high, it's faster to use the plain unindexed_autocomplete. Take a look at the bench_noise test, which uses randomly generated data from make_noise.

bench_noise

planetoryd commented 8 months ago

I made some minor changes https://github.com/planetoryd/strprox/tree/worst-case

I'd clarify that in practical use it's about returning results as soon as they are produced, which is unlike a hard-coded threshold. So it should be a streaming iterator that gradually returns more distant strings; the ideal interface is an iterator, rather than top-k or threshold.
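
Something like this hypothetical trait is what I mean (just a sketch; nothing like it exists yet and the names are made up):

// Yield completions in non-decreasing prefix edit distance, so the caller can
// stop consuming as soon as the results it already has are good enough.
trait StreamingAutocomplete {
    fn completions<'a>(&'a self, query: &'a str)
        -> Box<dyn Iterator<Item = (String, usize)> + 'a>; // (completion, PED)
}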

I also found https://www.sciencedirect.com/science/article/pii/S111001682030260X which claims to be an improvement over META

In this paper, we propose an efficient algorithm, named Depth-Based Fuzzy Auto-Completion (DFA), that avoids this redundant work and minimizes the computations to find the appropriate fuzzy auto-completion of the queries. The main idea of DFA is to store the Matching Nodes into a min heap (sorted based on the prefix edit distance with the query string) and retrieve them one by one until the target k is reached.

planetoryd commented 8 months ago

I wonder if it's possible to add other metrics like prioritizing more frequently used words.

image

Linux should be a much better answer
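
For the frequency idea above, one purely hypothetical option is to rank primarily by PED and break ties by how common the word is (none of this exists in strprox; the names are made up):

struct Candidate {
    word: String,
    ped: u32,       // prefix edit distance from the query
    frequency: u64, // e.g. occurrences in some corpus
}

// Rank primarily by PED, break ties by frequency (more common words first).
fn rank(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.sort_by(|a, b| a.ped.cmp(&b.ped).then(b.frequency.cmp(&a.frequency)));
    candidates
}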

anusarati commented 8 months ago

I also found https://www.sciencedirect.com/science/article/pii/S111001682030260X which claims to be an improvement over META

In this paper, we propose an efficient algorithm, named Depth-Based Fuzzy Auto-Completion (DFA), that avoids this redundant work and minimizes the computations to find the appropriate fuzzy auto-completion of the queries. The main idea of DFA is to store the Matching Nodes into a min heap (sorted based on the prefix edit distance with the query string) and retrieve them one by one until the target k is reached.

I had attempted to use that algorithm before META, but it didn't make sense to me. The specific example I found was that when evaluating the string "orange" against the query "oange", with a matching node <"o", 1, 0, 1, 0> for the first "o"s, the condition for descendant nodes in CHECK-DESCENDANTS is that the depth is less than or equal to 1 + 0 + 1 = 2, but the depth of the node with "a" in "orange" is 3, so it doesn't consider "orange".

planetoryd commented 8 months ago

Do you have any notes or something? I want to save some time.

I tried to read that paper, and took some notes in Typst https://github.com/planetoryd/notes/blob/master/topk_autocompletion.typ

I have some questions, and they omitted the proofs. I'm not sure how the search maintains correctness.

anusarati commented 8 months ago

Sorry, my notes aren't detailed. If you prefer, for specific questions you can reach out to me (xngzhli@gmail.com) or send me contact info for a communications platform like Discord.

The current META implementation does not use Compact Tree (Section 5), which leads to redundant node accesses. There may be a significant difference in performance if it had Compact Tree.

first_deducing collects the minimum DEDs of a node in a map to get the real ED (Lemma 1 3) to add new matchings.

active_matching_set was a misnomer on my part because the b-matching set isn't necessarily an active matching set. Due to the subset relation between a b-matching set and the next, matchings are never pruned away from it and the PED of a matching may increase for the next query prefix while the max top-k PED may not.

I'd clarify that in practical use it's about returning results as soon as they are produced, which is unlike a hard-coded threshold. So it should be a streaming iterator that gradually returns more distant strings; the ideal interface is an iterator, rather than top-k or threshold.

I'm thinking about doing this using an intersection between a nondeterministic Levenshtein automaton (to avoid the construction cost of a deterministic one) and an FST dataset, where states that have higher PEDs than the current target are buffered and checked later once there are no more candidates for the target PED.
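
For context, the core of a nondeterministic Levenshtein automaton is small. A rough sketch (not the implementation I'm working on): a state is an (offset, edits) pair, and reading one character of a candidate string expands the state set with the usual match/substitute/insert moves:

use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct NfaState {
    offset: usize, // how many query characters have been consumed
    edits: u32,    // edits spent so far
}

// Expand a set of NFA states by one character `c` of a candidate string.
fn step(states: &[NfaState], query: &[char], c: char, max_edits: u32) -> Vec<NfaState> {
    let mut next = HashSet::new();
    for &s in states {
        // Insertion: `c` is an extra character, consume it without advancing in the query.
        if s.edits < max_edits {
            next.insert(NfaState { offset: s.offset, edits: s.edits + 1 });
        }
        if s.offset < query.len() {
            if query[s.offset] == c {
                // Match: consume `c` and advance in the query at no cost.
                next.insert(NfaState { offset: s.offset + 1, edits: s.edits });
            } else if s.edits < max_edits {
                // Substitution: consume `c` and advance in the query for one edit.
                next.insert(NfaState { offset: s.offset + 1, edits: s.edits + 1 });
            }
        }
    }
    // Deletions (skipping query characters) are handled as an epsilon closure
    // over the resulting states; omitted here to keep the sketch short.
    next.into_iter().collect()
}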

anusarati commented 8 months ago

The current META implementation does not use Compact Tree (Section 5), which leads to redundant node accesses. There may be a significant difference in performance if it had Compact Tree.

Actually, the paper doesn't seem to use Compact Tree for top-k, since the complexity for FirstDeducing is analogous to MatchingBasedFramework.

planetoryd commented 8 months ago

I mean the invariants of this algorithm.

image

I don't know what ensures that it visits enough nodes, and why the nodes it doesn't visit certainly don't lead to results.

Have you heard of Lean4 or Dafny? It'd be nice to have some machine-checked proofs, so we can play with it and not break correctness.

I've been following some Rust formal verification tools

Neither is really usable yet. Prusti needs one PR to the standard library merged, and Verus is pre-alpha or something like that.

(Yeah, as noted in my notes, it takes the previous set in its entirety)

(It may be of some value to others if we keep the conversation here)

(You are certainly going to be interested in Lean4. It has a steep learning curve. I barely played with it)
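
Purely as an illustration of what a machine-checked statement could look like, here is a rough Lean 4 skeleton of one of the invariants (Lemma 4: the max top-k PED grows by at most one per added query character); the definitions are stubs and nothing is proved:

-- `b k dataset q` is an abstract stand-in for the max top-k PED of query q.
def b (k : Nat) (dataset : List String) (query : String) : Nat := sorry

theorem lemma4 (k : Nat) (dataset : List String) (q : String) (c : Char) :
    b k dataset (q.push c) ≤ b k dataset q + 1 := sorry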

anusarati commented 8 months ago

I don't have experience with proof assistants yet, although Lean does look interesting.

A b-matching set contains all matchings whose edit distance between a query prefix and the stored prefix is within b.

$\mathcal{P}(q_i, b_i) = \{\, m \mid m.i \leq i \land \text{ED}(q[1..m.i], m.n) \leq b_i \,\}$

b is equal to the max PED of the top-k results for the current query prefix by definition.

$|R_i| = k$

$\forall s \in R_i, \forall x \in \mathcal{T}, x \notin R_i \implies \text{PED}(q_i, s) \leq \text{PED}(q_i, x)$

$\exists s \in R_i, \text{PED}(q_i, s) = b_i$

$\forall s \in R_i, \text{PED}(q_i, s) \leq b_i$

The max PED for an empty query is 0 because all strings have an empty prefix that matches perfectly.

$\forall s \in \mathcal{T}, \text{PED}(q_0, s) = \text{PED}(\text{""}, \text{""}) = 0$

$\mathcal{P}(q_0, 0) = \{\langle 0, \text{root}, 0 \rangle\}$

Adding a character to the query changes the prefix edit distance of the strings whose $\text{PED} \leq b$ by 0 if the next character after the best prefix matches, and by at most 1 otherwise.

$b_i \in [b_{i-1}, b_{i-1} + 1]$ (Lemma 4)

Therefore the next b-matching set is a superset of the previous.

$\forall m \in \mathcal{P}(q_{i-1}, b_{i-1}),\ m.i \leq i-1 < i \land m.ed \leq b_{i-1} \leq b_i \leq b_{i-1} + 1$

$\therefore \forall m \in \mathcal{P}(q_{i-1}, b_{i-1}), m \in \mathcal{P}(q_i, b_i)$.

Also a b-matching set with a higher b is a superset of one with a lower b.

$\forall m \in \mathcal{P}(q_i, b - 1), m \in \mathcal{P}(q_i, b)$

This means that

$\forall m \in \mathcal{P}(q_i, b_{i-1}), m.i < i \implies m \in \mathcal{P}(q_{i-1}, b_{i-1})$

$b_i - 1 \leq b_{i-1}$, so $\forall m \in \mathcal{P}(q_i, b_i - 1), m.i < i \implies m \in \mathcal{P}(q_{i-1}, b_i - 1)$ (1.1)

$m.i = i \implies m \notin \mathcal{P}(q_{i-1}, b_i - 1)$ because $i > i-1$

Except for $\langle 0, \text{root}, 0 \rangle$, the edit distances for all matchings are calculated using matchings for previous prefixes by Lemma 3. The edit distance must be greater than or equal to the previous one by the definition of Levenshtein edit distance. All the previous matchings used in Lemma 3 for $\mathcal{P}(q_i, b_i - 1)$ are in $\mathcal{P}(q_{i-1}, b_i - 1)$ because $\forall m \in \mathcal{M}(q_{i-1}, s[1..j-1]), m.ed \leq b_i - 1 \implies m \in \mathcal{P}(q_{i-1}, b_i - 1)$, so they are used to find new matchings (1.2). This results in the entirety of $\mathcal{P}(q_i, b_i - 1)$ when the descendant traversal checks all nodes that match the new character using the inverted index and can have an edit distance within $b_i - 1$ based on the depth difference.

For the computation of $\mathcal{P}(q_i, b_i) - \mathcal{P}(q_i, b_i - 1)$ (2), a similar process is used, except not all the previous matchings may be available in $\mathcal{P}(q_i, b_i - 1)$. When the previous matching isn't in $\mathcal{P}(q_i, b_i - 1)$, the previous matching also has an edit distance of $b_i$ because edit distance does not decrease, so descendant traversal simply repeats for new matchings, and this way the entirety of $\mathcal{P}(q_i, b_i) - \mathcal{P}(q_i, b_i - 1)$ is found.

anusarati commented 5 months ago

Hi, sorry for the delay. I finally implemented non-deterministic Levenshtein automata! They can be used for the streaming behavior you wanted; however, they are still somewhat slow and memory-expensive.

NFA Noise Test

Constructing deterministic Levenshtein automata for large edit distances isn't feasible because their space complexity is exponential in the edit distance.

I think a possible solution for your use case is to segment all English strings into words, map each word to a character, and create an index using strings formed by the mapped characters. Then transform each word in the query to the nearest word in the dataset, map those words to characters the same way, and use the new string formed by those characters with error-tolerant autocomplete.
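
A rough sketch of that word-to-character mapping (purely illustrative, none of it exists in strprox; a dataset with very many distinct words would need a larger code space than this):

use std::collections::HashMap;

struct WordAlphabet {
    word_to_char: HashMap<String, char>,
    next: u32, // next unassigned code point
}

impl WordAlphabet {
    fn new() -> Self {
        // Start in a Unicode private-use area so the synthetic characters
        // cannot collide with real text.
        Self { word_to_char: HashMap::new(), next: 0xE000 }
    }

    // Assign (or reuse) a character for a word.
    fn intern(&mut self, word: &str) -> char {
        if let Some(&c) = self.word_to_char.get(word) {
            return c;
        }
        let c = char::from_u32(self.next).expect("ran out of code points");
        self.next += 1;
        self.word_to_char.insert(word.to_owned(), c);
        c
    }

    // Encode a multi-word string as one character per word; the result can be
    // indexed and queried with the error-tolerant autocomplete.
    fn encode(&mut self, s: &str) -> String {
        s.split_whitespace().map(|w| self.intern(w)).collect()
    }
}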

We could also consider heuristics or existing methods/alternative metrics like n-gram/Jaccard similarity.

planetoryd commented 5 months ago

I've been working on an alternative implementation of META. I stack FirstDeducing and SecondDeducing in a different way, which should reduce each set that is being scanned to a minimum.

image

It's still in the works.

I'm not working on it right now; I'm busy with some other personal stuff.

planetoryd commented 4 months ago

image

there are 1442910 words in the index

image

planetoryd commented 4 months ago

@anusarati

planetoryd commented 4 months ago

image

anusarati commented 4 months ago

Nice. I'm still busy with other stuff but I'll take a closer look later.

planetoryd commented 4 months ago

image

There is a minor error here. I'll be working on it.

planetoryd commented 4 months ago

image

planetoryd commented 4 months ago

image

Some parameters need tuning

planetoryd commented 4 months ago

image (release build)

It's mostly usable as my daily tool. It probably has some bugs, but now I have to work on something else.

planetoryd commented 4 months ago

I believe my implementation is mostly correct by observation. I need to iron out some bugs when I have time.