BurntSushi / aho-corasick

A fast implementation of Aho-Corasick in Rust.
The Unlicense

Question: Suggestions for handling . wildcards within query patterns #122

Closed tfwillems closed 1 year ago

tfwillems commented 1 year ago

Hi Andrew,

First and foremost, thanks for creating and maintaining this package - it truly is awesome and incredibly powerful!

I'm currently using the crate to build large automatons containing 100k - 1M DNA patterns, where each pattern is < 25 characters long. These enable me to search large genomic databases/graphs for many query patterns simultaneously.

The text I'm querying is composed solely of the ACGT alphabet, and my patterns can contain ACGT or N, where N is a special character that matches any other character.

Do you have a recommendation for how best to handle this type of wildcard (N = . in regexes) within the automaton? My current naive approach is just to enumerate all unambiguous versions of each pattern (e.g. ANG would generate AAG, ACG, AGG and ATG) and store them all in the automaton. This works fairly well when patterns can only contain a few wildcards, but the combinatorics become memory prohibitive in some of the applications I'm exploring.
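The naive enumeration described above can be sketched as follows. This is a minimal illustration, not code from the crate; `expand_wildcards` is a hypothetical helper name, and the output count grows as 4^k for k wildcards, which is exactly the combinatorial blow-up mentioned:

```rust
/// Expand a pattern containing 'N' wildcards into all unambiguous ACGT
/// patterns, e.g. "ANG" -> ["AAG", "ACG", "AGG", "ATG"].
fn expand_wildcards(pattern: &str) -> Vec<String> {
    let mut results = vec![String::new()];
    for ch in pattern.chars() {
        // An 'N' branches four ways; any other character is kept as-is.
        let choices: Vec<char> = if ch == 'N' {
            vec!['A', 'C', 'G', 'T']
        } else {
            vec![ch]
        };
        let mut next = Vec::with_capacity(results.len() * choices.len());
        for prefix in &results {
            for &c in &choices {
                let mut s = prefix.clone();
                s.push(c);
                next.push(s);
            }
        }
        results = next;
    }
    results
}

fn main() {
    let expanded = expand_wildcards("ANG");
    println!("{:?}", expanded); // ["AAG", "ACG", "AGG", "ATG"]
}
```

The expanded patterns would then all be fed into a single Aho-Corasick automaton; a pattern with k wildcards contributes 4^k entries, which is why this only stays practical for small k.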

Any suggestions for how to better handle these wildcards would be much appreciated!

Thomas

BurntSushi commented 1 year ago

Enumerating all cases is what I would suggest if the total number of patterns doesn't get too crazy. If enumerating all of them is infeasible (i.e., would result in more than low millions), then the next suggestion I have would be to search for a common prefix of the set of all enumerated patterns. So for example, if you have ATCNG, then you'd search for ATC and then run another search to confirm whether a match actually exists at that location.

This strategy only works if your prefix leads to a low false positive rate of candidates. ATC, for example, is short enough that if you're searching DNA, you'll probably have a very high false positive rate. A long prefix doesn't guarantee a low false positive rate on its own, so that's something you'll need to figure out for yourself.

And of course, it's possible to combine these approaches. For example, expand any wildcards up to a certain number, then stop and chop off a prefix and search that. This is, in effect, a heuristic, so whether it works for you depends on your data.
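The prefix-filter idea can be sketched in plain Rust. This is an illustrative outline, not the crate's API: in practice the candidate positions would come from an aho-corasick search over many literal prefixes at once, but here a naive byte scan stands in for the automaton so the verification step is self-contained. The function names are made up for the example:

```rust
/// Check whether `pattern` (which may contain 'N' wildcards) matches `text`
/// starting at byte offset `pos`.
fn matches_at(text: &[u8], pos: usize, pattern: &[u8]) -> bool {
    if pos + pattern.len() > text.len() {
        return false;
    }
    pattern
        .iter()
        .zip(&text[pos..])
        .all(|(&p, &t)| p == b'N' || p == t)
}

/// Find all start offsets where the wildcard pattern matches, by locating
/// its literal (wildcard-free) prefix and verifying each candidate.
fn find_with_prefix_filter(text: &[u8], pattern: &[u8]) -> Vec<usize> {
    // The literal prefix is everything before the first 'N'.
    let prefix_len = pattern
        .iter()
        .position(|&b| b == b'N')
        .unwrap_or(pattern.len());
    let prefix = &pattern[..prefix_len];

    let mut hits = Vec::new();
    // Stand-in for the automaton: scan every position for the prefix,
    // then verify the full pattern (including wildcards) at that spot.
    let mut i = 0;
    while i + prefix.len() <= text.len() {
        if &text[i..i + prefix.len()] == prefix && matches_at(text, i, pattern) {
            hits.push(i);
        }
        i += 1;
    }
    hits
}

fn main() {
    let text = b"GGATCAGATCTG";
    // Pattern ATCNG: literal prefix ATC, one wildcard, then G.
    let hits = find_with_prefix_filter(text, b"ATCNG");
    println!("{:?}", hits); // [2, 7] -> "ATCAG" and "ATCTG"
}
```

The false-positive concern above shows up here directly: every occurrence of the bare prefix triggers a full verification pass, so a short or common prefix means a lot of wasted verification work.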

Otherwise you might just instead build a regex. The problem is that 100k - 1M might wind up being too big for a crate like regex-automata to handle practically.
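For the regex route, the translation is mechanical: each N becomes a character class and the patterns are joined into one alternation. A small sketch of that conversion (the helper name is invented; the resulting string would be handed to the regex or regex-automata crate, which may or may not cope with 100k - 1M alternations):

```rust
/// Convert wildcard DNA patterns into a single regex alternation,
/// mapping each 'N' to the character class [ACGT].
fn patterns_to_regex(patterns: &[&str]) -> String {
    let alts: Vec<String> = patterns
        .iter()
        .map(|p| p.replace('N', "[ACGT]"))
        .collect();
    // Non-capturing group so the alternation binds correctly.
    format!("(?:{})", alts.join("|"))
}

fn main() {
    let re = patterns_to_regex(&["ANG", "TTN"]);
    println!("{}", re); // (?:A[ACGT]G|TT[ACGT])
}
```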

If none of those work, then you might want to build something bespoke. For example, copy the non-contiguous NFA implementation from this crate and adapt it to support basic wildcards.