Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License
2.57k stars 109 forks source link

Fuzzy matching with regex expressions #308

Closed ChangqingW closed 10 months ago

ChangqingW commented 10 months ago

Compiled from the latest git commit, echo 'firstxxxxxcond' | ug -Z4 'first.+second' returned no match for me, presumably because ugrep is forcing the first character of the second part of the regex expression to be an match? Is there any way to get around this without using '?' excessively at the start of the second part of the expression?

genivia-inc commented 10 months ago

I can see that this might surprise you. It surprised me a bit too, initially. But this pattern has many possibilities to fuzzy match first.+second with the wildcard sequence .+. The way it approaches this match is to first match first.+ with all of firstxxxxxcond like any matcher, then it can't match second so the fuzzy-matching kicks in. Backtracking takes 6 more steps to find the approximate match of second that is 6 characters long. Ideally, the minimal edit distance is 2 for this pattern, but since it has a wildcard .+ it gobbles up more text which it has to "give up" to try to fuzzy match the rest. That's why -Z6 succeeds to match but -Z4 does not.

Perhaps a tweak to the fuzzy-match algorithm can be made to not count backtracking steps to undo exact matches in a DFA back edge (a loop) like the .+. It could be done by "undoing" each back edge iteration, until the rest fuzzy matches. But that seems computationally expensive, since we need to stack the state of the DFA for all iterations to be able to undo, not just the number of edit distances to backtrack/undo (e.g. not just 2 or 4).

Fuzzy matching works best when the patterns are a bit more specific than the wildcards based on . such as .+ and .*. A single . or a few . here and there should be fine, but those will affect the algorithm in surprising ways.

ChangqingW commented 10 months ago

Thanks for the explanation, time / memory complexity with .* do seem very problematic. In my case we are working with searching lines with "words" in specific orders, I wonder if this is possible with ugrep, or if you are aware of tools that might support this? E.g. searching for lines with first, second, third, in that order, with an edit-distance for each word instead of for all words together.

I imagine this is possible by having ugrep output the match position, searching for the first word, piping the match to sed / awk to trim until the end of the match, and pipe to ugrep again for subsequent words. This is a bit convoluted but still acceptable.

genivia-inc commented 10 months ago

Try ug -% -Z4 "first second third" to fuzzy-find matches with a Boolean pattern (space is AND). But note that these words are found in any order on a line. The match positions with -k or -b can help identify the order, together with -u to ungroup the matches from the line. Formatted output can help to output a format that a second tool/script then uses to check the order of the matches to filter out unordered matches, e.g. --format="(%k)%O%~".

genivia-inc commented 10 months ago

Closing this for now. Feel free to reopen if you have further questions.