Open mathiasbynens opened 3 years ago
Turns out that .sort()ing before passing to regexgen doesn't actually fix the issue; it just moves it around (other strings are now no longer matched). So ignore that part of my post: it's not a workaround that fixes the problem in general (even though it helps in this particular test case).
Simpler test case with ASCII characters only:
const assert = require('assert');
const Trie = require('regexgen').Trie;
const trie = new Trie();
const STRING_TO_MATCH = 'FBCD';
const strings = [
'AGBHD',
'EIBCD',
'EGBCD',
'FBJBF',
'AGBH',
'EIBC',
'EGBC',
'EBC',
'FBC',
'CD',
'F',
'C',
'ABCD',
'EBCD',
STRING_TO_MATCH,
];
// Uncommenting this results in regexgen generating a different pattern
// that passes the tests below (but still produces incorrect results in other cases):
//strings.sort();
trie.addAll(strings);
const pattern = trie.toString();
//console.log(pattern);
// → 'F(?:BJBF)?|(?:E[GI]?B|FB)?CD?|A(?:GBHD?|BCD)'
// Or with sort() first:
// → 'FBJBF|(?:E[GI]?B|FB)?CD|A(?:GBHD?|BCD)|(?:E[GI]?B|FB)?C|F'
const re = new RegExp(pattern, 'g');
assert(strings.includes(STRING_TO_MATCH));
// Verify that every string we told regexgen to match, is actually
// matched by the generated pattern.
for (const string of strings) {
const actual = string.match(re)[0];
assert(string === actual);
}
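The loop above stops at the first string that fails. As a rough sketch, a variant that collects every mismatch instead could look like the following. `findMismatches` is a hypothetical helper name (not part of regexgen), and the two patterns are hard-coded from the comments above so the snippet runs without regexgen installed.

```javascript
// Hypothetical helper, not part of regexgen: returns every input string
// whose first match against `pattern` is not the whole string.
function findMismatches(pattern, strings) {
  const re = new RegExp(pattern); // no 'g' flag needed; only the first match matters
  const mismatches = [];
  for (const string of strings) {
    const match = string.match(re);
    const actual = match === null ? null : match[0];
    if (actual !== string) {
      mismatches.push({ string, actual });
    }
  }
  return mismatches;
}

const inputStrings = [
  'AGBHD', 'EIBCD', 'EGBCD', 'FBJBF', 'AGBH', 'EIBC', 'EGBC', 'EBC',
  'FBC', 'CD', 'F', 'C', 'ABCD', 'EBCD', 'FBCD',
];

// Patterns copied from the comments in the test case above.
const unsortedPattern = 'F(?:BJBF)?|(?:E[GI]?B|FB)?CD?|A(?:GBHD?|BCD)';
const sortedPattern = 'FBJBF|(?:E[GI]?B|FB)?CD|A(?:GBHD?|BCD)|(?:E[GI]?B|FB)?C|F';

const unsortedMismatches = findMismatches(unsortedPattern, inputStrings);
const sortedMismatches = findMismatches(sortedPattern, inputStrings);

console.log('without sort:', unsortedMismatches);
console.log('with sort:', sortedMismatches);
```

Because it does not throw on the first failure, this reports every affected string in one run, which makes it easier to compare the two input orders.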
This patch results in correct output:
diff --git a/src/trie.js b/src/trie.js
index 8e363e1..f938633 100644
--- a/src/trie.js
+++ b/src/trie.js
@@ -42,7 +42,7 @@ class Trie {
* @return {State} - the starting state of the minimal DFA
*/
minimize() {
- return minimize(this.root);
+ return this.root;
}
/**
So (unsurprisingly) the bug is somewhere in minimize().
I patched regexgen to allow for easier inspection of its internal state. Here's the state for the above test case.
Without sort (incorrect):
A => G => B => H
A => G => B => H => D
A => B => C => D
E => I => B => C
E => I => B => C => D
E => G => B => C
E => G => B => C => D
E => B => C
E => B => C => D
F
F => B => J => B => F
F => B => C
F => B => C => D
C
C => D
==>
F(?:BJBF)?|(?:E[GI]?B|FB)?CD?|A(?:GBHD?|BCD)
With sort (which, for this particular test case, gives 100% correct results, matching all the strings completely, so we can use it as a reference):
A => B => C => D
A => G => B => H
A => G => B => H => D
C
C => D
E => B => C
E => B => C => D
E => G => B => C
E => G => B => C => D
E => I => B => C
E => I => B => C => D
F
F => B => C
F => B => C => D
F => B => J => B => F
==>
FBJBF|(?:E[GI]?B|FB)?CD|A(?:GBHD?|BCD)|(?:E[GI]?B|FB)?C|F
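The inspection patch itself isn't shown here. As an assumption-laden sketch, listings in the same shape as the two above can be produced with a plain Map-based trie and a pre-order walk. `buildTrie` and `listPaths` are hypothetical names; this is a stand-in for illustration, not regexgen's Trie class and not the actual patch.

```javascript
// Hypothetical stand-in for the inspection patch: a plain Map-based trie,
// NOT regexgen's own Trie implementation.
function buildTrie(strings) {
  const root = { children: new Map(), accepting: false };
  for (const string of strings) {
    let node = root;
    for (const char of string) {
      if (!node.children.has(char)) {
        node.children.set(char, { children: new Map(), accepting: false });
      }
      node = node.children.get(char);
    }
    node.accepting = true; // a complete input string ends at this node
  }
  return root;
}

// Pre-order walk that records the path to every accepting node.
// Map iterates in insertion order, so the listing depends on input order.
function listPaths(node, prefix = [], out = []) {
  if (node.accepting) {
    out.push(prefix.join(' => '));
  }
  for (const [char, child] of node.children) {
    listPaths(child, [...prefix, char], out);
  }
  return out;
}

const testStrings = [
  'AGBHD', 'EIBCD', 'EGBCD', 'FBJBF', 'AGBH', 'EIBC', 'EGBC', 'EBC',
  'FBC', 'CD', 'F', 'C', 'ABCD', 'EBCD', 'FBCD',
];

const pathsWithoutSort = listPaths(buildTrie(testStrings));
const pathsWithSort = listPaths(buildTrie([...testStrings].sort()));

console.log(pathsWithoutSort.join('\n'));
console.log('---');
console.log(pathsWithSort.join('\n'));
```

Both listings contain the same 15 paths in a different order, so the unminimized trie holds the same set of strings either way.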
F => B => C => D is the problematic one in this test case: 'FBCD' is the string that does not get matched, despite being included in the input to regexgen. In the broken output, it doesn't get matched because F(?:BJBF)? appears first in the pattern, and so only the F is matched. Within the generated pattern, it should never happen that something on the left matches a prefix of something that's further on the right, because then the latter can never match. It seems regexgen should either
[…] (src/minimize.js), or
[…] (src/regex.js), or
[…]
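The shadowing effect described above can be reproduced without regexgen. The two patterns below are simplified illustrations written for this example (they are not regexgen output); they only show that JavaScript tries alternatives left to right and keeps the first one that succeeds.

```javascript
// Simplified, hand-written patterns for illustration only.
// On the left, F(?:BJBF)? can succeed by matching just 'F',
// which is a prefix of the 'FBCD' alternative on the right.
const shadowed = /F(?:BJBF)?|FBCD/;

// Same alternatives, with the longer literal listed first.
const reordered = /FBCD|F(?:BJBF)?/;

const shadowedResult = 'FBCD'.match(shadowed)[0];
const reorderedResult = 'FBCD'.match(reordered)[0];

console.log(shadowedResult);  // 'F'
console.log(reorderedResult); // 'FBCD'

// The left alternative still works for the string it was meant for.
console.log('FBJBF'.match(shadowed)[0]); // 'FBJBF'
```

In the first pattern the FBCD alternative is unreachable for an unanchored match starting at that F, which is the same situation as in the broken output.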
See https://github.com/mathiasbynens/emoji-test-regex-pattern/issues/1.
There seems to be an issue where regexgen produces incorrect output. The exact output depends on the order of the Trie#addAll input, which seems like a bug in and of itself.
Test case:
Uncommenting the strings.sort() line results in a different pattern (which appears to work correctly for this particular case, but it actually still doesn't match all expected strings). Here are the patterns regexgen generates for both cases:
I think there's a bug in regexgen: