devongovett / regexgen

Generate regular expressions that match a set of strings
https://runkit.com/npm/regexgen
3.34k stars 100 forks source link

regexgen bug in minimize() / output depends on order of Trie#addAll input #31

Open mathiasbynens opened 3 years ago

mathiasbynens commented 3 years ago

See https://github.com/mathiasbynens/emoji-test-regex-pattern/issues/1.

There seems to be an issue where regexgen produces incorrect output. The exact output depends on the order of the Trie#addAll input, which seems like a bug in and of itself.

Test case:

const assert = require('assert');

const Trie = require('regexgen').Trie;
const trie = new Trie();

const strings = [
  '\u{1F9D1}\u{1F3FF}\u200D\u2695\uFE0F',
  '\u{1F468}\u{1F3FE}\u200D\u2708\uFE0F',
  '\u{1F468}\u{1F3FF}\u200D\u2708\uFE0F',
  '\u{1F469}\u200D\u2764\u200D\u{1F469}',
  '\u{1F9D1}\u{1F3FF}\u200D\u2695',
  '\u{1F468}\u{1F3FE}\u200D\u2708',
  '\u{1F468}\u{1F3FF}\u200D\u2708',
  '\u{1F468}\u200D\u2708',
  '\u{1F469}\u200D\u2708',
  '\u2708\uFE0F',
  '\u{1F469}',
  '\u2708',
];

const STRINGS_THAT_MUST_BE_INCLUDED = [
  '\u{1F9D1}\u200D\u2708\uFE0F', // πŸ§‘β€βœˆοΈ E12.1 pilot
  '\u{1F468}\u200D\u2708\uFE0F', // πŸ‘¨β€βœˆοΈ E4.0 man pilot
  '\u{1F469}\u200D\u2708\uFE0F', // πŸ‘©β€βœˆοΈ E4.0 woman pilot
];

strings.push(...STRINGS_THAT_MUST_BE_INCLUDED);

// Uncommenting this results in regexgen generating a different pattern
// that passes the tests below, but is wrong in a different way:
//strings.sort();

trie.addAll(strings);
const pattern = trie.toString();
console.log(pattern);

const re = new RegExp(pattern, 'g');
assert(strings.includes('\u{1F9D1}\u200D\u2708\uFE0F')); // πŸ§‘β€βœˆοΈ E12.1 pilot
assert(strings.includes('\u{1F468}\u200D\u2708\uFE0F')); // πŸ‘¨β€βœˆοΈ E4.0 man pilot
assert(strings.includes('\u{1F469}\u200D\u2708\uFE0F')); // πŸ‘©β€βœˆοΈ E4.0 woman pilot
const text = `
\u{1F9D1}\u200D\u2708\uFE0F  # E12.1 pilot
\u{1F468}\u200D\u2708\uFE0F  # E4.0 man pilot
\u{1F469}\u200D\u2708\uFE0F  # E4.0 woman pilot
`;
console.log('3 pilots expected', text.match(re));
// β†’ expected: 3 pilots expected [ 'πŸ§‘β€βœˆοΈ', 'πŸ‘¨β€βœˆοΈ', 'πŸ‘©β€βœˆοΈ' ]
// β†’ actual: 3 pilots expected [ 'πŸ§‘β€βœˆοΈ', 'πŸ‘¨β€βœˆοΈ', 'πŸ‘©', '✈️' ]

Uncomment the strings.sort() line results in a different pattern (which appears to work correctly for this particular case, but it actually still doesn't match all expected strings). Here are the patterns regexgen generates for both cases:

# Without sort(), not working correctly:
\uD83D\uDC69(?:\u200D\u2764\u200D\uD83D\uDC69)?|\uD83E\uDDD1(?:\uD83C\uDFFF\u200D\u2695\uFE0F?|\u200D\u2708\uFE0F)|(?:\uD83D\uDC68(?:\uD83C[\uDFFE\uDFFF])?\u200D|\uD83D\uDC69\u200D)?\u2708\uFE0F?
# With sort(), seemingly working correctly:
\uD83D\uDC69\u200D\u2764\u200D\uD83D\uDC69|(?:\uD83E\uDDD1\uD83C\uDFFF\u200D\u2695|\uD83D\uDC68(?:\uD83C[\uDFFE\uDFFF])?\u200D\u2708|\uD83D\uDC69\u200D\u2708|\u2708)\uFE0F?|\uD83E\uDDD1\u200D\u2708\uFE0F|\uD83D\uDC69

I think there's a bug in regexgen:

mathiasbynens commented 3 years ago

Turns out that .sort()ing before passing to regexgen doesn't actually fix the issue, it just moves it around (other strings are now no longer matched). So ignore that part of my post β€” it’s not a workaround that fixes the problem in general (even though it helps in this particular test case).

mathiasbynens commented 3 years ago

Simpler test case with ASCII characters only:

const assert = require('assert');

const Trie = require('regexgen').Trie;
const trie = new Trie();

const STRING_TO_MATCH = 'FBCD';
const strings = [
  'AGBHD',
  'EIBCD',
  'EGBCD',
  'FBJBF',
  'AGBH',
  'EIBC',
  'EGBC',
  'EBC',
  'FBC',
  'CD',
  'F',
  'C',
  'ABCD',
  'EBCD',
  STRING_TO_MATCH,
];

// Uncommenting this results in regexgen generating a different pattern
// that passes the tests below (but still produces incorrect results in other cases):
//strings.sort();

trie.addAll(strings);
const pattern = trie.toString();
//console.log(pattern);
// β†’ 'F(?:BJBF)?|(?:E[GI]?B|FB)?CD?|A(?:GBHD?|BCD)'
// Or with sort() first:
// β†’ 'FBJBF|(?:E[GI]?B|FB)?CD|A(?:GBHD?|BCD)|(?:E[GI]?B|FB)?C|F'

const re = new RegExp(pattern, 'g');
assert(strings.includes(STRING_TO_MATCH));

// Verify that every string we told regexgen to match, is actually
// matched by the generated pattern.
for (const string of strings) {
  const actual = string.match(re)[0];
  assert(string === actual);
}
mathiasbynens commented 3 years ago

This patch results in correct output:

diff --git a/src/trie.js b/src/trie.js
index 8e363e1..f938633 100644
--- a/src/trie.js
+++ b/src/trie.js
@@ -42,7 +42,7 @@ class Trie {
    * @return {State} - the starting state of the minimal DFA
    */
   minimize() {
-    return minimize(this.root);
+    return this.root;
   }

   /**

So (unsurprisingly) the bug is somewhere in minimize().

mathiasbynens commented 3 years ago

I patched regexgen to allow for easier inspection of its internal state. Hereβ€˜s the state for the above test case.

Without sort (incorrect):

A => G => B => H
A => G => B => H => D
A => B => C => D
E => I => B => C
E => I => B => C => D
E => G => B => C
E => G => B => C => D
E => B => C
E => B => C => D
F
F => B => J => B => F
F => B => C
F => B => C => D
C
C => D

==>
F(?:BJBF)?|(?:E[GI]?B|FB)?CD?|A(?:GBHD?|BCD)

With sort (which in the case of this particular test case, gives 100% correct results, matching all the strings completely β€” so we can use it as a reference):

A => B => C => D
A => G => B => H
A => G => B => H => D
C
C => D
E => B => C
E => B => C => D
E => G => B => C
E => G => B => C => D
E => I => B => C
E => I => B => C => D
F
F => B => C
F => B => C => D
F => B => J => B => F

==>
FBJBF|(?:E[GI]?B|FB)?CD|A(?:GBHD?|BCD)|(?:E[GI]?B|FB)?C|F

F => B => C => D is the problematic one in this test case β€” 'FBCD' is the string that does not get matched, despite being included in the input to regexgen. In the broken output, it doesn't get matched because F(?:BJBF)? appears first in the pattern, and so only the F is matched. Within the generated pattern, it should never happen that something on the left matches a prefix of something that's further on the right, because then the latter can never match. It seems regexgen should either