devongovett / regexgen

Generate regular expressions that match a set of strings
https://runkit.com/npm/regexgen

Strange behaviour with Cyrillic regexes #20

Open vikanezrimaya opened 6 years ago

vikanezrimaya commented 6 years ago

Hello! I used your library via the Minta Electron app and noticed a strange lack of optimization when working with the Cyrillic alphabet.

The generated regexes are cumbersome and bulky. For example (I converted the Unicode code point escapes to Russian letters for convenience):

/ня(?:[кн]!|[кн])|Ня(?:[кн]?!|[кн])?/ (I know that detecting the substring "nya" is silly, but it is a perfect test case: short, memorable, and permutable)

This could be minimized to the following: /[Нн]я(?:[кн][!\?]|[кн]|[!\?])?/
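One way to compare the two patterns is to anchor them and run both over a small candidate list. This is only a sketch: the helper name `compareRegexes` and the candidate list (every combination of ня/Ня, an optional к or н, and an optional ! or ?) are assumptions made for illustration and are not part of regexgen.

```javascript
// Hypothetical helper: return the candidates on which two patterns disagree.
function compareRegexes(a, b, candidates) {
  // Anchor both patterns so that only full-string matches count.
  const fullA = new RegExp(`^(?:${a.source})$`);
  const fullB = new RegExp(`^(?:${b.source})$`);
  return candidates.filter((s) => fullA.test(s) !== fullB.test(s));
}

const generated = /ня(?:[кн]!|[кн])|Ня(?:[кн]?!|[кн])?/;
const proposed = /[Нн]я(?:[кн][!\?]|[кн]|[!\?])?/;

// Assumed candidate list: head + optional middle letter + optional punctuation.
const candidates = [];
for (const head of ['ня', 'Ня']) {
  for (const mid of ['', 'к', 'н']) {
    for (const tail of ['', '!', '?']) {
      candidates.push(head + mid + tail);
    }
  }
}

const differences = compareRegexes(generated, proposed, candidates);
console.log(differences);
```

On this candidate list the sketch reports eight strings that only the proposed pattern accepts: ня, ня!, and every variant ending in ?. So the proposed form is broader than the generated one, and part of the asymmetry between the two branches may come from the input word list rather than from how я is handled.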

It seems the generator does not recognize that я (Cyrillic ya) is the same letter in both the uppercase and lowercase strings (it is indeed the same, as I verified by looking at the Unicode code points in the generated output). I will be glad to provide more data if you need it.
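The code point check can be reproduced with a few lines of JavaScript. This is a minimal sketch that assumes the two prefixes ня and Ня from the example above; the helper name `codePoints` is hypothetical.

```javascript
// Hypothetical helper: format each character of a string as U+XXXX.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const lower = codePoints('ня'); // ['U+043D', 'U+044F']
const upper = codePoints('Ня'); // ['U+041D', 'U+044F']

console.log(lower, upper);
// Only the first letter differs in case; я is the same code point in both.
console.log(lower[1] === upper[1]); // true
```

Only the first letter changes between the two strings, so a shared я is at least available for the generator to factor out.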