devongovett / regexgen

Generate regular expressions that match a set of strings
https://runkit.com/npm/regexgen

ES6 Escaping Support #15

Closed: ticky closed this issue 7 years ago

ticky commented 7 years ago

The underlying jsesc library accepts an es6 parameter, which causes it to output ES6-compatible escapes.

Exposing this option would be useful for, among other things, making the output of regexen more applicable to languages whose backing string format isn’t UTF-16!

ticky commented 7 years ago

Upon further inspection, it seems simply enabling es6 in jsesc doesn’t get around the fact that the regexgen algorithm is operating on bytes!

mathiasbynens commented 7 years ago

Regenerate (used in regexgen) is operating on UCS-2/UTF-16-like code units because that’s what JavaScript does too.

For non-{UCS-2,UTF-16} languages, generating the output would be much simpler indeed.
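
To illustrate the code-unit point above, here is a minimal sketch in plain JavaScript (no regexgen or Regenerate APIs are used; the variable names are just for illustration). It shows that a single emoji outside the Basic Multilingual Plane is stored as two UTF-16 code units, a surrogate pair, which is what a code-unit-based generator ends up seeing:

```javascript
// JavaScript strings are sequences of UTF-16 code units, so a character
// outside the Basic Multilingual Plane occupies two units (a surrogate pair).
const woman = '\u{1F469}'; // U+1F469 WOMAN

// Walk the string one code unit at a time.
const codeUnits = [];
for (let i = 0; i < woman.length; i++) {
  codeUnits.push(woman.charCodeAt(i).toString(16).toUpperCase());
}

// Array.from iterates by code point, so the pair is seen as one character.
const codePoints = Array.from(woman, (ch) =>
  ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(woman.length); // 2
console.log(codeUnits);    // [ 'D83D', 'DC69' ]
console.log(codePoints);   // [ '1F469' ]
```

A language whose strings are not UTF-16 would see only the single code point, which is why the surrogate pair in the generated output is a problem there.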

ticky commented 7 years ago

@mathiasbynens yeah, this issue is actually several yaks deep in my trying to find a way to get as accurate an emoji regex as you provide in emoji-regex, but compatible with Ruby (which complains about the surrogate pairs)!

devongovett commented 7 years ago

@mathiasbynens doesn't Regenerate have a unicode es6 mode as well?

ticky commented 7 years ago

So it does! I’ve just whipped together a proof-of-concept branch which implements just that. It’s not pretty, but I’m hoping to verify it does what I need before I polish it further! 😅

ticky commented 7 years ago

It sure seems to work! The trouble is it requires passing the flag (presumably -u, to match the ES6 unicode flag) a couple of levels deep. Not sure how you’d prefer to do that, @devongovett!

$ ./bin/cli.js -gu 👩🏻 👩🏻‍🔬
/\u{1F469}\u{1F3FB}(?:\u200D\u{1F52C})?/gu

Note that because it’s apparently not valid under Node yet, it has had to sidestep using the actual RegExp constructor.

devongovett commented 7 years ago

The unicode flag should work in node 6, which regexgen requires anyway. But you're right, we don't currently pass the flags to the generation phase.
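
As a quick check of the unicode flag, the pattern from the CLI output above can be handed to the standard RegExp constructor directly. This is only a sketch: the variable names are made up for illustration, and the g flag is left off so that repeated test() calls are not affected by lastIndex.

```javascript
// The pattern printed by the CLI example, built with the `u` flag so that
// \u{...} escapes are interpreted as code points.
const pattern = new RegExp('\\u{1F469}\\u{1F3FB}(?:\\u200D\\u{1F52C})?', 'u');

// The two inputs from the CLI example, written as escapes for clarity.
const womanLight = '\u{1F469}\u{1F3FB}';                     // woman, light skin tone
const scientistLight = '\u{1F469}\u{1F3FB}\u200D\u{1F52C}'; // woman scientist, light skin tone

const matchesWoman = pattern.test(womanLight);
// The optional group is greedy, so the whole ZWJ sequence is consumed.
const matchesWholeScientist =
  scientistLight.match(pattern)[0] === scientistLight;

console.log(matchesWoman);          // true
console.log(matchesWholeScientist); // true
```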

ticky commented 7 years ago

Huh, so it does! I think I was doing something wrong before when I was getting an error!

mathiasbynens commented 7 years ago

> doesn't Regenerate have a unicode es6 mode as well?

It does, but I don’t know of any languages that support escape sequences of the form \u{…}, so I thought that wouldn’t be (directly) helpful.

ticky commented 7 years ago

Other than differing semantics for the \x escape (must be two bytes), the ES6-style \u{} escapes work just fine in Ruby, for one! 😄

ticky commented 7 years ago

I’ve added #17, which seems to mostly work™ except for one meta character escaping regression I can’t quite wrap my head around. Any input on that would be most welcome! 😃

ticky commented 7 years ago

I’ve now also created mathiasbynens/emoji-regex#22, which implements #17’s updated behaviour. 😄