Closed ticky closed 7 years ago
Upon further inspection, it seems simply enabling es6 in jsesc doesn’t get around the fact that the regexgen algorithm is operating on bytes!
Regenerate (used in regexgen) is operating on UCS-2/UTF-16-like code units because that’s what JavaScript does too.
For non-{UCS-2,UTF-16} languages, generating the output would be much simpler indeed.
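To make the code-unit point concrete, here is a small sketch (plain JavaScript, no libraries; the variable names are only illustrative) showing how a single astral symbol such as U+1F469 appears as two UTF-16 code units to anything that walks a JavaScript string by index:

```javascript
// U+1F469 WOMAN lies outside the Basic Multilingual Plane.
const woman = String.fromCodePoint(0x1F469);

// Walking the string by index yields UTF-16 code units (a surrogate pair).
const codeUnits = [];
for (let i = 0; i < woman.length; i++) {
  codeUnits.push(woman.charCodeAt(i).toString(16).toUpperCase());
}

// The ES6 string iterator walks by code point instead.
const codePoints = Array.from(woman, (ch) =>
  ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(codeUnits);  // [ 'D83D', 'DC69' ]
console.log(codePoints); // [ '1F469' ]
```

A generator that works on code units will therefore emit the surrogate pair, which is what trips up engines that are not UTF-16 based.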
@mathiasbynens yeah, this issue is actually several yaks deep in my trying to find a way to get as accurate an emoji regex as you provide in emoji-regex, but compatible with Ruby (which complains about the surrogate pairs)!
@mathiasbynens doesn't Regenerate have a unicode es6 mode as well?
So it does! I’ve just whipped together a proof-of-concept branch which implements just that. It’s not pretty, but I’m hoping to verify it does what I need before I polish it further! 😅
It sure seems to work! The trouble is it requires passing the flag (presumably -u, to match the ES6 unicode flag) a couple of levels deep. Not sure how you’d prefer to do that, @devongovett!
$ ./bin/cli.js -gu 👩🏻 👩🏻🔬
/\u{1F469}\u{1F3FB}(?:\u200D\u{1F52C})?/gu
Note that, because the output obviously isn’t valid under Node yet, it’s had to sidestep using the actual RegExp constructor.
The unicode flag should work in Node 6, which regexgen requires anyway. But you're right, we don't currently pass the flags to the generation phase.
Huh, so it does! I think I was doing something wrong before when I was getting an error!
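For anyone wanting to check this themselves, a quick sketch: it feeds the pattern from the CLI output above to the real RegExp constructor with the u flag, on a Node version that supports that flag. The test strings are built from code points here so the zero-width joiner is explicit, and the helper name is made up for the example.

```javascript
// Pattern copied from the CLI output above, compiled with the `g` and `u` flags.
const pattern = new RegExp('\\u{1F469}\\u{1F3FB}(?:\\u200D\\u{1F52C})?', 'gu');

// Inputs built explicitly from code points so the ZWJ (U+200D) is visible.
const woman = String.fromCodePoint(0x1F469, 0x1F3FB);
const scientist = String.fromCodePoint(0x1F469, 0x1F3FB, 0x200D, 0x1F52C);

// Hypothetical helper: return every match, or an empty array if there are none.
const matchAll = (input) => input.match(pattern) || [];

console.log(matchAll(woman).length);            // 1
console.log(matchAll(scientist)[0] === scientist); // true
console.log(matchAll('no emoji here').length);  // 0
```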
> doesn't Regenerate have a unicode es6 mode as well?
It does, but I don’t know of any languages that support escape sequences of the form \u{…}, so I thought that wouldn’t be (directly) helpful.
Other than differing semantics for the \x escape (at most two hex digits), the ES6-style \u{} escapes work just fine in Ruby, for one! 😄
I’ve added #17, which seems to mostly work™ except for one meta character escaping regression I can’t quite wrap my head around. Any input on that would be most welcome! 😃
I’ve now also created mathiasbynens/emoji-regex#22, which implements #17’s updated behaviour. :smile:
The underlying jsesc library accepts an es6 parameter, which causes it to output ES6-compatible escapes. Exposing this option would be useful for, among other things, making the output of regexen more applicable to languages whose backing string format isn’t UTF-16!
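As a rough illustration of the difference between the two escape styles, here is a hand-rolled sketch. It is not jsesc's implementation and does not call jsesc. The function name escapeNonAscii and its es6 argument are made up for this example, and it ignores details a real escaper handles, such as \x escapes and regex metacharacters.

```javascript
// Hypothetical helper, NOT part of jsesc or regexgen: escapes non-ASCII
// characters either as UTF-16 code units or, for astral symbols when
// `es6` is true, as ES6 code point escapes.
function escapeNonAscii(input, es6) {
  let out = '';
  for (const ch of input) {            // the string iterator walks by code point
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {                   // leave ASCII untouched
      out += ch;
    } else if (es6 && cp > 0xFFFF) {   // astral symbol: \u{...}
      out += '\\u{' + cp.toString(16).toUpperCase() + '}';
    } else {                           // one \uXXXX per UTF-16 code unit
      for (let i = 0; i < ch.length; i++) {
        const unit = ch.charCodeAt(i).toString(16).toUpperCase();
        out += '\\u' + unit.padStart(4, '0');
      }
    }
  }
  return out;
}

const sample = String.fromCodePoint(0x1F469, 0x200D, 0x1F52C);
console.log(escapeNonAscii(sample, false)); // \uD83D\uDC69\u200D\uD83D\uDD2C
console.log(escapeNonAscii(sample, true));  // \u{1F469}\u200D\u{1F52C}
```

The first form only makes sense to an engine that thinks in UTF-16 code units; the second names code points directly, which is the property that makes it easier to carry over to other languages.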