Regex for unicode characters by code point

Engelberg / instaparse

Eclipse Public License 1.0


Regex for unicode characters by code point #218

Closed by getify 1 year ago

getify commented 1 year ago

I've been using a helpful web tool that is built on a ClojureScript port of instaparse. The tool lets me author and test productions in a browser page, which I've found very convenient, especially when collaborating with others.

Unfortunately, I've run into a problem with regexes. I was using a syntax that I believe is valid for Clojure/Java regexes, specifically for specifying Unicode characters by code point.

Details of my issue are here: https://github.com/mdkrajnak/ebnftest/issues/1

I wanted to cross-post here in case you could share some insight (or any links you may have) into which regex syntax I need to use. Thank you.

Engelberg commented 1 year ago

On Clojure, instaparse uses Java regexes. I'm less familiar with the ClojureScript port, but it wouldn't surprise me if it uses whatever regexes JavaScript supports, specifically whatever is supported by the Google Closure compiler that ClojureScript uses.

getify commented 1 year ago

Thank you. That should have been obvious to me, but I hadn't thought to check.

I just searched through the web tool's minified bundle (which includes the instaparse port), and its uses of JS RegExp indeed do not seem to pass the unicode flag. So I think that explains my issue.

I'm not sure how I'll work around this, but I appreciate the pointer.

Just curious: do Clojure or instaparse have any facility that could force the underlying regex to be Unicode-aware? I suppose, as you mentioned, that it's entirely up to the compiler (Google Closure).
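For anyone following along, here is a minimal TypeScript sketch of why the missing flag matters. The pattern text below is an assumption chosen for illustration (the actual pattern is in the linked issue), and the variable names are hypothetical. It only shows that a `\u{...}` code point escape is interpreted as a code point when the `u` flag is passed to RegExp, and not otherwise.

```typescript
// Assumed pattern text: a code point escape for U+1F600 (GRINNING FACE).
// The real pattern from the linked issue may differ.
const patternSource = "\\u{1F600}";

// U+1F600 written as its UTF-16 surrogate pair, so this file does not
// depend on any ES2015+ string helpers.
const grinningFace = String.fromCharCode(0xd83d, 0xde00);

// Without the "u" flag, "\u{1F600}" is not read as a code point escape,
// so the pattern does not match the character.
const withoutUnicodeFlag = new RegExp(patternSource);
const matchesWithoutFlag = withoutUnicodeFlag.test(grinningFace); // false

// With the "u" flag, the same source text denotes the code point U+1F600.
const withUnicodeFlag = new RegExp(patternSource, "u");
const matchesWithFlag = withUnicodeFlag.test(grinningFace); // true

console.log({ matchesWithoutFlag, matchesWithFlag });
```

If the bundle builds its regexes the first way, a code point escape in a production would silently fail to match, which is consistent with what I'm seeing.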

Engelberg commented 1 year ago

I can't think of anything beyond whatever ClojureScript does.

getify commented 1 year ago

I worked around my issue by not using a regex and just including the string literal characters in my productions.
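In case it helps someone else, the workaround can be sketched roughly as follows. This is only an illustration under assumptions: the helper names, the production name `face`, and the code points are hypothetical, and the helper does not escape quote or backslash characters. It turns a list of code points into an alternation of quoted string literals that can be pasted into a production in place of a regex.

```typescript
// Convert one code point to a string. Code points above U+FFFF become a
// UTF-16 surrogate pair (String.fromCodePoint does the same on ES2015+).
function fromCodePoint(codePoint: number): string {
  if (codePoint <= 0xffff) {
    return String.fromCharCode(codePoint);
  }
  const offset = codePoint - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

// Build an alternation of single-quoted string literals, e.g. 'a' | 'b'.
// Assumption: none of the characters is a quote or backslash.
function literalAlternation(codePoints: number[]): string {
  return codePoints.map((cp) => "'" + fromCodePoint(cp) + "'").join(" | ");
}

// Hypothetical production name and code points, for illustration only.
const production = "face = " + literalAlternation([0x1f600, 0x1f601]);
console.log(production);
```

Because the characters appear as plain string literals, nothing depends on how the underlying regex engine treats code point escapes.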