Some regex bugs - Githubissues

gfredericks / test.chuck

A utility library for test.check

Eclipse Public License 1.0

Some regex bugs #63

Closed gfredericks closed 4 years ago

gfredericks commented 5 years ago

A Java 8 bug that I somehow missed originally was reported in c228537a199801d. It might be tricky (assuming the pattern really has no matches), because I don't know whether we currently parse anything that has no matches. The debugging approach is probably to run it through the QE parsing method in the Pattern class and see what comes out the other end.
There are at least two new bugs on Java 9 or later, which I mentioned in #62: \X and \N{WHITE SMILING FACE}. \X can probably be parsed but not supported (unless the definition turns out to be super easy to implement), and the other one might be an easy lookup on the Character class or something; we'll see.

lvh commented 5 years ago

I think the answer to \N is:

(Character/codePointOf "WHITE SMILING FACE")
(Character/codePointOf "some nonexistent nonsense")

... which only exists in JDK9+. If you need it to work below that, there's CharacterName/getCodePoint but that appears to be a package-scoped class.
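
For reference, here is a minimal sketch of the same lookup written directly against the Java API, assuming JDK 9 or later. The class name `NamedCodePoint` and the `-1` sentinel are hypothetical choices for illustration; the only library call used is `Character.codePointOf`, which throws `IllegalArgumentException` for a name it does not recognize.

```java
final class NamedCodePoint {
    /**
     * Returns the code point for a Unicode character name, or -1 if the
     * running JVM does not recognize the name. The -1 sentinel is an
     * illustrative choice, not something the JDK prescribes.
     */
    static int lookup(String name) {
        try {
            return Character.codePointOf(name); // available on JDK 9+ only
        } catch (IllegalArgumentException e) {
            return -1; // unknown character name
        }
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", lookup("WHITE SMILING FACE")); // prints U+263A
        System.out.println(lookup("some nonexistent nonsense"));     // prints -1
    }
}
```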

gfredericks commented 5 years ago

I don't think the \N construct is a valid regex pre-JDK9 -- my goal with this functionality is to correctly parse/interpret things according to re-pattern's behavior -- i.e., parsing and interpreting relative to the jvm you're running on.

There's already one or two variable features for things that differ between 7 and 8. I just did all this work prior to 9.

Probably don't need to support 7 anymore (since clojure doesn't, I don't think?), so some of that variability can be removed.
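
One way to picture "relative to the JVM you're running on" is to ask the running JVM whether it accepts a construct at all. The sketch below is only an illustration of that idea, not how test.chuck decides anything; `RegexFeatureProbe` and `compiles` are hypothetical names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RegexFeatureProbe {
    /** True if the running JVM's Pattern compiler accepts the regex source. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false; // this JVM rejects the construct
        }
    }

    public static void main(String[] args) {
        // The results depend on the JVM version, which is the point of the probe.
        System.out.println("\\N{...}: " + compiles("\\N{WHITE SMILING FACE}"));
        System.out.println("\\X: " + compiles("\\X"));
    }
}
```

A probe like this only says whether a pattern compiles; it says nothing about what the pattern matches, so it cannot replace version-specific parsing logic.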

gfredericks commented 5 years ago

and yes, Character/codePointOf looks like exactly what we'd need, thanks for looking that up

gfredericks commented 5 years ago

(I'm planning on digging into this in early July if nobody else gets to it first)

gfredericks commented 4 years ago

Just pushed fixes for both of these. \X is parsed but unsupported, \c\Q0 does the correct (insane) thing, and \N{...} is fully supported. Additionally, large code-points are now supported with \x and \u literals.
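
As a rough illustration of the JVM behavior these fixes have to agree with (this is plain `java.util.regex`, not test.chuck's API), the sketch below matches a code point above U+FFFF through a braced hex escape and through a surrogate pair of u-escapes, plus a named character. It assumes JDK 9 or later for the named form; `LargeCodePoints` and `matchesWhole` are hypothetical names.

```java
import java.util.regex.Pattern;

final class LargeCodePoints {
    // U+1F600 lies outside the Basic Multilingual Plane, so it is stored
    // as a surrogate pair (two chars) in a Java String.
    static final String GRINNING_FACE = new String(Character.toChars(0x1F600));

    /** True if the whole input matches the regex on the running JVM. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Braced hex escape naming the code point directly.
        System.out.println(matchesWhole("\\x{1F600}", GRINNING_FACE));
        // Two consecutive u-escapes forming the surrogate pair D83D DE00.
        System.out.println(matchesWhole("\\uD83D\\uDE00", GRINNING_FACE));
        // Named character escape (JDK 9+).
        System.out.println(matchesWhole("\\N{WHITE SMILING FACE}", "\u263A"));
    }
}
```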