ammar / regexp_parser

A regular expression parser library for Ruby
MIT License
143 stars 22 forks source link

Rethink fallbacks for formally incorrect grammar #63

Closed zverok closed 4 years ago

zverok commented 4 years ago

Hi, and thanks for the awesome gem!

Recently regexp_parser started to be used in Rubocop to check regexp redundancy.

That led to uncovering of what can be considered as a bug (rubocop bug: https://github.com/rubocop-hq/rubocop/issues/8083). When parsing regexps like this, for example: /{.+}/ (which is valid Ruby regexp), regexp_parser fails (thinking that {} is incorrect quantifier). Same is related to some other forms, like /]\[/

I found out that the matter was discussed at #15, with verdict being, that it

...is an implementation quirk of the regex engine. In other words, it's not a documented feature.

...Hence I propose to never even try to implement "Ruby" but implement a sane subset, explicitly not supporting stuff that does not make sense outside MRI implementation quirks.

...It raises exceptions now, keep it like this. But document the fact that regexp_parser does not support each MRI quirk.

Actually, I believe that it is not "MRI quirk", but sane behavior of the Regexp parser, that some characters have special meaning only in context. The behavior about parsing {something that is not a quantifier}, and ] is consistent through:

So, it seems that parser that fails on those cases becomes less useful than it might be.

jaynetics commented 4 years ago

the argument about usability and prevalence in other engines looks really convincing to me.

regexp_parser seems to be used more and more on regular expressions encountered "in the wild", and there is no good workaround for these cases there.

as far as rubocop goes, you might catch any PrematureEndError and then skip the affected cop, but this is obviously not very appealing.

generally speaking, the current limitation might push people to either not use regexp_parser for such cases, or to attempt to pre-escape their input with their own regexp-scanning implementation, or catch the error and unexpectedly do nothing.

so i'm all on board. any opposition, @ammar?

if there is no documentation it might be best to look at the onigmo code for details about the behavior.

some things to keep in mind / investigate:

ammar commented 4 years ago

Thanks for the mention @jaynetics.

I also think that is convincing for some cases. Also, for ruby at least, the supported variants of regexp have stabilized (for example: https://bugs.ruby-lang.org/issues/8133#note-5)

I think the following cases are reasonable:

I have doubts about other empty or unbalanced cases, like ().

I can take a stab at it this coming weekend.

jaynetics commented 4 years ago

balanced curlies with content that doesn't match \d+(,\d*)? are also consistently treated as literals across rubies and other languages:

/a{2, 3}/ =~ 'a{2, 3}' # => 0
/a{2,3,4}/ =~ 'a{2,3,4}' # => 0

empty () also seem to be widely supported and treated as group that always matches. there are some use cases for that, too - breaking runs of other elements such as backref numbers, or achieving a desired numbering for following captures.

ammar commented 4 years ago

Treating a {...} that don't match \d+(,\d*)? as literals makes sense.

I haven't encountered that usage of (). If it emits an empty group, instead of a literal, then that makes sense.

My understanding of the concerns of the parser has evolved over time. I no longer see "validation" as one of them. These cases enforce that understanding.