Closed by backus 8 years ago
I'm also interested in this, and some issues which are probably related:
Regexp::Parser.parse('{str}') # => PrematureEndError
Regexp::Parser.parse('\#{}') # => NoMethodError
Regexp::Parser.parse('{}') # => NoMethodError
I guess the parser is trying to find a full interval quantifier whenever it encounters a { that is not preceded by a (non-escaped) #.
@janosch-x are you sure you get a NoMethodError for your third example? I get the following:
Regexp::Parser.parse(/{}/) # => ArgumentError: No valid target found for '{}' quantifier
Also I think the rough explanation for the problem is this:
The parser assumes that when it encounters the { token it is going to parse either {\d} or {\d,\d}. These would be the patterns for forming a valid quantifier.
The issue is that Ruby seems to fall back to matching the plain text if the quantifier doesn't make sense. For example, the regex /\Aa{a}\z/ doesn't make sense if you view the {} as a quantifier, so Ruby just doesn't parse it as a quantifier:
2.3.0 :036 > regex = /\Aa{a}\z/
=> /\Aa{a}\z/
2.3.0 :037 > regex =~ 'a'
=> nil
2.3.0 :038 > regex =~ 'a{a}'
=> 0
2.3.0 :039 > Regexp::Parser.parse(regex)
Regexp::Scanner::PrematureEndError: Premature end of pattern at a{a}\z
Unfortunately this doesn't seem to be part of the grammar for this gem.
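To make the distinction above concrete, here is a minimal sketch that classifies a brace group as an interval quantifier or as plain text. The names `INTERVAL` and `interval_quantifier?` are hypothetical and not part of regexp_parser. The pattern also accepts the open-ended forms {n,} and {,m}, which Ruby's Regexp documentation lists but this thread does not mention.

```ruby
# Hypothetical helper for illustration, not part of regexp_parser.
# Accepts the interval forms {n}, {n,}, {,m} and {n,m}.
INTERVAL = /\A\{(?:\d+(?:,\d*)?|,\d+)\}\z/

# Returns true when text is a well-formed interval quantifier,
# false when it would have to be read as literal text.
def interval_quantifier?(text)
  INTERVAL.match?(text)
end

interval_quantifier?('{3}')   # => true
interval_quantifier?('{1,2}') # => true
interval_quantifier?('{a}')   # => false
interval_quantifier?('{}')    # => false
```

Anything the helper rejects corresponds to the cases in this thread where Ruby matches the braces literally and the parser raises.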
I haven't investigated this in depth yet, but it looks like the treatment of { and } as literals in certain cases is an implementation quirk of the regex engine. In other words, it's not a documented feature.
Ruby's documentation explicitly states that meta characters must be backslash-escaped when they are used as literals:
The following are metacharacters (, ), [, ], {, }, ., ?, +, *. They have a specific
meaning when appearing in a pattern. To match them literally they must be
backslash-escaped.
Unless there is documentation that details the cases in which unescaped meta characters will be treated as literals, I think it is safe to consider this behavior an implementation quirk.
I consider using and counting on such quirks to be bad practice, and I prefer not to make the parser accommodate their use.
Despite that, I understand the impact of such issues on the suitability of the parser for certain applications. I would like to find a balance between keeping the parser free from supporting quirks and correctly detecting them. Perhaps by adding a validation phase, which runs before the scanner and issues warnings or errors for questionable patterns like this and the one in issue #3 (which I should update with my findings and mark as wont fix).
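One way such a validation phase could look is sketched below. This is only an illustration under stated assumptions: `questionable_braces` and `INTERVAL_AT` are hypothetical names, nothing here exists in regexp_parser, and the check treats every unescaped { the same way. It would therefore also flag braces inside character classes and in escapes such as \p{Alpha}, which a real implementation would need to handle.

```ruby
# Hypothetical pre-scan check for illustration, not part of regexp_parser.
# Matches a well-formed interval quantifier at the start of a string.
INTERVAL_AT = /\A\{(?:\d+(?:,\d*)?|,\d+)\}/

# Returns one warning per unescaped { that does not begin an interval quantifier.
def questionable_braces(source)
  warnings = []
  pos = 0
  while pos < source.length
    case source[pos]
    when '\\'
      pos += 2 # skip the backslash and the character it escapes
      next
    when '{'
      unless source[pos..-1].match?(INTERVAL_AT)
        warnings << "unescaped '{' at offset #{pos} is not a quantifier"
      end
    end
    pos += 1
  end
  warnings
end

questionable_braces('a{2}')  # => []
questionable_braces('a{a}')  # => ["unescaped '{' at offset 1 is not a quantifier"]
```

Returning a list of messages leaves the choice between warning and raising to the caller.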
I would like to dig a little deeper to see how Ruby represents these patterns internally. If it is fixed and predictable, I might reconsider addressing them.
@ammar On a specification that is as vague as Ruby:
It's not possible to correctly re-implement "Ruby" because the definition of "Ruby" is under steady flux.
Hence I propose to never even try to implement "Ruby" but implement a sane subset, explicitly not supporting stuff that does not make sense outside MRI implementation quirks.
On how to not explicitly support something I heavily recommend raising errors instead of warnings, because this makes it much easier for downstream developers to realize: Okay MRI edge case, do not provide such an input.
@ammar AKA Imo: Ideally regexp_parser does one of the following in this case:
@mbj what does doing "Nothing (but document the fact in the readme)" look like?
@backus It raises exceptions now, keep it like this. But document the fact that regexp_parser does not support each MRI quirk.
@mbj Thank you for chiming in. Those are very good points about the discrepancies between declared features and the actual implementation.
Regarding what to do, I agree, and think that a PrematureEndError is a fitting exception in this case. I will update the Supported Syntax section of the README to note that regexp_parser does not support MRI's quirks.
Closing this for now since I think the current error is appropriate
Example:
I don't understand yet what the source of this issue is