Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str}

backus commented 8 years ago

Example:

2.3.0 :001 > /\#{str}/
 => /\#{str}/
2.3.0 :003 > require 'regexp_parser'
 => true
2.3.0 :004 > Regexp::Parser.parse('\#{str}')
Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str}
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/scanner.rb:1698:in `scan'
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/lexer.rb:20:in `lex'
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/parser.rb:26:in `parse'
  from (irb):4
  from /Users/johnbackus/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'

I don't understand yet what the source of this issue is

jaynetics commented 8 years ago

I'm also interested in this, and some issues which are probably related:

Regexp::Parser.parse('{str}') # => PrematureEndError
Regexp::Parser.parse('\#{}') # => NoMethodError
Regexp::Parser.parse('{}') # => NoMethodError

I guess the parser is trying to find a full interval quantifier whenever it encounters a { that is not preceded by a (non-escaped) #.

backus commented 8 years ago

@janosch-x are you sure you get a NoMethodError for your third example? I get the following:

Regexp::Parser.parse(/{}/) # => ArgumentError: No valid target found for '{}' quantifier

backus commented 8 years ago

Also I think the rough explanation for the problem is this:

The parser assumes that when it encounters the { token it is going to parse either {\d} or {\d,\d}. These would be the patterns for forming a valid quantifier.

The issue is that Ruby seems to default to just matching on plain text if the quantifier doesn't make sense. For example, the regex /\Aa{a}\z/ doesn't make sense if you view the {} as a quantifier so ruby just doesn't parse it as an quantifier:

2.3.0 :036 > regex = /\Aa{a}\z/
 => /\Aa{a}\z/
2.3.0 :037 > regex =~ 'a'
 => nil
2.3.0 :038 > regex =~ 'a{a}'
 => 0
2.3.0 :039 > Regexp::Parser.parse(regex)
Regexp::Scanner::PrematureEndError: Premature end of pattern at a{a}\z

Unfortunately this doesn't seem to be part of the grammar for this gem.

ammar commented 8 years ago

I haven't investigated this in depth yet, but it looks like the treatment of a { and } as literals in certain cases is an implementation quirk of the regex engine. In other words, it's not a documented feature.

Ruby's documentation explicitly states that meta characters must be backslash-escaped when they are used as literals:

The following are metacharacters (, ), [, ], {, }, ., ?, +, *. They have a specific
meaning when appearing in a pattern. To match them literally they must be
backslash-escaped.

Unless there is documentation that details the cases in which unescaped meta characters will be treated as literals, then I think it is safe to consider this behavior an implementation quirk.

I consider using and counting on such quirks to be bad practice, and I prefer not to make the parser accommodate their use.

Despite that, I understand the impact of such issues on the suitability of the parser for certain applications. I would like to find a balance between keeping the parser free from supporting quirks and correctly detecting them. Perhaps by adding a validation phase, which runs before the scanner and issues warnings or errors for questionable patterns like this and the one in issue #3 (which I should update with my findings and mark as wont fix).

I would like to dig a little deeper to see how Ruby represents these patterns internally. If it is fixed and predictable, I might reconsider addressing them.

mbj commented 8 years ago

@ammar On a specification that is as vague as Ruby:

Ruby is defined by its implementation
Bugs are present
Some bugs are declared features because to many depend on them
Lots of churn in MRI itself
Absence of MRI doing corpus testing

Its not possible to correctly re-implement "Ruby" because the definition of "Ruby" is under steady flux.

Hence I propose to never even try to implement "Ruby" but implement a sane subset, explicitly not supporting stuff that does not make sense outside MRI implementation quirks.

On how to not explicitly support something I heavily recommend raising errors instead of warnings, because this makes it much easier for downstream developers to realize: Okay MRI edge case, do not provide such an input.

mbj commented 8 years ago

@ammar AKA Imo: Ideally regexp_parser does one of the following on this case:

Nothing (but document the fact in the readme)
A dedicated exception

backus commented 8 years ago

@mbj what does doing "Nothing (but document the fact in the readme)" look like?

mbj commented 8 years ago

@backus It raises exceptions now, keep it like this. But document the fact that regexp_parser does not support each MRI quirk.

ammar commented 8 years ago

@mbj Thank you for chiming in. Those are very good points about the discrepancies between declared features and the actual implementation.

Regarding what to do, I agree, and think that a PrematureEndError is a fitting exception in this case. I will update the Supported Syntax section of the README to note that regexp_parser does not support MRI's quirks.

backus commented 8 years ago

Closing this for now since I think the current error is appropriate

ammar / regexp_parser

Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str} #15