ammar / regexp_parser

A regular expression parser library for Ruby
MIT License
143 stars 22 forks source link

Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str} #15

Closed backus closed 8 years ago

backus commented 8 years ago

Example:

2.3.0 :001 > /\#{str}/
 => /\#{str}/
2.3.0 :003 > require 'regexp_parser'
 => true
2.3.0 :004 > Regexp::Parser.parse('\#{str}')
Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str}
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/scanner.rb:1698:in `scan'
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/lexer.rb:20:in `lex'
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/parser.rb:26:in `parse'
  from (irb):4
  from /Users/johnbackus/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'

I don't understand yet what the source of this issue is

jaynetics commented 8 years ago

I'm also interested in this, and some issues which are probably related:

Regexp::Parser.parse('{str}') # => PrematureEndError
Regexp::Parser.parse('\#{}') # => NoMethodError
Regexp::Parser.parse('{}') # => NoMethodError

I guess the parser is trying to find a full interval quantifier whenever it encounters a { that is not preceded by a (non-escaped) #.

backus commented 8 years ago

@janosch-x are you sure you get a NoMethodError for your third example? I get the following:

Regexp::Parser.parse(/{}/) # => ArgumentError: No valid target found for '{}' quantifier
backus commented 8 years ago

Also I think the rough explanation for the problem is this:

The parser assumes that when it encounters the { token it is going to parse either {\d} or {\d,\d}. These would be the patterns for forming a valid quantifier.

The issue is that Ruby seems to default to just matching on plain text if the quantifier doesn't make sense. For example, the regex /\Aa{a}\z/ doesn't make sense if you view the {} as a quantifier so ruby just doesn't parse it as an quantifier:

2.3.0 :036 > regex = /\Aa{a}\z/
 => /\Aa{a}\z/
2.3.0 :037 > regex =~ 'a'
 => nil
2.3.0 :038 > regex =~ 'a{a}'
 => 0
2.3.0 :039 > Regexp::Parser.parse(regex)
Regexp::Scanner::PrematureEndError: Premature end of pattern at a{a}\z

Unfortunately this doesn't seem to be part of the grammar for this gem.

ammar commented 8 years ago

I haven't investigated this in depth yet, but it looks like the treatment of a { and } as literals in certain cases is an implementation quirk of the regex engine. In other words, it's not a documented feature.

Ruby's documentation explicitly states that meta characters must be backslash-escaped when they are used as literals:

The following are metacharacters (, ), [, ], {, }, ., ?, +, *. They have a specific
meaning when appearing in a pattern. To match them literally they must be
backslash-escaped.

Unless there is documentation that details the cases in which unescaped meta characters will be treated as literals, then I think it is safe to consider this behavior an implementation quirk.

I consider using and counting on such quirks to be bad practice, and I prefer not to make the parser accommodate their use.

Despite that, I understand the impact of such issues on the suitability of the parser for certain applications. I would like to find a balance between keeping the parser free from supporting quirks and correctly detecting them. Perhaps by adding a validation phase, which runs before the scanner and issues warnings or errors for questionable patterns like this and the one in issue #3 (which I should update with my findings and mark as wont fix).

I would like to dig a little deeper to see how Ruby represents these patterns internally. If it is fixed and predictable, I might reconsider addressing them.

mbj commented 8 years ago

@ammar On a specification that is as vague as Ruby:

Its not possible to correctly re-implement "Ruby" because the definition of "Ruby" is under steady flux.

Hence I propose to never even try to implement "Ruby" but implement a sane subset, explicitly not supporting stuff that does not make sense outside MRI implementation quirks.

On how to not explicitly support something I heavily recommend raising errors instead of warnings, because this makes it much easier for downstream developers to realize: Okay MRI edge case, do not provide such an input.

mbj commented 8 years ago

@ammar AKA Imo: Ideally regexp_parser does one of the following on this case:

backus commented 8 years ago

@mbj what does doing "Nothing (but document the fact in the readme)" look like?

mbj commented 8 years ago

@backus It raises exceptions now, keep it like this. But document the fact that regexp_parser does not support each MRI quirk.

ammar commented 8 years ago

@mbj Thank you for chiming in. Those are very good points about the discrepancies between declared features and the actual implementation.

Regarding what to do, I agree, and think that a PrematureEndError is a fitting exception in this case. I will update the Supported Syntax section of the README to note that regexp_parser does not support MRI's quirks.

backus commented 8 years ago

Closing this for now since I think the current error is appropriate