Suggestion / TBD: improve options handling

jaynetics commented 7 years ago

Right now, it is hard to tell which options apply to a specific expression when passing in a ::Regexp with flags or when there are groups with option flags.

E.g.

P = Regexp::Parser
P.parse(/a/i).expressions[0].i?                      # => false 
P.parse(/(a)/i).expressions[0].i?                    # => false 
P.parse(/x(?i)a/).expressions.last.i?                # => false 
P.parse(/(?i:(a))/).expressions[0].expressions[0].i? # => false
P.parse(/(?i-i:a)/).expressions[0].i?                # => true (should be false)

To correctly determine wether a is case-insensitive, a user would have to keep track of the root options and any "currently active" option groups himself.

@ammar if you agree, I think we could improve this by setting a correct options hash for every Token, as follows:

add another field to the Token class, :options, in token.rb:11
remove Parser::options and move all option processing to the Lexer instead. imho currently active options are similar to group_level or set_level.
set up initial @options based on input in the Lexer (use an empty Hash if input is a ::String instead of a ::Regexp)
dup and update these @options whenever encountering a group with option flags
keep track of option group "history" to correctly handle nesting
split [:group, :options, ...] into two tokens: [:group, :options_local, ...] and [:group, :options_switch, ...], to allow differentiating between group-local option modifications (e.g. /a(?i:b)c/) and those that persist after the group is closed (e.g. /a(?i)bc/)
pass these @options to every new Token created in lexer.rb:30
remove custom handling of options from root.rb
change option methods in expression.rb from (@options and @options[:m]) ? true : false to !!@options[:m]

ammar commented 7 years ago

@janosch-x That's an interesting problem. I can see different interpretations, depending on which layer is being considered (scanner, lexer, or parser), and the use case.

For me, so far at least, a Token is a mid level representation, that does not capture the semantics of execution for the entire expression, it only captures the semantics of definition for individual expressions.

On the other hand, I can see an Expression as a representation of the semantics of execution, not that I ever meant for them to be executable.

My initial take might be colored by the fact that all the examples you gave are for Expression. So, I'm wondering if Expressions could 'ask' the expression that contains them what options are 'in-effect', if they don't have that answer themselves.

I think there's a bug at play too. I would have expected the following (without .expressions[0]) to return true:

P.parse(/a/i).i?

The last example is weird :), but I agree. I expected false.

jaynetics commented 7 years ago

The use case is as follows:

I want to convert Ruby regexes to allow them to work on simpler regex engines, e.g. JavaScript. To substitute option groups, I have conversions such as the following in mind:

fancy_ruby = /ab(?i:cd[ef])/
simplified = /ab(?:[cC][dD][eEfF])/

This is hardly an essential feature, so I would not want to worsen the clarity or technical debt of regexp_parser to achieve it.

For me, so far at least, a Token is a mid level representation, that does not capture the semantics of execution for the entire expression, it only captures the semantics of definition for individual expressions.

I might not have fully understood this distinction, but don't Tokens already have some attributes that cross the boundary from individual definition to execution terms, e.g. set_level?

So, I'm wondering if Expressions could 'ask' the expression that contains them what options are 'in-effect', if they don't have that answer themselves.

That was my initial thought as well. The problem is that not only the containing expression is relevant, but also potentially all proceeding expressions. E.g.

/ (a (?i) b (c)) (?i: d (?-i) e) /x
# b, c and d are case-insensitive, a and e are not

That's why I thought that making "currently active options" a "state" of one of the three layers and setting them for every output object might make sense.

ammar commented 7 years ago

Thanks. I understand the intention now.

I might not have fully understood this distinction, but don't Tokens already have some attributes that cross the boundary from individual definition to execution terms, e.g. set_level?

I don't see set_level? as serving execution per se. I see it serving the parser. I might be splitting hairs here though, especially since execution was never a goal.

Your suggestion makes perfect sense. It supports the main use case for regexp_parser: transformations. Also it doesn't seem like it would cause any harm to add significant complexity.

Do you it will require a minor version bump to 0.5.0? And is there anything I can help with?

jaynetics commented 7 years ago

Ok, you've convinced me to add it to the Parser instead of the Lexer...

Do you think it will require a minor version bump to 0.5.0?

Imho a patch version bump will do, as the old behavior was kind of counter-intuitive and it would be strange for anyone to rely on it.

Also, I did without adding a new token type :options_switch yet. We can make that potentially breaking change and remove some redundancy in v1.0.0.

And is there anything I can help with?

Take a look at the PR :)

ammar / regexp_parser

Suggestion / TBD: improve options handling #43