ammar / regexp_parser

A regular expression parser library for Ruby
MIT License

Improve set handling #55

Closed: jaynetics closed this 6 years ago

jaynetics commented 6 years ago

This makes CharacterSet a standard Subexpression as suggested in https://github.com/ammar/regexp_parser/issues/47#issue-275073366

Tokens that have an equivalent outside of sets now result in the same Scanner and Parser emissions inside sets.

New CharacterSet::Range and CharacterSet::Intersection expressions represent ranges and intersections as their own subtrees.

Other notable changes are:

| example | from type, token | to type, token | from exp | to exp |
| --- | --- | --- | --- | --- |
| [\b] | :set, :backspace | :escape, :backspace | none/String | ES::Backspace |
| [[:xy:]] | :set, :char_xy | :posixclass, :xy | none/String | PosixClass |
| [[:^xy:]] | :set, :char_nonxy | :nonposixclass, :xy | none/String | PosixClass |
| \x20 | :escape, :hex | :escape, :hex | ES::Literal | ES::Hex |
| \x20 | :escape, :octal | :escape, :octal | ES::Literal | ES::Octal |
| \u1234 | :escape, :codepoint | :escape, :codepoint | ES::Literal | ES::Codepoint |
| \u{12 34} | :escape, :codepoint_list | :escape, :codepoint_list | ES::Literal | ES::CodepointList |
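
For readers less familiar with these constructs, here is a small plain-Ruby sketch of how the kinds of patterns in the table match. It only uses Ruby's built-in Regexp, not regexp_parser, so it says nothing about the emitted tokens themselves. The constant names and the `\040` octal example are illustrative assumptions, not taken from the PR.

```ruby
# Plain Ruby only; constant names are made up for this illustration.

# Inside a set, \b is a backspace character (0x08), not a word boundary.
BACKSPACE_IN_SET = "a\bc".scan(/[\b]/)
WORD_STARTS      = "ab cd".scan(/\b\w/)

# POSIX classes and their negated form, written inside a set.
DIGIT_CHARS     = "a1b2".scan(/[[:digit:]]/)
NON_DIGIT_CHARS = "a1b2".scan(/[[:^digit:]]/)

# Hex, octal, and codepoint escapes can all name the same character (a space).
SPACE_ESCAPES_AGREE = [/\x20/, /\040/, /\u0020/].all? { |re| re.match?(" ") }

# A codepoint list matches its codepoints in sequence ("a" then "b").
CODEPOINT_LIST_MATCH = "xaby"[/\u{61 62}/]
```

The first pair is why `[\b]` belongs with the escapes rather than with set-specific tokens: the same two characters mean something different depending on whether they appear in a set.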

@ammar What do you think? The commit messages provide a bit more explanation if you are wondering about some of the changes, but feel free to suggest any other solution.

jaynetics commented 6 years ago

Turns out I should have read the docs... https://github.com/k-takata/Onigmo/blob/79114095/doc/RE#L155-L156

Intersections apply to all expressions in their set, not just adjacent ones.

'abc1'.scan(/[a b \d && b c [:digit:]]/x) # => ["b", "1"]
'abc1'.scan(/[^a b \d && b c [:digit:]]/x) # => ["a", "c"]
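
As a cross-check, the two results above can be reproduced by intersecting the members on each side of `&&` by hand. This is only a sketch in plain Ruby; the constant names are made up for illustration, and the member lists cover just the characters relevant to the `'abc1'` subject.

```ruby
require "set"

ASCII_DIGITS = ("0".."9").to_a

# Members on each side of &&, written out by hand.
LEFT_MEMBERS  = Set.new(%w[a b] + ASCII_DIGITS) # a b \d
RIGHT_MEMBERS = Set.new(%w[b c] + ASCII_DIGITS) # b c [:digit:]
IN_BOTH       = LEFT_MEMBERS & RIGHT_MEMBERS

SUBJECT = "abc1"

# The whole left side is intersected with the whole right side,
# not just the members adjacent to &&.
BY_HAND         = SUBJECT.chars.select { |c| IN_BOTH.include?(c) }
BY_HAND_NEGATED = SUBJECT.chars.reject { |c| IN_BOTH.include?(c) }

BY_REGEXP         = SUBJECT.scan(/[a b \d && b c [:digit:]]/x)
BY_REGEXP_NEGATED = SUBJECT.scan(/[^a b \d && b c [:digit:]]/x)
```

If only the adjacent members `\d` and `b` were intersected, `"1"` could not appear in the first result, so the hand-computed sets agreeing with the scans supports the reading of the Onigmo docs.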

So maybe Intersection parse results need to look somewhat like this:

RP.parse(/[a&&b]/).first.first # =>
  #<Intersection @expressions=[
    #<Intersection::Left @expressions=[
      #<Literal @text="a"/>
    ]/>,
    #<Intersection::Right @expressions=[
      #<Literal @text="b"/>
    ]/>
  ]/>

Now that would require quite a bit of tree restructuring while parsing.

Not to mention that there can be more than one intersection:

'abc1&'.scan(/[abc && ab && bc]/x) # => ["b"]
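
To make the restructuring concrete, here is a rough sketch of grouping a flat list of set members into one sequence per intersection operand, which handles any number of `&&` operators. Everything in it is hypothetical (the `INTERSECTION` marker, both method names, and the use of plain strings as members); it is not regexp_parser's API or the approach this PR ends up taking.

```ruby
# Hypothetical sketch, not regexp_parser's API.
# Set members are plain strings here; INTERSECTION stands in for an && token.
INTERSECTION = :"&&"

# Split a flat member list into one sequence per intersection operand.
def split_on_intersections(members)
  sequences = [[]]
  members.each do |member|
    if member == INTERSECTION
      sequences << [] # start collecting the next operand
    else
      sequences.last << member
    end
  end
  sequences
end

# Characters matched by the set: those present in every operand.
def intersected_chars(members)
  split_on_intersections(members).map(&:uniq).reduce(:&)
end

# Flat members of [abc&&ab&&bc], with whitespace left out for brevity.
FLAT_MEMBERS = ["a", "b", "c", INTERSECTION, "a", "b", INTERSECTION, "b", "c"]

OPERANDS = split_on_intersections(FLAT_MEMBERS)
# => [["a", "b", "c"], ["a", "b"], ["b", "c"]]
COMMON = intersected_chars(FLAT_MEMBERS)
# => ["b"]
```

A set without any `&&` yields a single sequence, so the same code path would cover ordinary sets too.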

Another option could be to treat Sets as a group of Sequences by default. That, however, might make them harder to handle, just to support this somewhat exotic feature.

Hmmm ...

jaynetics commented 6 years ago

I'm reasonably happy with this now ...