Improve set handling

jaynetics commented 6 years ago

This makes CharacterSet a standard Subexpression as suggested in https://github.com/ammar/regexp_parser/issues/47#issue-275073366

Tokens that also exist outside of sets now result in the same Scanner and Parser emissions inside sets.

New CharacterSet::Range and CharacterSet::Intersection expressions represent ranges and intersections as their own subtrees.

Other notable changes are:

example	from type, token	to type, token	from exp	to exp
[\b]	:set, :backspace	:escape, :backspace	none/String	ES::Backspace
[[:xy:]]	:set, :char_xy	:posixclass, :xy	none/String	PosixClass
[[:^xy:]]	:set, :char_nonxy	:nonposixclass, :xy	none/String	PosixClass
\x20	:escape, :hex	:escape, :hex	ES::Literal	ES::Hex
\040	:escape, :octal	:escape, :octal	ES::Literal	ES::Octal
\u1234	:escape, :codepoint	:escape, :codepoint	ES::Literal	ES::Codepoint
\u{12 34}	:escape, :codepoint_list	:escape, :codepoint_list	ES::Literal	ES::CodepointList
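
For reference, the example patterns in the table can be checked against Ruby's own regexp engine, independent of regexp_parser. This is only a sketch of what the patterns match, not of the gem's emissions; the method name `set_escape_examples` is made up for illustration, and `\040` is used here as an assumed octal spelling of the space character.

```ruby
# Plain-Ruby checks (no regexp_parser involved) of what the table's
# example patterns actually match.
def set_escape_examples
  {
    # Inside a set, \b is the backspace character (U+0008) ...
    backspace_in_set: "a\bb" =~ /[\b]/,
    # ... while outside a set it is a zero-width word boundary.
    boundary_outside: "ab" =~ /\b/,
    # [:name:] and its negation [:^name:] are only valid inside a set.
    posix:    "a1 ".scan(/[[:digit:]]/),
    nonposix: "a1 ".scan(/[[:^digit:]]/),
    # Hex, octal and codepoint escapes can denote the same character.
    space_forms: [/\x20/, /\040/, /\u0020/].all? { |re| re.match?(" ") },
    # A codepoint list denotes several characters in sequence.
    codepoint_list: "ab"[/\u{61 62}/]
  }
end

p set_escape_examples
```

Since `[\b]` and `\b` mean different things, it seems reasonable that the set variant gets its own `:escape, :backspace` emission rather than a generic set token.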

@ammar What do you think? The commit messages provide a bit more explanation if you are wondering about some of the changes, but feel free to suggest any other solution.

jaynetics commented 6 years ago

Turns out I should have read the docs... https://github.com/k-takata/Onigmo/blob/79114095/doc/RE#L155-L156

Intersections apply to all expressions in their set, not just adjacent ones.

'abc1'.scan(/[a b \d && b c [:digit:]]/x) # => ["b", "1"]
'abc1'.scan(/[^a b \d && b c [:digit:]]/x) # => ["a", "c"]
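
A quick way to convince yourself of this, using only plain Ruby: compare what the combined set accepts with the intersection of what each side accepts on its own. The helper `accepted_chars` and the four-character sample alphabet are assumptions made for this sketch.

```ruby
# Returns the characters from a small sample alphabet that a
# single-character pattern accepts.
def accepted_chars(pattern, alphabet = %w[a b c 1])
  alphabet.select { |char| pattern.match?(char) }
end

left     = accepted_chars(/[ab\d]/)            # everything left of &&
right    = accepted_chars(/[bc[:digit:]]/)     # everything right of &&
combined = accepted_chars(/[a b \d && b c [:digit:]]/x)
negated  = accepted_chars(/[^a b \d && b c [:digit:]]/x)

p combined                                     # => ["b", "1"]
p combined == (left & right)                   # => true
p negated == (%w[a b c 1] - (left & right))    # => true
```

If `&&` only applied to the adjacent members (`\d` and `b`), the combined set would not accept `"1"` via `[:digit:]` in this way.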

So maybe Intersection parse results need to look somewhat like this:

RP.parse(/[a&&b]/).first.first # =>
  #<Intersection @expressions=[
    #<Intersection::Left @expressions=[
      #<Literal @text="a"/>
    ]/>,
    #<Intersection::Right @expressions=[
      #<Literal @text="b"/>
    ]/>
  ]/>

Now that would require quite a bit of tree restructuring while parsing.

Not to mention that there can be more than one intersection:

'abc1&'.scan(/[abc && ab && bc]/x) # => ["b"]
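
To illustrate the restructuring, here is a minimal sketch that groups a flat list of set members into any number of intersection operands. Nothing in it is regexp_parser code: `group_operands`, `member_of_all?` and the `:and` marker standing in for `&&` are all hypothetical names, and members are plain strings rather than expression objects.

```ruby
# Splits a flat member list on :and markers, so that each operand collects
# everything on its side of "&&", not just the neighbouring member.
def group_operands(members)
  operands = [[]]
  members.each do |member|
    if member == :and
      operands << []            # start the next operand
    else
      operands.last << member   # extend the current operand
    end
  end
  operands
end

# A character is in the set only if every operand accepts it.
def member_of_all?(operands, char)
  operands.all? { |members| Regexp.new("[#{members.join}]").match?(char) }
end

flat     = ["a", "b", "c", :and, "a", "b", :and, "b", "c"]
operands = group_operands(flat)

p operands  # => [["a", "b", "c"], ["a", "b"], ["b", "c"]]
p "abc1&".chars.select { |char| member_of_all?(operands, char) }  # => ["b"]
```

A list of operands handles two or more intersections uniformly, whereas a fixed Left/Right pair would have to be nested for each additional `&&`.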

Another option could be to treat Sets as a group of Sequences by default. That, however, might make them harder to handle, just for the sake of this somewhat exotic feature.

Hmmm ...

jaynetics commented 6 years ago

I'm reasonably happy with this now ...

ammar / regexp_parser

Improve set handling #55