ammar / regexp_parser

A regular expression parser library for Ruby
MIT License

Overhaul of Set#members needed #47

Closed jaynetics closed 6 years ago

jaynetics commented 6 years ago

Right now, handling the content of character sets with regexp_parser is hard:

What I have in mind as a general solution is the following:

Thus, parsing /a[bc-d]/ could yield something like

#<Root @expressions=[
  #<Literal @type=:literal, @token=:literal, @text="a" >,
  #<CharacterSet @expressions=[
    #<Literal @type=:literal, @token=:literal, @text="b" >,
    #<Range @type=:set, @token=:range, @expressions=[
      #<Literal @type=:literal, @token=:literal, @text="c" >,
      #<Literal @type=:literal, @token=:literal, @text="d" >
    ]>
  ]>
]>

The only tricky bit is rewiring the ragel machines in the right way and catching all ranges. On the other hand, it would probably lead to less code, as special treatment is only needed for a few things within sets: the set tokens plus ., \b, and [:...:] if I am not mistaken.
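
To illustrate why a nested tree like the one above would be easier to work with, here is a minimal sketch. It does not use regexp_parser itself: `Node`, `literal`, `range`, `character_set`, and `matched_chars` are hypothetical names made up for this example, and the `:character` token is a placeholder. The point is only that, once ranges are sub-expressions, a consumer can recurse into set members instead of re-parsing strings such as "c-d".

```ruby
# Hypothetical stand-ins for the proposed expression classes. These are
# NOT the regexp_parser API, just plain structs mirroring the tree above.
Node = Struct.new(:type, :token, :text, :expressions)

def literal(text)
  Node.new(:literal, :literal, text, [])
end

def range(from, to)
  Node.new(:set, :range, nil, [literal(from), literal(to)])
end

def character_set(*members)
  # :character is a placeholder token name chosen for this sketch.
  Node.new(:set, :character, nil, members)
end

# Collect every character a (non-negated) set can match by recursing
# into its members, without parsing range text by hand.
def matched_chars(node)
  case node.token
  when :literal
    [node.text]
  when :range
    first, last = node.expressions.map(&:text)
    (first..last).to_a
  else
    node.expressions.flat_map { |member| matched_chars(member) }
  end
end

# The set from /a[bc-d]/
set = character_set(literal('b'), range('c', 'd'))
p matched_chars(set) # prints ["b", "c", "d"]
```

Negated sets, escapes, and POSIX classes are left out here; they would need their own branches in the `case`.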

What do you think, @ammar?

ammar commented 6 years ago

Thanks for this @janosch-x.

Character sets are the most poorly implemented part of the scanner. They barely satisfy the most basic cases.

I like the approach you outlined. Removing :subset and treating :set members as sub-expressions is a great idea. It should simplify the scanner and its use.

I wonder if extracting the character set scanning logic into a sub-machine, like properties.rl, would make it easier to implement the changes. It might require breaking scanner.rl into smaller parts, which could be a good thing. I'd like to explore that a little, soon hopefully.

jaynetics commented 6 years ago

I wonder if extracting the character set scanning logic into a sub-machine, like properties.rl, would make it easier to implement the changes.

That's what I thought, too. I might be able to help a bit with the whole thing, at least by setting up tests beforehand.

Another thing that just crossed my mind is that it could make sense to have an Intersection expression with subexpressions as well. Right now, you have to keep track of the preceding and succeeding set member yourself to find out what is being intersected. We could do something like this instead:

Regexp::Parser.parse(/[a-c&&b]/) # =>
#<Root @expressions=[
  #<CharacterSet @expressions=[
    #<Intersection @type=:set, @token=:intersection, @expressions=[
      #<Range @type=:set, @token=:range, @expressions=[
        #<Literal @type=:literal, @token=:literal, @text="a" >,
        #<Literal @type=:literal, @token=:literal, @text="c" >
      ]>,
      #<Literal @type=:literal, @token=:literal, @text="b" >
    ]>
  ]>
]>

The least complicated way to achieve this might be doing it purely in parser.rb, kind of the same way alternation expressions are handled.
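
As a rough sketch of that parser-side idea, the snippet below folds a flat list of set members into a single intersection node by splitting at each `&&` token, similar in spirit to how alternatives are collected around `|`. Everything here is an assumption for illustration: `SetNode` and `fold_intersections` are hypothetical names, not part of parser.rb, and wrapping multi-member sides in a `:sequence` node is just one possible choice.

```ruby
# Hypothetical node type for this sketch; not the regexp_parser API.
SetNode = Struct.new(:token, :text, :expressions)

# Split a flat list of set members at each :intersection token and wrap
# the resulting sides in one intersection node.
def fold_intersections(members)
  return members unless members.any? { |m| m.token == :intersection }

  sides = [[]]
  members.each do |member|
    if member.token == :intersection
      sides << []           # start collecting the next operand
    else
      sides.last << member  # member belongs to the current operand
    end
  end

  # Assumption: a side with one member is used directly; longer sides
  # are grouped under a made-up :sequence node.
  operands = sides.map do |side|
    side.size == 1 ? side.first : SetNode.new(:sequence, nil, side)
  end

  [SetNode.new(:intersection, '&&', operands)]
end

# Flat members as they might arrive for /[a-c&&b]/
flat = [
  SetNode.new(:range, 'a-c', []),
  SetNode.new(:intersection, '&&', []),
  SetNode.new(:literal, 'b', [])
]

folded = fold_intersections(flat)
p folded.first.expressions.map(&:token) # prints [:range, :literal]
```

With this shape, a consumer reads both operands from `expressions` instead of tracking the members before and after the `&&` token.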

jaynetics commented 6 years ago

Resolved by #55.