Closed by jaynetics 6 years ago
Thanks for this @janosch-x.
Character sets are the most poorly implemented part of the scanner. It barely satisfies the most basic of cases.
I like the approach you outlined. Removing `:subset` and treating `:set` members as sub-expressions is a great idea. It should simplify the scanner and its use.
I wonder if extracting the character set scanning logic into a sub-machine, like `properties.rl`, would make it easier to implement the changes. It might require breaking `scanner.rl` into smaller parts, which could be a good thing. I'd like to explore that a little, soon hopefully.
> I wonder if extracting the character set scanning logic into a sub-machine, like `properties.rl`, would make it easier to implement the changes.
That's what I thought, too. I might be able to help a bit with the whole thing, at least by setting up tests beforehand.
Another thing that just crossed my mind is that it could make sense to have an `Intersection` expression with subexpressions as well. Right now, you have to keep track of the preceding and succeeding set member yourself to find out what is being intersected. We could do something like this instead:
Regexp::Parser.parse(/[a-c&&b]/) # =>
#<Root @expressions=[
  #<CharacterSet @expressions=[
    #<Intersection @type=:set, @token=:intersection, @expressions=[
      #<Range @type=:set, @token=:range, @expressions=[
        #<Literal @type=:literal, @token=:literal, @text="a" >,
        #<Literal @type=:literal, @token=:literal, @text="c" >
      ]>,
      #<Literal @type=:literal, @token=:literal, @text="b" >
    ]>
  ]>
]>
The least complicated way to achieve this might be doing it purely in `parser.rb`, kind of the same way alternation expressions are handled.
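To make the parser-only idea a bit more concrete, here is a minimal sketch of grouping a flat member list around `&&`. It is not code from `parser.rb`: `Node`, `group_intersections` and the `:sequence` wrapper are hypothetical names used only for illustration, and how multi-member operands should be represented is an assumption.

```ruby
# Hypothetical stand-in for parsed set members; not a regexp_parser class.
Node = Struct.new(:token, :text, :expressions)

# Splits a flat member list on :intersection tokens and wraps the operands
# in one Intersection node, roughly how alternations collect their branches.
def group_intersections(members)
  return members unless members.any? { |m| m.token == :intersection }

  # Collect the members on each side of every && into separate operands.
  operands = [[]]
  members.each do |member|
    if member.token == :intersection
      operands << []
    else
      operands.last << member
    end
  end

  # Assumption: a single-member operand is used directly, while several
  # members are wrapped in an illustrative :sequence node.
  children = operands.map do |operand|
    operand.size == 1 ? operand.first : Node.new(:sequence, nil, operand)
  end

  [Node.new(:intersection, '&&', children)]
end

# Members of [a-c&&b] as they might arrive in flat form.
range = Node.new(:range, 'a-c', [Node.new(:literal, 'a', []),
                                 Node.new(:literal, 'c', [])])
amp   = Node.new(:intersection, '&&', [])
b     = Node.new(:literal, 'b', [])

tree = group_intersections([range, amp, b])
puts tree.first.expressions.map(&:token).inspect
```

With this shape, a consumer reads the intersected operands straight from the node's `expressions` instead of tracking the neighbours of `&&` itself.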
resolved by #55
Right now, handling the content of character sets with `regexp_parser` is hard:

- `Scanner` detects only a few ranges successfully, as detailed in issue #29.
- `Scanner` returns inconclusive information about member tokens because they all have the type `:set`. Issue #28 describes this for properties, but it also affects `\a`, `\e`, `\n`, `\t`, `\u`, `\v` and more.
- `Parser` then "throws away" even this limited information, as it only relays the `Token#text` to `Set#members`. (Re-running `Parser#parse` on individual `Set#members` is a poor workaround for this.)

What I have in mind as a general solution is the following:
- remove the `:subset` token type, leaving `#set_level` to differentiate between sets and subsets
- … `:set` token type only for tokens that are particular to sets (`[`, `^`, `&&`, `]` and ranges)
- … `:member`, `:member_hex`, `:range` and `:range_hex` tokens
- … `Set#members`, leaving `#expressions` to access members, ranges and subsets

Thus, parsing
`/a[bc-d]/` could yield something like …

The only tricky bit is rewiring the Ragel machines in the right way and catching all ranges. On the other hand, it would probably lead to less code, as special treatment is only needed for a few things within sets: the set tokens plus `.`, `\b`, and `[:...:]`, if I am not mistaken.

What do you think, @ammar?
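As a rough illustration of the token stream this proposal aims for, here is a toy scanner written in plain Ruby. It is not `scanner.rl` and does not use the `regexp_parser` API: `scan_sets` and the token names `:open`, `:negate` and `:close` are assumptions made for this sketch, and escapes, properties and POSIX classes are deliberately ignored. Only set-specific tokens get the type `:set`; ordinary members keep their own type, and a set level counter tells sets from nested sets.

```ruby
# Toy scanner, illustrative only. Emits [type, token, text, set_level] tuples.
def scan_sets(pattern)
  tokens = []
  level  = 0
  chars  = pattern.chars
  i      = 0

  while i < chars.size
    c = chars[i]

    if c == '['
      # Opening a set (or a nested set) raises the set level.
      level += 1
      tokens << [:set, :open, c, level]
      if chars[i + 1] == '^'
        tokens << [:set, :negate, '^', level]
        i += 1
      end
    elsif c == ']' && level > 0
      tokens << [:set, :close, c, level]
      level -= 1
    elsif level > 0 && c == '&' && chars[i + 1] == '&'
      tokens << [:set, :intersection, '&&', level]
      i += 1
    elsif level > 0 && chars[i + 1] == '-' && chars[i + 2] && chars[i + 2] != ']'
      # A member followed by "-" and another member forms a range.
      tokens << [:set, :range, chars[i, 3].join, level]
      i += 2
    else
      # Everything else keeps its ordinary type, inside or outside a set.
      tokens << [:literal, :literal, c, level]
    end

    i += 1
  end

  tokens
end

scan_sets('a[bc-d]').each { |token| p token }
```

Here the member `b` comes out as an ordinary literal at set level 1, so a consumer can tell what kind of member it is without re-parsing its text.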