ammar / regexp_parser

A regular expression parser library for Ruby
MIT License
143 stars 22 forks source link

Inconsistent scanning of properties within sets #28

Closed jaynetics closed 7 years ago

jaynetics commented 7 years ago

Outside of sets, Regexp::Scanner recognizes two types of nonproperty:

S = Regexp::Scanner
S.scan /\p{ascii}/    # => [[:property,    :ascii, "\\p{ascii}", 0, 9]]
S.scan /\p{^ascii}/   # => [[:nonproperty, :ascii, "\\p{^ascii}", 0, 10]]
S.scan /\P{ascii}/    # => [[:nonproperty, :ascii, "\\P{ascii}", 0, 9]]

Within sets, only one of them is recognized as nonproperty:

S.scan /[\p{ascii}]/  # => [..., [:set, :ascii, "\\p{ascii}", 1, 10], ...]]
S.scan /[\p{^ascii}]/ # => [..., [:nonproperty, :ascii, "\\p{^ascii}", 1, 11], ...]]
S.scan /[\P{ascii}]/  # => [..., [:set, :ascii, "\\P{ascii}", 1, 10], ...]]

And I guess you would actually see it as a bug that \p{^...} does not get the type :set like everything else in a set, is that right? That would be easy to fix, and I think it would make sense from the viewpoint of consistency to change that.

However, fixing that would still make it necessary to scan the data of properties in sets for /\\(P|p\{\^)/ to detect whether they are negative. That is different from how they are handled outside of sets and different from how classes are handled, as classes have their "polarity" encoded at index 1:

S.scan /[[:ascii:]]/  # => [..., [:set, :class_ascii, "[:ascii:]", 1, 10], ...]
S.scan /[[:^ascii:]]/ # => [..., [:set, :class_nonascii, "[:^ascii:]", 1, 10], ...]

IMHO the ideal solution would be if all the information that tokens generate outside of sets was also available when they occur in a set. I am aware that this would require a substantial refactoring and would not be backwards compatible. But it would greatly help acting on specific tokens (escapes are another example) wherever they occur, without having to scan the token data.

Any thoughts?

ammar commented 7 years ago

I do see the inconsistency as a bug.

Looks like the in_set global is not being set correctly and emitting the inconsistent type here. I expect the same is happening in the :subset case.

The solution you suggest makes sense, and is very tempting. Having the same characters emit the same tokens, at least within sets at first, can simplify certain applications. It's a different level of consistency.

I would like to assess the amount of work needed for that change and consider the backward compatibility implications for a bit.

In the meantime, fixing the inconsistent negated property type seems like a reasonable first step.

Thanks for submitting this issue!