erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.36k stars 2.95k forks source link

ERL-31: leex ignores {m,n} and {m} repetition quantifiers #3053

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 9 years ago

Original reporter: sunaku Affected version: OTP-18.1 Component: parsetools Migrated from: https://bugs.erlang.org/browse/ERL-31


Hello,

I'm using Erlang/OTP 18.1 ([erts-7.1] [source] [64-bit] [smp:24:24]
[async-threads:10] [hipe] [kernel-poll:false]) where I find that leex
does not honor extended regexp numerical repetition quantifiers [1].

For example, I'm trying to recognize exactly 11, 10, or 7 uppercase
alphanumeric characters as a product_id in a leex rule, as follows:
{code:erlang}
[A-Z0-9]{10,11}|[A-Z0-9]{7}
           : {token, {product_id, TokenLine, TokenChars}}.
{code}
However, that didn't work, so I tried using AWK syntax [2] exactly:
{code:erlang}
[A-Z0-9]\{10,11\}|[A-Z0-9]\{7\}
           : {token, {product_id, TokenLine, TokenChars}}.
{code}
However, that didn't work either, so I was forced to expand it out:
{code:erlang}
[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]|[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]|[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]
           : {token, {product_id, TokenLine, TokenChars}}.
{code}
This is a giant step backwards in terms of regexp readability and
maintainability, so please make leex honor the extended regexp
numerical repetition quantifiers [1] since AWK supports them [2].

Furthermore, it seems that this work had already been started [3]
but, for reasons unknown to me, was commented-out in the code [4].

Thanks for your consideration

[1] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_06
[2] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_04
[3] https://github.com/erlang/otp/blob/maint/lib/parsetools/src/leex.erl#L138
[4] https://github.com/erlang/otp/blob/maint/lib/parsetools/src/leex.erl#L692-L700
OTP-Maintainer commented 9 years ago

jlouis said:

You can probably handle this by simply recognizing [A-Z0-9]+ and then handle it in a function:
{code:erlang}
[A-Z0-9]+ : recognize_length(TokenLine, TokenChars)

recognize_length(Line, Chs) when length(Chs) >= 10 andalso length(Chs) =< 11 ->
    {token, {product_id, Line, Chs}};
recognize_length(Line, Chs) when length(Chs) == 7 ->
    {token, {product_id, Line, Chs}};
recognize_length(_, _) -> {error, blargh}.
{code}
I'm not really sure why you want to extend leex with extensions to its regex engines. It isn't that common in lexers to provide these kinds of extensions, based on my knowledge of them over the years.
OTP-Maintainer commented 9 years ago

sunaku said:

Thanks for suggesting a workaround.  However, in my humble opinion, the workaround feels laborious in comparison to just specifying numerical repetition quantifiers.  In addition, I feel it may become more complicated when having to workaround more complex regular expressions that nest numerical quantifiers among greedy ones, like this: 
{noformat}
(([[:alnum:]]{10,11}\.){3,5})+
{noformat}

Since you considered numerical repetition quantifiers as an "extension" to leex's regex engine, I looked into [AWK's regex specification](http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_06) again more closely and found that numerical repetition quantifiers (also known as "BREs Matching Multiple Characters") are part of AWK's general (or _basic_) regular expression syntax (and not the extended syntax as I had previously assumed --- I was wrong about that; sorry for the confusion).

As a result, this is all the more reason for leex to support numerical repetition quantifiers (or "BREs Matching Multiple Characters") because they're part of AWK's basic regular expressions.

Thanks for your consideration.
OTP-Maintainer commented 9 years ago

jlouis said:

I'm calling "isn't really needed". Usually, you would just tokenize your product_id blindly and then handle the verification at a later stage, either in the parser or even later in the stages. It is also this reason there has been no pressing need for such an extension. The {m} and {m,n} repetitions are fairly easy to implement by unrolling the DFA construction the right number of times, but I'm sure it would be that useful.
OTP-Maintainer commented 9 years ago

sunaku said:

Alright, what about the fact that this "extension" seems to have been already implemented [1] but partially commented-out [2] in the codebase?

Would you be open to accepting a patch / pull request if I restored this "extension" to proper working order?

Thanks for your consideration.

[1] https://github.com/erlang/otp/blob/maint/lib/parsetools/src/leex.erl#L138
[2] https://github.com/erlang/otp/blob/maint/lib/parsetools/src/leex.erl#L692-L700
OTP-Maintainer commented 9 years ago

kenneth said:

Changed type from Bug to New feature since we don't think this is a bug. The leex documentation does not mention anything about support for repetition quantifiers.
OTP-Maintainer commented 8 years ago

rvirding said:

A major problem with this is the \{M\} pair is already in use for macro expansion. How would this be handled in that case? Doing a \ \{ \ \} would not work as this is needed now to quote the \{ and \} .

Also it would break consistency if some control characters apply with \ and some don't.

So I am against this suggestion.
OTP-Maintainer commented 8 years ago

kenneth said:

As Robert Virding is the original developer of leex we follow his advice and reject this suggestion.
Ther reasons are then: 1) Would cause backward compatibility problems,
2) Can't see that it is important enough to motivate 1.
OTP-Maintainer commented 8 years ago

kenneth said:

See my last comment before this one.