delph-in / erg

English Resource Grammar
MIT License
17 stars 3 forks source link

Update REPP patterns for nested sets #17

Closed goodmami closed 1 year ago

goodmami commented 5 years ago

All unescaped bracket characters in REPP regexes should be escaped to avoid ambiguity in future versions of regex libraries.

More info:

There are potential changes to how character sets in regex are defined that would allow for nested sets and set operations (see https://unicode.org/reports/tr18/#Subtraction_and_Intersection). For example, if you want the letters 'a' through 'z' but without 'q', the following are equivalent (as I understand):

[a-pr-z]
[[a-z]--[q]]
[a-z--q]

...where -- is the set-difference operator.

The middle pattern uses nested sets, but in order for regex libraries to accommodate these patterns, the old behavior of allowing unescaped brackets in character classes needs to change. An example of the current behavior is the following patterns for defining a character class of the bracket characters [ and ]:

[][]
[[\]]
[\[\]]

Currently we have some unescaped brackets in REPP files like rpp/wiki.rpp:

!\[\[(?:[^[|\]]+\|)?([^[|\]]+)\]\]
          ^here        ^and here

Recent versions of Python raises warnings when encountering unescaped brackets in character classes, so PyDelphin's REPP processor throws a lot of warnings (as seen in @arademaker's output here: delph-in/pydelphin#250). I'm not sure if the regex libraries for Lisp (LKB), C (ACE), C++ (PET), or C# (agree) will similarly see a change in behavior, but the defensive thing to do is to escape all brackets in character classes.

arademaker commented 5 years ago

In Lisp, the most well-know regex library is cl-ppcre (used by LKB), for it, the two variations of your example are consider equivalent:

CL-USER> (cl-ppcre:parse-string "!\\[\\[(?:[^[|\\]]+\\|)?([^[|\\]]+)\\]\\]")
(:SEQUENCE "![["
 (:GREEDY-REPETITION 0 1
  (:GROUP
   (:SEQUENCE (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\[ #\| #\]))
    #\|)))
 (:REGISTER (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\[ #\| #\]))) "]]")
CL-USER> (cl-ppcre:parse-string "!\\[\\[(?:[^\\[|\\]]+\\|)?([^\\[|\\]]+)\\]\\]")
(:SEQUENCE "![["
 (:GREEDY-REPETITION 0 1
  (:GROUP
   (:SEQUENCE (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\[ #\| #\]))
    #\|)))
 (:REGISTER (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\[ #\| #\]))) "]]")
CL-USER> (equal * **)
T
goodmami commented 5 years ago

@arademaker good, this lends support to always escaping brackets in character classes. thanks for confirming

goodmami commented 4 years ago

This issue is becoming more pressing for me, as I cannot get the same output as Bec's REPP implementation with PyDelphin. If I use Python's standard re module, group-local inline flags apply globally, and if I use the 3rd party regex library it fails to compile patterns with unescaped brackets in character classes (which are nested sets).

I attach a patch that escapes all unescaped brackets in the REPP modules loaded by pet/repp.set.

repp-charclass-fix.patch.txt

(GitHub won't accept a .patch extension so I added .txt, but this is the output of svn diff)

edit: I made PyDelphin fall back to the old style regular expressions if unescaped brackets probably caused an error. As long as a rule doesn't use both unescaped brackets and advanced features like inline flags, it should be able to handle the rules. Nevertheless, I'd like to fix the unescaped brackets in the ERG, too.

oepen commented 4 years ago

i would like to quibble with your use of 'fix' in the above, @goodmami: i would like to think that the 'REPP specification' says that it assumes the specific variety of perl-compatible regular expressions. so, the relevant specification is the perl documentation: [][] is a valid perl-compatible character class (matching square brackets) and, in principle, a compliant REPP implementation should support that.

in your case, it sounds as if python does not make available a perl-compliant regular expression engine (possibly because it anticipates a future extension of the language). one might consider that a python deficiency, hence it does not seem unreasonable to me (in principle at least) that you need to work around that python limitation in your REPP implementation.

personally (because regular expressions are hard to write, and even harder to read later on), i have adopted certain stylistic conventions, which include avoidance of unnecessary (redundant) clutter. you are proposing that the ERG rewrite the above as the equivalent [\[\]]—i would likely find that at least a bit confusing, if and when i look at the ERG regular expressions sometime down the road.

this is not a big issue, but i am pushing the point as a matter of principle. seeing as you have a functional work-around in pyDelphin now, could we stick to perl-compatible regular expression for REPP, without a rather nitty-gritty caveat regarding square brackets inside of bracketed character classes?

goodmami commented 4 years ago

Thanks for the feedback! Some things...

so, the relevant specification is the perl documentation

Except that PCRE is a C library inspired by Perl. They are in fact different, although mostly the same. Even libraries that claim to support PCRE do not always fully support it; CL-PPCRE, for instance, does not support recursion.

Also, further down in that same Perl documentation file, see that character set operations are an experimental feature. Unfortunately the syntax is slightly different from the Unicode Technical Standard report.

one might consider that a python deficiency

While Perl was a leader in defining a pattern-matching sublanguage inspired by formal regular expressions (which take on the name even if they are no longer regular), I don't think that Perl's syntax is the platonic ideal that all other implementation should strive for, and Python seems to have made different design decisions. I don't know that it's fair to call it a deficiency. I admit that occasionally there are features in PCRE that I wish Python's re module had, such as unicode properties. Unescaped brackets in character classes are not a feature I particularly cherish.

you need to work around that python limitation in your REPP implementation

I did. I use the 3rd-party regex module if available, which is much closer to PCRE than the standard re module. However, in its most featureful mode, the unescaped brackets in character classes generate errors, not just warnings.

[\[\]]—i would likely find that at least a bit confusing

Personally I find that the original [][] requires more explaining ("] terminates a character class unless it is the first character in the class...") and that the escaped form is clearer, even if it requires an extra character or two. But this is subjective. Besides, you can always write a comment if you think you'll forget.

could we stick to perl-compatible regular expression for REPP

Yes, but it would be prudent to be somewhat conservative in the PCRE features we make use of, given that not all implementation (even supposedly PCRE ones) equally implement all features. The escaped brackets inside character classes are indeed valid in PCRE, and furthermore they do not cause some (either "deficient" or "forward-thinking") engines to raise warnings or errors. Therefore the choice is:

  1. Save two characters and impede the participation of some engines
  2. Use two more characters and be inclusive of others

Lastly, note that this is an issue of the ERG and not REPP in general. I'm not proposing that we change the REPP specification to disallow some regex features, I'm suggesting that the ERG make a small, semantically-equivalent change so as to be more compatible with different implementations.

fcbond commented 4 years ago

G'day,

I think in this case (and more generally), if a small change in how we write regular expressions makes it easier for the grammar to be used in many tools, then the extra cognitive load of writing [[]]] rather than [[]] seems worth it to me. Especially as there seems to be some variation in how relatively standard libraries handle these.

On Thu, Jul 16, 2020 at 2:01 AM Michael Wayne Goodman < notifications@github.com> wrote:

Thanks for the feedback! Some things...

so, the relevant specification is the perl documentation

Except that PCRE is a C library inspired by Perl. They are in fact different https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions#Differences_from_Perl, although mostly the same. Even libraries that claim to support PCRE do not always fully support it; CL-PPCRE, for instance, does not support recursion https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines#Language_features .

Also, further down in that same Perl documentation file, see that character set operations are an experimental feature https://perldoc.perl.org/perlrecharclass.html#Extended-Bracketed-Character-Classes. Unfortunately the syntax is slightly different from the Unicode Technical Standard report.

one might consider that a python deficiency

While Perl was a leader in defining a pattern-matching sublanguage inspired by formal regular expressions (which take on the name even if they are no longer regular), I don't think that Perl's syntax is the platonic ideal that all other implementation should strive for, and Python seems to have made different design decisions. I don't know that it's fair to call it a deficiency. I admit that occasionally there are features in PCRE that I wish Python's re module had, such as unicode properties. Unescaped brackets in character classes are not a feature I particularly cherish.

you need to work around that python limitation in your REPP implementation

I did. I use the 3rd-party regex https://pypi.org/project/regex/ module if available, which is much closer to PCRE than the standard re module. However, in its most featureful mode, the unescaped brackets in character classes generate errors, not just warnings.

[[]]—i would likely find that at least a bit confusing

Personally I find that the original [][] requires more explaining ("] terminates a character class unless it is the first character in the class...") and that the escaped form is clearer, even if it requires an extra character or two. But this is subjective. Besides, you can always write a comment if you think you'll forget.

could we stick to perl-compatible regular expression for REPP

Yes, but it would be prudent to be somewhat conservative in the PCRE features we make use of, given that not all implementation (even supposedly PCRE ones) equally implement all features. The escaped brackets inside character classes are indeed valid in PCRE, and furthermore they do not cause some (either "deficient" or "forward-thinking") engines to raise warnings or errors. Therefore the choice is:

  1. Save two characters and impede the participation of some engines
  2. Use two more characters and be inclusive of others

Lastly, note that this is an issue of the ERG and not REPP in general. I'm not proposing that we change the REPP specification to disallow some regex features, I'm suggesting that the ERG make a small, semantically-equivalent change so as to be more compatible with different implementations.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/delph-in/erg/issues/17#issuecomment-658913130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIPZRWNSEMQZAN2HCHOMWLR3XUURANCNFSM4IMDLEKA .

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

oepen commented 4 years ago

belatedly, @goodmami: i apologize for suggesting that Python was deficient in its support for regular expressions! i should have said that its built-in RE module and the replacement REGEX module may impose limitations on building a fully compliant REPP interpreter in Python. incidentally, would the Python bindings for PCRE be an alternative for pyDelphin?

what i wanted to clarify is that the REPP regular expression syntax is defined through compatibility with Perl, so users should be able to turn to the perlre man pages as the canonical reference. for this reason, i am trying to minimize the range of additional constraints that different implementations impose on the REPP language. i have now tried to clarify that space on what i consider the official documentation, the REPP wiki page. i hope that clarification makes sense to you? i plan on adding escaping of those square brackets to the ERG rules in the current revision cycle.

my quibble in the above was in response to this issue suggesting that escaping should be applied to all square brackets, which would constitute a fix of the ERG. i believe we have now agreed that the proposed revision rather is an accommodation of the REPP implementation in pyDelphin, which is a subtle but important distinction (to me at least).

in general, i have at times felt we could be more careful in the DELPH-IN community about telling others how to go about their work, including aesthetic judgements and stylistic variation or conventions. in the above, i invoked my own cognitive load and coding conventions in maintaining the REPP rules in the ERG (which feel like an intricate space to me), to explain why i did not just embrace the proposed change.

goodmami commented 4 years ago

apology accepted :) likewise, I'm sorry if my response was too defensive.

incidentally, would the Python bindings for PCRE be an alternative for pyDelphin?

A good suggestion, but unfortunately that project posted a couple releases in 2015 and nothing before or since. I'd say it's dead. Worse, issues on its GitHub page suggest that it has hard crashes for a number of users.

what i wanted to clarify is that the REPP regular expression syntax is defined through compatibility with Perl, so users should be able to turn to the perlre man pages as the canonical reference. for this reason, i am trying to minimize the range of additional constraints that different implementations impose on the REPP language.

Right. I wasn't contesting the specification of REPP's regex flavor so much as advocating for the ERG to try and use a more portable subset of regex features where possible. If we embrace the most recent, most featureful regex implementation, it's too high a bar even for many "PCRE-compatible" implementations.

And secondly I was pointing out that Perl and PCRE are not always the same. If Perl's documentation is the reference for everything possible in REPP, then nothing but Perl can satisfy the requirement (e.g., consider the (?{...}) construct that evaluates Perl code inside the match). PCRE also has features that Perl lacks, such as \Q...\E quoted substrings. Here is some more information: http://www.pcre.org/current/doc/html/pcre2compat.html. With this in mind, I think the ReppTop wiki's current specification of Perl regexes with advice to avoid potentially incompatible patterns is on the right track.

i have now tried to clarify that space on what i consider the official documentation, the REPP wiki page. i hope that clarification makes sense to you? i plan on adding escaping of those square brackets to the ERG rules in the current revision cycle.

Yes, thanks, these clarifications are an improvement. And I now (perhaps grudgingly) accept your rationale for the \p{...} patterns. To be honest, I initially did not see their purpose (as explained in the email thread) except (withheld from the email) as a shibboleth to exclude non-PCRE implementations, but your quote from the Perl docs was enlightening. Even if \w and friends are fully unicode-aware (not locale-dependent nor ASCII) in Python 3 by default, I understand that other engines might not work the same. This means that PyDelphin users working with the ERG will need to get the 3rd-party regex library, so I appreciate your willingness to budge on the [][] issue. I'd say this is a fair compromise :)

my quibble in the above was in response to this issue suggesting that escaping should be applied to all square brackets, which would constitute a fix of the ERG. i believe we have now agreed that the proposed revision rather is an accommodation of the REPP implementation in pyDelphin, which is a subtle but important distinction (to me at least).

Yes, I now see how my use of 'fix' could be misinterpreted. I intended it as fixing the problem I was experiencing. The ERG wasn't broken according to our specs, but it could be more portable, or accommodating, to different DELPH-IN tools.

in general, i have at times felt we could be more careful in the DELPH-IN community about telling others how to go about their work, including aesthetic judgements and stylistic variation or conventions. in the above, i invoked my own cognitive load and coding conventions in maintaining the REPP rules in the ERG (which feel like an intricate space to me), to explain why i did not just embrace the proposed change.

Understood. I, too, have been unforgiving at times in my recent quests to bring precision and resolution to our "informalisms". I hope in the process I have not been insensitive to others' hard work.

danflick commented 3 years ago

Okay, I have made the changes in trunk to the ERG's rpp files identified in your repp-charclass-fix.patch.txt above, @goodmami, with thanks for providing the list. Please check once the changes are checked after today.

goodmami commented 3 years ago

@danflick I have tested REPP files in the ERG's trunk branch, and I no longer get the warnings/errors about unescaped brackets (aside: it also no longer raises an error about the reuse of the #1...\1 internal group in tokenizer.rpp). Now PyDelphin can use the ERG's REPP modules without modification:

$ delphin repp -ftriple -c ~/grammars/erg-trunk/pet/repp.set <<< "Abrams didn't say, \"hi\""
(0, 6, Abrams)
(7, 10, did)
(10, 13, n’t)
(14, 17, say)
(17, 18, ,)
(19, 20, “)
(20, 22, hi)
(22, 23, ”)

Thanks!