Perl / perl5

šŸŖ The Perl programming language
https://dev.perl.org/perl5/
Other
1.92k stars 551 forks source link

regex: \c inside (?[]) causes panics and unexpected behavior #14934

Closed p5pRT closed 8 years ago

p5pRT commented 9 years ago

Migrated from rt.perl.org#126181 (status was 'resolved')

Searchable as RT126181$

p5pRT commented 9 years ago

From victor@drawall.cc

Created by @Grimy

* /(?[(\c]) / panics​: Read past end of '(?[ ])'.   It should report a syntax error.

* /(?[\c#])/\, /(?[\c[])/\, /(?[\c\])/ and /(?[\c]])/ all report a syntax error.   They should each match a single control character.

* /(?[(\c])/ matches a single "\c]".   It should report a syntax error.

* /(?[(\c]) ]\b/ behaves like /\c]b/.   It should report a syntax error.

* /(?[\c[]](])/ behaves like /\c[\]/.   It should report a syntax error.

* /(?[\c#] ])/ (literal newline inside) panics​: reg_node overrun trying to emit 0\, 171b5f4>=171b5f0   It should report a syntax error.

All of these bugs were found on the current blead (2d9b5f101563ac9fee41e6ca496f79db6222d2e3). All but the last one were also present in 5.20.2.

Perl Info ``` Flags: category=core severity=low Site configuration information for perl 5.23.4: Configured by grimy at Tue Sep 22 21:18:14 CEST 2015. Summary of my perl5 (revision 5 version 23 subversion 4) configuration: Commit id: 2d9b5f101563ac9fee41e6ca496f79db6222d2e3 Platform: osname=linux, osvers=4.0.7-2-arch, archname=x86_64-linux uname='linux localhost 4.0.7-2-arch #1 smp preempt tue jun 30 07:50:21 utc 2015 x86_64 gnulinux ' config_args='-ds -e -Dusedevel' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2', cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include' ccversion='', gccversion='5.1.0', gccosandvers='' intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='cc', ldflags =' -fstack-protector-strong -L/usr/local/lib' libpth=/usr/local/lib /usr/lib/gcc/x86_64-unknown-linux-gnu/5.1.0/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64 libs=-lpthread -lnsl -lnm -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat perllibs=-lpthread -lnsl -lnm -ldl -lm -lcrypt -lutil -lc libc=libc-2.21.so, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='2.21' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector-strong' @INC for perl 5.23.4: /usr/local/lib/perl5/site_perl/5.23.4/x86_64-linux /usr/local/lib/perl5/site_perl/5.23.4 /usr/local/lib/perl5/5.23.4/x86_64-linux /usr/local/lib/perl5/5.23.4 /usr/local/lib/perl5/site_perl . Environment for perl 5.23.4: HOME=/home/grimy LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/grimy/bin:/home/grimy/.nvim/scripts:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/plan9/bin PERL_BADLANG (unset) SHELL (unset) ```
p5pRT commented 9 years ago

From victor@drawall.cc

The attached patch works-for-meā„¢. Should I add new tests checking for this bug in another patch?

p5pRT commented 9 years ago

From @jkeenan

On Wed Sep 30 09​:15​:21 2015\, victor@​drawall.cc wrote​:

The attached patch works-for-meā„¢. Should I add new tests checking for this bug in another patch?

It appears that the patch did not get attached. Can you re-try?

Thank you very much.

-- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 9 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 9 years ago

From victor@drawall.cc

Sorry for the mess-up. Hereā€™s the patch again. Iā€™m also including it in the body of the email\, just in case the attachment fails again.

From 96cb469f3930a276be8cedac7d23e9a26899be08 Mon Sep 17 00​:00​:00 2001 From​: Victor Adam \victor@​drawall\.cc Date​: Sun\, 27 Sep 2015 10​:22​:08 +0200 Subject​: [PATCH] regex​: handle \cX inside (?[])

The \cX notation for control characters used to cause panics and unexpected behavior when used insed an extended character class. See bug #126181.

The solution is to ignore the byte following \c during the first parsing pass of a (?[]) construct.


AUTHORS | 1 + regcomp.c | 4 ++++ 2 files changed\, 5 insertions(+)

Inline Patch ```diff diff --git a/AUTHORS b/AUTHORS index 451c707..ebd9222 100644 --- a/AUTHORS +++ b/AUTHORS @@ -1222,6 +1222,7 @@ Unicode Consortium Vadim Konovalov Valeriy E. Ushakov Vernon Lyon +Victor Adam Victor Efimov Viktor Turskyi Ville SkyttƤ diff --git a/regcomp.c b/regcomp.c index 4f4bb44..a8e80ee 100644 --- a/regcomp.c +++ b/regcomp.c @@ -13454,6 +13454,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t ```

*pRExC_state, SV* return_invlist,   * default​: case next time and keep on incrementing until   * we find one of the invariants we do handle. */   RExC_parse++; + if (*RExC_parse == 'c') { + /* Skip the \cX notation for control characters \/ + RExC_parse++; + }   break;   case '['​:   { -- 2.4.5

p5pRT commented 9 years ago

From victor@drawall.cc

0001-regex-handle-cX-inside.patch ```diff From 96cb469f3930a276be8cedac7d23e9a26899be08 Mon Sep 17 00:00:00 2001 From: Victor Adam Date: Sun, 27 Sep 2015 10:22:08 +0200 Subject: [PATCH] regex: handle \cX inside (?[]) The \cX notation for control characters used to cause panics and unexpected behavior when used insed an extended character class. See bug #126181. The solution is to ignore the byte following \c during the first parsing pass of a (?[]) construct. --- AUTHORS | 1 + regcomp.c | 4 ++++ 2 files changed, 5 insertions(+) diff --git a/AUTHORS b/AUTHORS index 451c707..ebd9222 100644 --- a/AUTHORS +++ b/AUTHORS @@ -1222,6 +1222,7 @@ Unicode Consortium Vadim Konovalov Valeriy E. Ushakov Vernon Lyon +Victor Adam Victor Efimov Viktor Turskyi Ville SkyttƤ diff --git a/regcomp.c b/regcomp.c index 4f4bb44..a8e80ee 100644 --- a/regcomp.c +++ b/regcomp.c @@ -13454,6 +13454,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, * default: case next time and keep on incrementing until * we find one of the invariants we do handle. */ RExC_parse++; + if (*RExC_parse == 'c') { + /* Skip the \cX notation for control characters */ + RExC_parse++; + } break; case '[': { -- 2.4.5 ```
p5pRT commented 9 years ago

From @khwilliamson

On 09/30/2015 02​:21 AM\, Victor ADAM wrote​:

The attached patch works-for-meā„¢. Should I add new tests checking for this bug in another patch?

Yes\, please. The easiest place to put them is in t/re/re_tests. Traditionally\, people have just added them towards the end. It can handle things that don't compile. It would be nice if you included a test for every one in your ticket. There may be a problem in placing them there\, though\, in that there may be warnings about this being an experimental feature. Then it might be best to put them in t/re/regex_sets.t. I haven't looked.

Your patch looks good to me. I would split the author portion into its own patch so that in the unlikely event that the other one should ever be reverted\, your name will stay as a contributor. The tests can go in the code patch. Theoretically they should be in a separate patch marked as TODO and then changed to not-TODO in the code patch. But hardly anyone bothers to do that\, as the gain in my opinion is not worth the extra work\, and apparently others agree.

Karl Williamson

p5pRT commented 9 years ago

From victor@drawall.cc

As discussed\, Iā€™ve split my patch in two : 0001*.patch adds my name to the AUTHORS file\, while 0003*.patch is the actual code change. Iā€™m also attaching 0002*.patch that adds tests for RT #126177\, #126178\, #126179\, #126180\, #126181\, #126185\, #126186\, #126187\, #126222 and #126253.

p5pRT commented 9 years ago

From victor@drawall.cc

0001-AUTHORS-Add-myself.patch ```diff From 4c62e3b50838f5d7c1f5320046740d1f500b1737 Mon Sep 17 00:00:00 2001 From: Victor Adam Date: Fri, 2 Oct 2015 19:38:25 +0200 Subject: [PATCH 1/3] AUTHORS: Add myself --- AUTHORS | 1 + 1 file changed, 1 insertion(+) diff --git a/AUTHORS b/AUTHORS index 451c707..ebd9222 100644 --- a/AUTHORS +++ b/AUTHORS @@ -1222,6 +1222,7 @@ Unicode Consortium Vadim Konovalov Valeriy E. Ushakov Vernon Lyon +Victor Adam Victor Efimov Viktor Turskyi Ville SkyttƤ -- 2.5.3 ```
p5pRT commented 9 years ago

From victor@drawall.cc

0002-test-Add-regex-tests.patch ```diff From 648d5fc4a052dab62bdcc9f28af234b0ab163063 Mon Sep 17 00:00:00 2001 From: Victor Adam Date: Fri, 2 Oct 2015 19:35:19 +0200 Subject: [PATCH 2/3] test: Add regex tests This adds tests for RT tickets #126177, #126178, #126179, #126180, #126185, #126181, #126186, #126187 and #126222. --- t/re/re_tests | 45 +++++++++++++++++++++++++++++++++++++++++++++ t/re/regex_sets.t | 19 +++++++++++++++++++ t/re/regexp.t | 1 + 3 files changed, 65 insertions(+) diff --git a/t/re/re_tests b/t/re/re_tests index 9d5fa73..7d83ee1 100644 --- a/t/re/re_tests +++ b/t/re/re_tests @@ -1926,5 +1926,50 @@ A+(*PRUNE)BC(?{}) AAABC y $& AAABC /(a+){1}+a/ aaa n - - # [perl #125825] +# RT #126178: these should cause a syntax error +(?i Bc - Unmatched ( +(?^ Bc - Unmatched ( +(?- Bc - Unmatched ( +(?a-a Bc - Unmatched ( + +# RT #126177: (?n) should disable numbered captures +(?n)(abc) abc y -$1 - +'(?-n)(abc)'n abc y -$1 -abc + +# RT #126179: \N{} should cause an error (even when skipping the lexer) +[\N{}] Bc - Zero length \N{} + +# RT #126180: those segfault inside (?[]) +[\ &!] Bsc - Incomplete expression +[\ +!] Bsc - Incomplete expression +[\ -!] Bsc - Incomplete expression +[\ ^!] Bsc - Incomplete expression +[\ |!] Bsc - Incomplete expression + +# RT #126185: (?-p) should be an error +(?-p) Bc - Regexp modifier \"a\" may not appear after the \"-\" +'(?-p)'p Bc - Regexp modifier \"a\" may not appear after the \"-\" + +# RT #126186: (*ACCEPT:arg) should either be an error or set $REGMARK correctly +(*MARK:arg) y $REGMARK arg +(*ACCEPT:arg) By $REGMARK arg + +# RT #126187: these produce warnings (even when skipping the tests!) +# \p c - Can't find Unicode property TODO +# \p^ c - Can't find Unicode property TODO +# \P c - Can't find Unicode property TODO +# \P^ c - Can't find Unicode property TODO + +# RT #126222 +(?(?!)) By - - + +# RT #126253 +.{1}?? Bc - Nested quantifiers +.{1}?+ Bc - Nested quantifiers +.{1}+? c - Nested quantifiers +.{1}++ c - Nested quantifiers +.{1}*? c - Nested quantifiers +.{1}*+ c - Nested quantifiers + # Keep these lines at the end of the file # vim: softtabstop=0 noexpandtab diff --git a/t/re/regex_sets.t b/t/re/regex_sets.t index ee161b2..185c70f 100644 --- a/t/re/regex_sets.t +++ b/t/re/regex_sets.t @@ -138,6 +138,25 @@ if (! is_miniperl() && locales_enabled('LC_CTYPE')) { } } +# RT #126181: \cX behaves strangely inside (?[]) +{ + no warnings qw(syntax regexp); + + eval { $_ = '/(?[(\c]) /'; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[(\c]) / should not panic'); + eval { $_ = '(?[\c#]' . "\n])"; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[(\c]) / should not panic'); + eval { $_ = '(?[(\c])'; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[(\c])/ should be a syntax error'); + eval { $_ = '(?[(\c]) ]\b'; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[(\c]) ]\b/ should be a syntax error'); + eval { $_ = '(?[\c[]](])'; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[\c[]](])/ should be a syntax error'); + like("\c#", qr/(?[\c#])/, '\c# should match itself'); + like("\c[", qr/(?[\c[])/, '\c[ should match itself'); + like("\c\ ", qr/(?[\c\])/, '\c\ should match itself'); + like("\c]", qr/(?[\c]])/, '\c] should match itself'); +} done_testing(); diff --git a/t/re/regexp.t b/t/re/regexp.t index f27a027..cbac03f 100644 --- a/t/re/regexp.t +++ b/t/re/regexp.t @@ -99,6 +99,7 @@ use strict; use warnings FATAL=>"all"; use vars qw($bang $ffff $nulnul); # used by the tests use vars qw($qr $skip_amp $qr_embed $qr_embed_thr $regex_sets); # set by our callers +use vars qw($REGMARK); # set from tested patterns -- 2.5.3 ```
p5pRT commented 9 years ago

From victor@drawall.cc

0003-regex-handle-cX-inside.patch ```diff From afd2df3109b876d25120b34a47fc6d952d99380d Mon Sep 17 00:00:00 2001 From: Victor Adam Date: Sun, 27 Sep 2015 10:22:08 +0200 Subject: [PATCH 3/3] regex: handle \cX inside (?[]) The \cX notation for control characters used to cause panics and unexpected behavior when used insed an extended character class. See bug #126181. The solution is to ignore the byte following \c during the first parsing pass of a (?[]) construct. --- regcomp.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/regcomp.c b/regcomp.c index 4f4bb44..a8e80ee 100644 --- a/regcomp.c +++ b/regcomp.c @@ -13454,6 +13454,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, * default: case next time and keep on incrementing until * we find one of the invariants we do handle. */ RExC_parse++; + if (*RExC_parse == 'c') { + /* Skip the \cX notation for control characters */ + RExC_parse++; + } break; case '[': { -- 2.5.3 ```
p5pRT commented 9 years ago

From victor@drawall.cc

Thanks for the pointers! I already love re_tests\, it makes it really easy to add new tests.

Unfortunately\, it doesnā€™t look like I can test the last example using re_tests. A literal newline will be interpreted as a pattern separator\, and the panic doesnā€™t occur with a "\n".

Anyway\, Iā€™m working right now on adding tests for this and the other bugs I reported (#126141\, #126177\, #126178\, #126179\, #126180\, #126185\, #126186\, #126187\, #126222). I was thinking about making a single patch with all new tests\, and then separate code patches for each bug; is that a good idea?

p5pRT commented 9 years ago

From victor@drawall.cc

As discussed\, Iā€™ve split my patch in two : 0001*.patch adds my name to the AUTHORS file\, while 0003*.patch is the actual code change. Iā€™m also attaching 0002*.patch that adds tests for RT #126177\, #126178\, #126179\, #126180\, #126181\, #126185\, #126186\, #126187\, #126222 and #126253.

p5pRT commented 9 years ago

From @khwilliamson

On 10/03/2015 12​:25 AM\, Victor ADAM wrote​:

As discussed\, Iā€™ve split my patch in two : 0001*.patch adds my name to the AUTHORS file\, while 0003*.patch is the actual code change. Iā€™m also attaching 0002*.patch that adds tests for RT #126177\, #126178\, #126179\, #126180\, #126181\, #126185\, #126186\, #126187\, #126222 and #126253.

One of our rules is that individual patches can't break blead on a major platform. When I test your 2nd patch\, but not the third\, I get the following error​:

Syntax error in (?[...]) in regex m/(?[\c#])/ at re/regex_sets.t line 155. re/regex_sets.t ................................................... Dubious\, test returned 255 (wstat 65280\, 0xff00) No subtests run

I'm afraid this patch was problematic to apply\, as you had the misfortune to have someone else be modifying re_tests\, so that your patch no longer works as-is because the context surrounding it has changed. This happens occasionally\, and it's unfortunate it happened to you on the first try. Before submitting again\, you should rebase your patch to the latest blead. You'll get conflicts that you need to resolve\, but this shouldn't be too hard. If you need help the #p5p irc channel will have people who can guide you.

And thanks for doing all this.

p5pRT commented 8 years ago

From victor@drawall.cc

Sorry for the long delay. Iā€™ve rebased my work on top of the current blead. I also merged the test and code commit into a single one. All tests are successful. The resulting patches are attached.

p5pRT commented 8 years ago

From victor@drawall.cc

0001-AUTHORS-Add-myself.patch ```diff From 969a278ddf8f298c12df76ace52ee5af27814cc6 Mon Sep 17 00:00:00 2001 From: Victor Adam Date: Fri, 2 Oct 2015 19:38:25 +0200 Subject: [PATCH 1/2] AUTHORS: Add myself --- AUTHORS | 1 + 1 file changed, 1 insertion(+) diff --git a/AUTHORS b/AUTHORS index 451c707..ebd9222 100644 --- a/AUTHORS +++ b/AUTHORS @@ -1222,6 +1222,7 @@ Unicode Consortium Vadim Konovalov Valeriy E. Ushakov Vernon Lyon +Victor Adam Victor Efimov Viktor Turskyi Ville SkyttƤ -- 2.5.3 ```
p5pRT commented 8 years ago

From victor@drawall.cc

0002-regex-handle-cX-inside.patch ```diff From 4e40d5385b4ac0dd89f159ef411f90068b39b3b4 Mon Sep 17 00:00:00 2001 From: Victor Adam Date: Sun, 27 Sep 2015 10:22:08 +0200 Subject: [PATCH 2/2] regex: handle \cX inside (?[]) The \cX notation for control characters used to cause panics and unexpected behavior when used insed an extended character class. See bug #126181. The solution is to ignore the byte following \c during the first parsing pass of a (?[]) construct. --- regcomp.c | 4 ++++ t/re/regex_sets.t | 19 +++++++++++++++++++ 2 files changed, 23 insertions(+) diff --git a/regcomp.c b/regcomp.c index 540f71c..beec98d 100644 --- a/regcomp.c +++ b/regcomp.c @@ -13444,6 +13444,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist, * default: case next time and keep on incrementing until * we find one of the invariants we do handle. */ RExC_parse++; + if (*RExC_parse == 'c') { + /* Skip the \cX notation for control characters */ + RExC_parse++; + } break; case '[': { diff --git a/t/re/regex_sets.t b/t/re/regex_sets.t index 5c683ba..a5941ba 100644 --- a/t/re/regex_sets.t +++ b/t/re/regex_sets.t @@ -139,6 +139,25 @@ if (! is_miniperl() && locales_enabled('LC_CTYPE')) { } } +# RT #126181: \cX behaves strangely inside (?[]) +{ + no warnings qw(syntax regexp); + + eval { $_ = '/(?[(\c]) /'; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[(\c]) / should not panic'); + eval { $_ = '(?[\c#]' . "\n])"; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[(\c]) / should not panic'); + eval { $_ = '(?[(\c])'; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[(\c])/ should be a syntax error'); + eval { $_ = '(?[(\c]) ]\b'; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[(\c]) ]\b/ should be a syntax error'); + eval { $_ = '(?[\c[]](])'; qr/$_/ }; + like($@, qr/^Syntax error/, '/(?[\c[]](])/ should be a syntax error'); + like("\c#", qr/(?[\c#])/, '\c# should match itself'); + like("\c[", qr/(?[\c[])/, '\c[ should match itself'); + like("\c\ ", qr/(?[\c\])/, '\c\ should match itself'); + like("\c]", qr/(?[\c]])/, '\c] should match itself'); +} done_testing(); -- 2.5.3 ```
p5pRT commented 8 years ago

From @khwilliamson

Thanks\, applied as 4a84d6e89d52b8921090805085871e6cca66924d -- Karl Williamson

p5pRT commented 8 years ago

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT commented 8 years ago

@mauke - Status changed from 'pending release' to 'resolved'