Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.95k stars 554 forks source link

segfault on big regexp #2362

Closed p5pRT closed 17 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#3689 (status was 'resolved')

Searchable as RT3689$

p5pRT commented 24 years ago

From @simoncozens

Created by simon@brecon.co.uk

Tested on 5.6.x and 5.7.x\, and bleadperl.

This code should test if a file is completely encoded in Japanese EUC​:

undef $/; $_ = \<>; s/[\x00-\x7F]//g; print length $_; print "\n\n"; # From the cookbook​: /^(?​:\x8E[\xA0-\xDF]|\x8F[\xA1-\xFE]{2}|[\xA1-\xFE]{2})+$/;

When $_ is roughly 21k or more\, it segfaults. The backtrace shows that the regular expression engine has gone catatonic​:

(gdb) bt #0 0x80cfa5d in S_regmatch () #1 0x80d1b16 in S_regmatch () #2 0x80d198c in S_regmatch () #3 0x80d282b in S_regmatch () #4 0x80d1b16 in S_regmatch () #5 0x80d198c in S_regmatch () #6 0x80d282b in S_regmatch () #7 0x80d1b16 in S_regmatch () #8 0x80d198c in S_regmatch () ... (lots) ... #32084 0x80d1688 in S_regmatch () #32085 0x80d15ba in S_regmatch () #32086 0x80cfa16 in S_regtry () #32087 0x80cef2e in Perl_regexec_flags ()

Perl Info ``` Flags: category=core severity=medium Site configuration information for perl v5.6.0: Configured by root at Thu Apr 13 08:15:45 JST 2000. Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration: Platform: osname=linux, osvers=2.3.99-pre3, archname=i686-linux uname='linux othersideofthe.earth.li 2.3.99-pre3 #3 smp sat mar 25 23:54:38 jst 2000 i686 unknown ' config_args='-d' hint=recommended, useposix=true, d_sigaction=define usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef useperlio=undef d_sfio=undef uselargefiles=define use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef Compiler: cc='cc', optimize='-O2', gccversion=2.95.2 20000313 (Debian GNU/Linux) cppflags='-fno-strict-aliasing -I/usr/local/include' ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64' stdchar='char', d_stdstdio=define, usevfork=false intsize=4, longsize=4, ptrsize=4, doublesize=8 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, usemymalloc=n, prototype=define Linker and Libraries: ld='cc', ldflags =' -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt libc=/lib/libc-2.1.3.so, so=so, useshrplib=false, libperl=libperl.a Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic' cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib' Locally applied patches: @INC for perl v5.6.0: /usr/local/lib/perl5/5.6.0/i686-linux /usr/local/lib/perl5/5.6.0 /usr/local/lib/perl5/site_perl/5.6.0/i686-linux /usr/local/lib/perl5/site_perl/5.6.0 /usr/local/lib/perl5/site_perl/5.5.670 /usr/local/lib/perl5/site_perl . Environment for perl v5.6.0: HOME=/home/simon LANG (unset) LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/simon/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/rhs/bin:/usr/local/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/kde/bin:/home/simon/bin PERL_BADLANG (unset) SHELL=/usr/local/bin/zsh ```
p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

At 08​:27 +0000 2000-08-15\, Simon Cozens wrote​:

When $_ is roughly 21k or more\, it segfaults. The backtrace shows that the regular expression engine has gone catatonic​:

Your stack probably blew due to heavy (read\, massive) recursion in the regex engine\, a known bug. (The regex engine can detect and avoid some potentially-stack-blowing situations\, but there are others that blind-side it.) You can check this by doing whatever it takes to persuade Linux to give the process a bigger stack\, and seeing if that lets you get your string beyond 21k.

(Please take any follow-up to the perlbug list​: I'm off on holiday.)

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

|> This code should test if a file is completely encoded in Japanese EUC​: |> |> undef $/; $_ = \<>; |> s/[\x00-\x7F]//g; |> print length $_; print "\n\n"; |> # From the cookbook​: |> /^(?​:\x8E[\xA0-\xDF]|\x8F[\xA1-\xFE]{2}|[\xA1-\xFE]{2})+$/; |> |> When $_ is roughly 21k or more\, it segfaults. The backtrace shows |> that the regular expression engine has gone catatonic​:

As Dominic pointed out\, it's your stack. When I run it with no system stack limit\, I get the error​:

  Complex regular subexpression recursion limit (32766) exceeded

Of course\, the regex is perfectly reasonable if you're thinking about "better" or "worse" alternatives\, but if you're thinking about backtracking\, you know that it'll have to save at least one state at every character\, due to the (...)*\, and more on the characters that are matched by the first two alternates\, due to other alternatives being still available for those matches.

However\, if you replace the (?​:...) with (?>...)\, you get rid of the need to save the alternative-related states (which isn't too much help if your document is 30k or above\, or larger than your stack.)

So\, you could try flushing your stack every so often\, with something like

  m{^(?>   (?> ...alternates... ){1\,15000}   )* $   }x;

This will let your stack get up to 15\,000 states and will then flush it. (Actually\, you'll end up with one state for each 15\,000\, so if you might still have troubles if your document is absolutely huge).

BTW\, if anyone thinks it's totally wrong that the user needs to care about these very implementation details\, I certainly agree with you. But theory does little in the face of the reality of the situation\, and if Simon wants to get the job done using Perl and regular expressions\, this kind of knowledge is required. Of course\, maybe regular expressions aren't the best hammer for this nail\, and perhaps even Perl is not suited for it. C'est la vie. Maybe try COBOL? (Actually\, since a DFA would have no troubles with this\, you might try lex.)

Here's another way that is less efficient\, but has no stack troubles​:

  1 while m{\G(?>[\x00-\x7F]|\x8E[\xA0-\xDF]|\x8F[\xA1-\xFE][\xA1-\xFE]|[\xA1-\xFE][\xA1-\xFE])/gc;

  print "all EUC\n" if pos() == length($_);

If you want to tempt fate slightly\, but increase efficiency\, you can throw a + after the [\x00-\x7F]\, which will suck up all ASCII that might happen to be in a row. This uses stack\, but won't trigger the recursion limit "error" that we saw above. It makes it about 3x faster in my simple tests.

  Jeffrey


Jeffrey Friedl \jfriedl@&#8203;yahoo\-inc\.com Yahoo! Finance http​://finance.yahoo.com O'Reilly's Regular Expression book​: http​://public.yahoo.com/~jfriedl/regex/

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Jeffrey Friedl (lists.p5p)​:

BTW\, if anyone thinks it's totally wrong that the user needs to care about these very implementation details\, I certainly agree with you. But theory does little in the face of the reality of the situation\, and if Simon wants to get the job done using Perl and regular expressions\, this kind of knowledge is required.

No\, no\, no\, no\, no.

You have the knowledge and so you know why this segfaults and how to avoid it.

Joe User doesn't care *why* it segfaults. It *SHOULDN'T* segfault. He therefore shouldn't need to avoid it. The regex looks reasonable because it's easily the most intuitive way to understand the problem and Perl shouuld support that most intuitive way. And I really don't care *how* Perl does that\, because I shouldn't have to.

The bug is in the code\, not the understanding; leave the understanding as it is and fix the code.

I hope I'm being sufficiently unreasonable. :)

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

|> >BTW\, if anyone thinks it's totally wrong that the user needs to care about |> >these very implementation details\, I certainly agree with you. But theory |> >does little in the face of the reality of the situation\, and if Simon wants |> >to get the job done using Perl and regular expressions\, this kind of |> >knowledge is required. |> |> No\, no\, no\, no\, no. |> |> You have the knowledge and so you know why this segfaults and how to |> avoid it.

Exactly. Lucky me. Too bad for Joe User\, who (unless they read my book) doesn't know this\, thinks Perl is a piece of crap\, and goes off to bang his head against Visual Basic. /-​:

|> Joe User doesn't care *why* it segfaults. It *SHOULDN'T* segfault.

And we shouldn't be charged for government services based upon a totally irrelevent thing like\, say\, how much we earn\, but the reality is that's how it works.

I'm all for making the underlying implementation irrelevent as far as the User is concerned\, but until that happens (and I think it will never happen completely)\, I feel that knowing how do deal with it is a useful skill. There's nothing about the kludgy-but-working solution that I posted that will be broken by a different underlying regular-expression implementation. The key point is that it works today.

Maybe you're thinking that since this is p5p we[*] shouldn't feel hamstrung by today's reality\, but rather try to design for the future. I agree that we should plan on solving the problems we have today\, but the reality is that it will take a lot of work to solve these problems\, and it won't likely get done for a long time. So how long do we wait before we cave in to reality?

[*] I use the word "we" very loosly\, since I've done nothing to deserve   the honor of being considered a Perl Porter\, except perhaps the patch   I sent Larry circa 1992 about doubles needing to be 8-byte aligned   on some architectures.

|> The bug is in the code\, not the understanding; leave the understanding |> as it is and fix the code.

That would be wonderful! Send patches to Jarkko. :-)


Note that (I think) this is an entirely unrelated thread to the better/worse vs. backtracking thread from several days ago. That was about how much of the current semantics of Perl's regex language we can change out from under the user. This one is merely on pragmatics\, I think.

  Jeffrey


Jeffrey Friedl \jfriedl@&#8203;yahoo\-inc\.com Yahoo! Finance http​://finance.yahoo.com

p5pRT commented 24 years ago

From [Unknown Contact. See original ticket]

Late the other night\, I wrote​: |> |> You have the knowledge and so you know why this segfaults and how to |> |> avoid it. |> |> Exactly. Lucky me. Too bad for Joe User\, who (unless they read my book) |> doesn't know this\, thinks Perl is a piece of crap\, and goes off to bang |> his head against Visual Basic. /-​:

Reading this in the light of the next day\, I think I come across as arrogant in that paragraph. I didn't intend it to be that way (since I don't feel that way -- no one has to read a book I wrote to understand this stuff)\, so in case anyone took it that way\, I'm sorry about the poor choice of wording.

The point I intended to make is that those that don't know about backtracking will have no idea why something that should work doesn't.

While I agree that it would be nice to remove limitations imposed by the implementation\, pragmatically speaking\, I don't think it will happen any time soon. Perl regular expressions are specifically advertised to work via backtracking\, and so I think that an understanding of that issue can help users get around "bugs" that won't get fixed.

  Jeffrey

p5pRT commented 17 years ago

From @demerphq

This is fixed in bleadperl.

p5pRT commented 17 years ago

@demerphq - Status changed from 'open' to 'resolved'