Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.88k stars 530 forks source link

Some obscure Segmentation fault on a long file #20411

Closed AnFunctionArray closed 1 year ago

AnFunctionArray commented 1 year ago

I still haven't exactly figured out why but here is some debug info nevertheless:

2117462 <1];%ne> <xtern char>   |493641|  1269:SUSPEND(1291)
2117462 <1];%ne> <xtern char>   |493642|   1271:CURLYN[62]{0,INFTY}(1289)
                             |493642|   ANYOF[0-9@A-Z_a-z] can match 5 times out of 2147483647...
2117467 <xtern> < char s_zo>    |493643|    1289:SUCCEED(0)
                             |493643|    SUCCEED: subpattern success...
2117467 <xtern> < char s_zo>    |493641|  1291:SUCCEED(0)
                             |493641|  SUCCEED: subpattern success...
2117467 <xtern> < char s_zo>    |493640| 1293:BOUND(1294)
2117467 <xtern> < char s_zo>    |493640| 1294:CLOSE61 'identifierpure'(1296)
                             |493640| END: EVAL trying tail ... (cur_eval=1a8a38ee0)
2117467 <xtern> < char s_zo>    |493641|  4017:CLOSE190 'identf'(4019)
2117467 <xtern> < char s_zo>    |493641|  4019:LOGICAL[2](4020)
2117467 <xtern> < char s_zo>    |493641|  4020:EVAL(4023)
EVAL/GOSUB: Matching embedded REx "" against " char s_zone5_master__trigger_17_0018552c[1];%nextern char s"...
2117467 <xtern> < char s_zo>    |493642|   1:NOTHING(2)
2117467 <xtern> < char s_zo>    |493642|   2:END(0)
                             |493642|   END: EVAL trying tail ... (cur_eval=1a8a38ee0)
2117467 <xtern> < char s_zo>    |493643|    4023:SUSPEND(4029)
2117467 <xtern> < char s_zo>    |493644|     4025:STAR(4027)
                             |493644|     POSIXD[\s] can match 1 times out of 2147483647...
2117468 <tern > <char s_zon>    |493645|      4027:SUCCEED(0)
                             |493645|      SUCCEED: subpattern success...
2117468 <tern > <char s_zon>    |493643|    4029:CLOSE189 'identiiferfast'(4031)
                             |493643|    END: EVAL trying tail ... (cur_eval=1a8a32fa0)
2117468 <tern > <char s_zon>    |493644|     4037:LOGICAL[1](4038)
2117468 <tern > <char s_zon>    |493644|     4038:EVAL(4041)
2117468 <tern > <char s_zon>    |493645|      4041:IFTHEN(4151)
2117468 <tern > <char s_zon>    |493645|      4151:CLOSE191 'identifiercompositefast'(4153)
                             |493645|      END: EVAL trying tail ... (cur_eval=100495630)
2117468 <tern > <char s_zon>    |493646|       4162:WHILEM[6/15](0)
                             |493646|       WHILEM: matched 55381 out of 0..65535
2117468 <tern > <char s_zon>    |493647|        4159:GOSUB191[-128:4031] 'identifiercompositefast'(4162)
2117468 <tern > <char s_zon>    |493648|         4031:OPEN191 'identifiercompositefast'(4033)
2117468 <tern > <char s_zon>    |493648|         4033:BRANCH(4080)
2117468 <tern > <char s_zon>    |493649|          4034:GOSUB189[-30:4004] 'identiiferfast'(4037)
2117468 <tern > <char s_zon>    |493650|           4004:OPEN189 'identiiferfast'(4006)
2117468 <tern > <char s_zon>    |493650|           4006:SUSPEND(4012)
2117468 <tern > <char s_zon>    |493651|            4008:STAR(4010)
                             |493651|            POSIXD[\s] can match 0 times out of 2147483647...
2117468 <tern > <char s_zon>    |493652|             4010:SUCCEED(0)
                             |493652|             SUCCEED: subpattern success...
2117468 <tern > <char s_zon>    |493650|           4012:OPEN190 'identf'(4014)
2117468 <tern > <char s_zon>    |493650|           4014:GOSUB61[-2760:1254] 'identifierpure'(4017)
2117468 <tern > <char s_zon>    |493651|            1254:OPEN61 'identifierpure'(1256)
2117468 <tern > <char s_zon>    |493651|            1256:BOUND(1257)
2117468 <tern > <char s_zon>    |493651|            1257:SUSPEND(1293)
2117468 <tern > <char s_zon>    |493652|             1259:ANYOF[A-Z_a-z](1269)
2117469 <ern c> <har s_zone>    |493652|             1269:SUSPEND(1291)
2117469 <ern c> <har s_zone>    |493653|              1271:CURLYN[62]{0,INFTY}(1289)
                             |493653|              ANYOF[0-9@A-Z_a-z] can match 3 times out of 2147483647...
2117472 < char> < s_zone5_m>    |493654|               1289:SUCCEED(0)
                             |493654|               SUCCEED: subpattern success...
2117472 < char> < s_zone5_m>    |493652|             1291:SUCCEED(0)
                             |493652|             SUCCEED: subpattern success...
2117472 < char> < s_zone5_m>    |493651|            1293:BOUND(1294)
2117472 < char> < s_zone5_m>    |493651|            1294:CLOSE61 'identifierpure'(1296)
                             |493651|            END: EVAL trying tail ... (cur_eval=1a8a39510)
2117472 < char> < s_zone5_m>    |493652|             4017:CLOSE190 'identf'(4019)
2117472 < char> < s_zone5_m>    |493652|             4019:LOGICAL[2](4020)
2117472 < char> < s_zone5_m>    |493652|             4020:EVAL(4023)

Program received signal SIGSEGV, Segmentation fault.
0x000000010012e294 in S_save_magic_flags (flags=14680064, sv=0x1a8a38690, mgs_ix=<optimized out>, my_perl=<optimized out>) at mg.c:113
113     mgs->mgs_sv = sv;

The above is on commit:

commit 8290d9dfc2f9409fe4a688c355052cd2bd7ffcdb
Author: Karl Williamson <khw@cpan.org>
Date:   Tue Jul 26 15:48:28 2022 -0600

    APItest:locale.t: Use proper test for LC_ALL presence

    If no LC_ALL, there won't be an LC_ALL() sub.  Instead use the string
    'LC_ALL" and an explicit check to see if it is there.

Plus my optimisation patch (which btw still cuts around half of the execution time - just FYI)

But it was crashing without it as well (and on blead).

The regex is (the executed part at least):

(?<parens>\s*+(?<openparenf>[(\[])(?<inparen>((?&parens)|(?&stringlit)|[^()\[\]"'])*+)(?&inparen)?+(?(?{$+{openparenf} eq '('})\)|\])\s*+)

(?<brackets>\s*+\{(?<inbrackets>((?&brackets)|(?&stringlit)|[^{}"'])*+)(?&inbrackets)?+\}\s*+)

(?<identiiferfast>\s*+(?<identf>(?&identifierpure))(??{checkfastident})\s*+)

(?<identifiercompositefast>(?&identiiferfast)(?(?{$istaggable})\s*+(?&identifierpure)?+\s*+(?&brackets)?+)|[^\w\[\]\(\){=;]|(?&parens)
    |(?=(?<tok>[{=;]))(?{registerfnorobj})(?(?{$+{tok} eq '{'})(?&brackets)
    |(?(?{$+{tok} eq '='})=.*?;|;)))

(?<declsfast>(?&identifiercompositefast)*+(?&declsfast)?+)

perl -V:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = "C.UTF8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("C.UTF8").
Summary of my perl5 (revision 5 version 37 subversion 5) configuration:
  Commit id: 86afe6b06de7f1c961f08890ea7dad6c1acbb0d8
  Platform:
    osname=linux
    osvers=6.0.0-g1501278bb7ba
    archname=ppc64-linux-thread-multi
    uname='linux localhost 6.0.0-g1501278bb7ba #9 smp sun oct 16 14:51:46 -00 2022 ppc64 ppc970mp, altivec supported powermac11,2 gnulinux '
    config_args=''
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-g -O2'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='12.2.0'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=87654321
    doublekind=4
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=6
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib /lib64 /usr/lib64 /lib /usr/local/lib64
    libs=-lpthread -lgdbm -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=/lib/../lib64/libc.so.6
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.36'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -g -O2 -L/usr/local/lib -fstack-protector-strong'

Characteristics of this binary (from libperl): 
  Compile-time options:
    HAS_TIMES
    MULTIPLICITY
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_HASH_FUNC_SIPHASH13
    PERL_HASH_USE_SBOX32
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_SAFE_PUTENV
    USE_64_BIT_ALL
    USE_64_BIT_INT
    USE_ITHREADS
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
    USE_REENTRANT_API
    USE_THREAD_SAFE_LOCALE
  Built under linux
  Compiled at Oct 18 2022 23:10:24
  @INC:
    /usr/local/lib/perl5/site_perl/5.37.5/ppc64-linux-thread-multi
    /usr/local/lib/perl5/site_perl/5.37.5
    /usr/local/lib/perl5/5.37.5/ppc64-linux-thread-multi
    /usr/local/lib/perl5/5.37.5
    /usr/local/lib/perl5/site_perl/5.37.3
    /usr/local/lib/perl5/site_perl

It's not my RAM running out because I've 23 GBs

Some more info (with -O0 build):

Program received signal SIGSEGV, Segmentation fault.
0x000000010019c4b8 in S_save_magic_flags (my_perl=0x1005002a0, mgs_ix=-2147452408, sv=0x1071fe810, flags=14680064) at mg.c:113
113     mgs->mgs_sv = sv;
(gdb) p mgs
$1 = (MGS *) 0x7ffe36547a18
(gdb) p *mgs
Cannot access memory at address 0x7ffe36547a18
AnFunctionArray commented 1 year ago

On further inspection this:

mgs_ix=-2147452408

Which is used as pointer (is originally int).

May actually be truncated somewhere I suspect highly.

AnFunctionArray commented 1 year ago

I also made reproducible file (that is not copyrighted) - basically it consists of 123416 lines of extern char dummy[2009];.

Here you can get it (if its easier - but you can also copy paste the above line enough times). noncopyrightsample.pp.zip

You can run my https://github.com/AnFunctionArray/cperllexer (basically perl ./parse.pl noncopyrightsample.pp from project dir).

tonycoz commented 1 year ago

20473 is similar to your #20412 in that it changes the effective type of SSNEW(), but uses a larger type. With U32, on a 64-bit platform this could result in overwriting the start of the save stack.

But I don't think this is the base cause of the problem, the save stack entries in save_magic(), and further entries generated during some of the (?{...}) code isn't being cleaned up, so the save stack gets larger and larger and larger.

Normally this would be handled by wrapping an ENTER/LEAVE pair around the calling code, but I have vague memories of there being a reason this wasn't being done. Do you have any ideas @iabyn ?

demerphq commented 1 year ago

Lets ge the optimization patch merged.

AnFunctionArray commented 1 year ago

Lets ge the optimization patch merged.

Nah it was the fix. I don't know if it was a mistake or not tbh. Because @tonycoz said he was going to investigate it further.

I don't complain - it did fix things.

demerphq commented 1 year ago

@AnFunctionArray i just think we should get the optimization patch your wrote merged, my comment wasnt about this ticket. Just a reminder to @tonycoz and @khwilliamson and me to get your optimization patch reviewed and merged.

AnFunctionArray commented 1 year ago

@AnFunctionArray i just think we should get the optimization patch your wrote merged, my comment wasnt about this ticket. Just a reminder to @tonycoz and @khwilliamson and me to get your optimization patch reviewed and merged.

I'm a little bit tired and read it as an of the sort "Lets go the optimization patch merged." - sorry - yeah I don't mind.

iabyn commented 1 year ago

On Tue, Nov 01, 2022 at 05:05:04PM -0700, Tony Cook wrote:

Normally this would be handled by wrapping an ENTER/LEAVE pair around the calling code, but I have vague memories of there being a reason this wasn't being done. Do you have any ideas @iabyn ?

When code blocks were first added to regexes (before even my time!) it was decided that 'local' should accumulate across iterations (but be undone when backtracking) rather than being undone at the end of each code block.

I've always hated this, as it makes it harder internally (as if code blocks in patterns wasn't already complex enough...)

-- Standards (n). Battle insignia or tribal totems.

demerphq commented 1 year ago

On Mon, 7 Nov 2022, 16:37 iabyn, @.***> wrote:

On Tue, Nov 01, 2022 at 05:05:04PM -0700, Tony Cook wrote:

Normally this would be handled by wrapping an ENTER/LEAVE pair around the calling code, but I have vague memories of there being a reason this wasn't being done. Do you have any ideas @iabyn ?

When code blocks were first added to regexes (before even my time!) it was decided that 'local' should accumulate across iterations (but be undone when backtracking) rather than being undone at the end of each code block.

I've always hated this, as it makes it harder internally (as if code blocks in patterns wasn't already complex

It sounds like you think this should be changed, should we dig into it and see if we can change it?

Yves

iabyn commented 1 year ago

On Tue, Nov 08, 2022 at 01:31:56AM -0800, Yves Orton wrote:

On Mon, 7 Nov 2022, 16:37 iabyn, @.***> wrote:

When code blocks were first added to regexes (before even my time!) it was decided that 'local' should accumulate across iterations (but be undone when backtracking) rather than being undone at the end of each code block.

I've always hated this, as it makes it harder internally (as if code blocks in patterns wasn't already complex

It sounds like you think this should be changed, should we dig into it and see if we can change it?

Well, it's behaviour that (IIRC) is documented in the camel Book - it's certainly a feature not a bug. So although it has made my life hard from time to time when messing in the internals, I've always accepted it and worked around it. I don't think we could change it without breaking stuff.

-- "There's something wrong with our bloody ships today, Chatfield." -- Admiral Beatty at the Battle of Jutland, 31st May 1916.

hvds commented 1 year ago

I don't think we could change it without breaking stuff.

Changing it would certainly break a lot of my stuff, some of which is still in production. I consider it an essential feature for non-trivial recursive regexps such as grammars.

demerphq commented 1 year ago

@hvds can you work out a simple example script to demonstrate what this provides? I don't want or intend to break anything, but I would like to understand the intent and background here (and maybe take the time to document it somewhere).

Maybe i misunderstand. My understanding is that in code like this:

/(?{ ... })/

we do not collect locals when the block ends, but we do on backtracking. But i can't quite picture in my mind what this enables exactly. @iabyn described what is supposed to happen and said this is demonstrated in the camel book, but didn't mention where. If you can come up with a simple demo it would be helpful. I will review the camel book, but i suspect since you care about this you can come up with an example fairly directly.

AnFunctionArray commented 1 year ago

@demerphq That's interesting - I personally don't think I use that. But if it's like this - do perl have destructors - because I could use this possibly as a way to catch backtracking - currently I have this:

(?(?=(pattern))\g{-1}(?{code tru})|(?{code fals})(*F))

AnFunctionArray commented 1 year ago

But maybe it could be more elegantly written - with this feature.

demerphq commented 1 year ago

@AnFunctionArray What do you mean "catch backtracking"? In theory we could have a code block that executes only when traversed into via backtracking. Eg, something like this:

/PAT_A(?-{ print "this only prints if the thing following it fails to match"})PAT_B/

So the (?-{}) would be treated as a zero width always accepting assertion, but when backtracked into would execute its contents like a (?{ }) would. Put another way it would execute when entered "from the right", the opposite of (?{...}) which is a zero width always accepting assertion which executes when entered "from the left". FWIW, it can simplify things to think of what regops do when entered from the left or right, that is what they do when they are supposed to match (from the left), and what they do when they are backtracked into (from the right).

demerphq commented 1 year ago

@AnFunctionArray if you are interested in this stuff maybe try reaching out to me on the #p5p irc channel.

AnFunctionArray commented 1 year ago

@demerphq I'm definitely interested in this stuff maybe I'll join but I've issue with the fact that you must be constantly online to keep with news there.

AnFunctionArray commented 1 year ago

@demerphq - maybe if you are interested in joining our discord server @tonycoz ?

If you like use discord.

hvds commented 1 year ago

@hvds can you work out a simple example script to demonstrate what this provides?

I don't have access to the serious examples, those were all at work.

My crossword-helper program provides some less serious examples. Throughout, we may use $succeed = qr{(?=)}, $fail = qr{(?!)}.

Here's a pattern that matches words that are an anagram of ab...:

/^(?:(?=.*a)(?=.*b).{5,5})\z/oi

While this matches words that are an anagram of a subset of ab...:

/^(?{ @d = (3) })(?:(?:a(?!.*a)|b(?!.*b)|.(??{
  local $d[0] = $d[0] - 1; $d[0] >= 0 ? $succeed : $fail
}))+)\z/oi

This matches an anagram of a[ab]...:

/^(?{
  $d[0] = [ [ 1, 1 ], [ 3, 3 ], [ 1, 1 ] ];
  $e[0] = [ (0) x 3 ];
})(?:(?:a(??{ 
  local $e[0][0] = $e[0][0] + 1;
  $e[0][0] > $d[0][0][1] ? $fail : $succeed
})|.(??{ 
  local $e[0][1] = $e[0][1] + 1;
  $e[0][1] > $d[0][1][1] ? $fail : $succeed
})|[ab](??{ 
  local $e[0][2] = $e[0][2] + 1;
  $e[0][2] > $d[0][2][1] ? $fail : $succeed
}))*(??{
  (grep $e[0][$_] < $d[0][$_][0], 0 .. 2) ? $fail : $succeed
}))\z/oi

And this matches an anagram of a subset of a[ab]...:

/^(?{ @d = (3, 3) })(?:(?:b(?!.*b)|a(?!.*a)|.(??{
  local $d[0] = $d[0] - 1; $d[0] >= 0 ? $succeed : $fail
}))+|(?:a(?!.*a.*a)|.(??{
  local $d[1] = $d[1] - 1; $d[1] >= 0 ? $succeed : $fail
}))+)\z/oi
AnFunctionArray commented 1 year ago

@hvds I used to do this but I reckon it was slow so I switched to:

(?(?{checkident()})|(*F))

AnFunctionArray commented 1 year ago

But I'm not sure how this relates to locals being kept until backtrack (if I understand the feature in question).

demerphq commented 1 year ago

@AnFunctionArray we are having a near synchonous conversation in github ticket comments, IMO p5p would make that process quite a bit more efficient.

AnFunctionArray commented 1 year ago

@demerphq I've written there - It's MAGnet right?