chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.8k stars 421 forks source link

System dependent segmentation fault during compilation of program #9047

Closed ian-bertolacci closed 6 years ago

ian-bertolacci commented 6 years ago

Summary of Problem

The below program does not compile on some systems (my laptop) but does on others (our high performance gaming pc research test server).

Attached is the information about both systems. They have identical chapel versions, linux kernels, gcc versions, so cpu, current memory, and PCI information is also included, since my hunch is that this is a really weird hardware issue

The code is automatically generated for the purpose of timing how long accesses of an array take. The code in question was generated to time 93 total accesses into an array. Compiling this code (or outputting C code) produces a segmentation fault on the laptop, but successful builds on the research server. However, code generated to time 92 total accesses successfully builds on both the laptop and research machine.

Steps to Reproduce

Source Code:

use Time;
// writeln substitute
proc w( v ){
    writeln( v );
}
proc main(){
    const N : int = 1000;
    var a : [0..#N] int;w(t0(a)[1]);}// Close main
proc t0(ref a){
    var t=new Timer();
    t.start();
    ref r0=a[0];
    ref r1=a[1];
    ref r2=a[2];
    ref r3=a[3];
    ref r4=a[4];
    ref r5=a[5];
    ref r6=a[6];
    ref r7=a[7];
    ref r8=a[8];
    ref r9=a[9];
    ref r10=a[10];
    ref r11=a[11];
    ref r12=a[12];
    ref r13=a[13];
    ref r14=a[14];
    ref r15=a[15];
    ref r16=a[16];
    ref r17=a[17];
    ref r18=a[18];
    ref r19=a[19];
    ref r20=a[20];
    ref r21=a[21];
    ref r22=a[22];
    ref r23=a[23];
    ref r24=a[24];
    ref r25=a[25];
    ref r26=a[26];
    ref r27=a[27];
    ref r28=a[28];
    ref r29=a[29];
    ref r30=a[30];
    ref r31=a[31];
    ref r32=a[32];
    ref r33=a[33];
    ref r34=a[34];
    ref r35=a[35];
    ref r36=a[36];
    ref r37=a[37];
    ref r38=a[38];
    ref r39=a[39];
    ref r40=a[40];
    ref r41=a[41];
    ref r42=a[42];
    ref r43=a[43];
    ref r44=a[44];
    ref r45=a[45];
    ref r46=a[46];
    ref r47=a[47];
    ref r48=a[48];
    ref r49=a[49];
    ref r50=a[50];
    ref r51=a[51];
    ref r52=a[52];
    ref r53=a[53];
    ref r54=a[54];
    ref r55=a[55];
    ref r56=a[56];
    ref r57=a[57];
    ref r58=a[58];
    ref r59=a[59];
    ref r60=a[60];
    ref r61=a[61];
    ref r62=a[62];
    ref r63=a[63];
    ref r64=a[64];
    ref r65=a[65];
    ref r66=a[66];
    ref r67=a[67];
    ref r68=a[68];
    ref r69=a[69];
    ref r70=a[70];
    ref r71=a[71];
    ref r72=a[72];
    ref r73=a[73];
    ref r74=a[74];
    ref r75=a[75];
    ref r76=a[76];
    ref r77=a[77];
    ref r78=a[78];
    ref r79=a[79];
    ref r80=a[80];
    ref r81=a[81];
    ref r82=a[82];
    ref r83=a[83];
    ref r84=a[84];
    ref r85=a[85];
    ref r86=a[86];
    ref r87=a[87];
    ref r88=a[88];
    ref r89=a[89];
    ref r90=a[90];
    ref r91=a[91];
    ref r92=a[92];
    t.stop();
    var v=r0+r1+r2+r3+r4+r5+r6+r7+r8+r9+r10+r11+r12+r13+r14+r15+r16+r17+r18+r19+r20+r21+r22+r23+r24+r25+r26+r27+r28+r29+r30+r31+r32+r33+r34+r35+r36+r37+r38+r39+r40+r41+r42+r43+r44+r45+r46+r47+r48+r49+r50+r51+r52+r53+r54+r55+r56+r57+r58+r59+r60+r61+r62+r63+r64+r65+r66+r67+r68+r69+r70+r71+r72+r73+r74+r75+r76+r77+r78+r79+r80+r81+r82+r83+r84+r85+r86+r87+r88+r89+r90+r91+r92;
    return (t.elapsed()/93,v);
}

This code was generate with the following python script:

#!/usr/bin/env python3

accesses = 93
N = 10
modifier = "ref"

with open("output.chpl", "w") as output:

  output.write(
    "\n".join(
      [
        "use Time;",
        "// writeln substitute",
        "proc w( v ){",
        "\twriteln( v );",
        "}",
        "proc main(){",
        "\tconst N : int = {};".format(N),
        "\tvar a : [0..#N] int;",
      ]
    )
  )
  for i in range(1):
    output.write( "w(t{}(a)[1]);".format( i ) )

  output.write("}// Close main\n")

  for proc_i in range(1):
    output.write("proc t{}(ref a){{\n\tvar t=new Timer();\n\tt.start();\n".format(proc_i))
    for access_i in range(accesses):
      index = access_i % N
      output.write("\t{} r{}=a[{}];\n".format(modifier, access_i, index) )
    output.write("\tt.stop();\n\tvar v=r0")
    for access_i in range(1,accesses):
      output.write("+r{}".format(access_i))
    output.write(";\n\treturn (t.elapsed()/{},v);\n}}".format(accesses));

Compile command: Either

The python script can be run as a script. If you want to change the number of accesses generated, set the variable accesses to the desired number. The parameter N is the size of the array, but changing it does not seem to impact whether or not it compiles.

Configuration Information

Failing System (Laptop):

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none *
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_MAKE: make
CHPL_ATOMICS: intrinsics
CHPL_GMP: none
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_WIDE_POINTERS: struct
CHPL_AUX_FILESYS: none
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 78
Model name:            Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
Stepping:              3
CPU MHz:               499.968
CPU max MHz:           2800.0000
CPU min MHz:           400.0000
BogoMIPS:              4800.44
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt spec_ctrl retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
MemTotal:       15835684 kB
MemFree:          164472 kB
MemAvailable:   10151856 kB
Buffers:          645852 kB
Cached:          8776892 kB
SwapCached:            0 kB
Active:          6475876 kB
Inactive:        6767660 kB
Active(anon):    3565060 kB
Inactive(anon):   793448 kB
Active(file):    2910816 kB
Inactive(file):  5974212 kB
Unevictable:        1728 kB
Mlocked:            1728 kB
SwapTotal:      16174588 kB
SwapFree:       16174588 kB
Dirty:             41912 kB
Writeback:             0 kB
AnonPages:       3823396 kB
Mapped:           798716 kB
Shmem:            536852 kB
Slab:            2217096 kB
SReclaimable:    1440804 kB
SUnreclaim:       776292 kB
KernelStack:       16304 kB
PageTables:        80456 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    24092428 kB
Committed_AS:   23120736 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1366016 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      442048 kB
DirectMap2M:    14684160 kB
DirectMap1G:     2097152 kB

Succeeding System (Research Server)

CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none *
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_MAKE: make
CHPL_ATOMICS: intrinsics
CHPL_GMP: gmp
CHPL_HWLOC: none *
CHPL_REGEXP: re2
CHPL_WIDE_POINTERS: struct
CHPL_AUX_FILESYS: none
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz
Stepping:              1
CPU MHz:               1212.125
CPU max MHz:           4000.0000
CPU min MHz:           1200.0000
BogoMIPS:              6399.98
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch invpcid_single intel_pt retpoline kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
MemTotal:       32848556 kB
MemFree:        29097708 kB
MemAvailable:   32005980 kB
Buffers:          891900 kB
Cached:          1922940 kB
SwapCached:            0 kB
Active:          2161720 kB
Inactive:         854420 kB
Active(anon):     207036 kB
Inactive(anon):    27964 kB
Active(file):    1954684 kB
Inactive(file):   826456 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      63476732 kB
SwapFree:       63476732 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        201376 kB
Mapped:           169192 kB
Shmem:             33660 kB
Slab:             594068 kB
SReclaimable:     538212 kB
SUnreclaim:        55856 kB
KernelStack:        7088 kB
PageTables:        14008 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    79901008 kB
Committed_AS:    2392316 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:     81920 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      121744 kB
DirectMap2M:     5023744 kB
DirectMap1G:    30408704 kB
ian-bertolacci commented 6 years ago

I've broken it on our server as well, when generating with access > 154.

cassella commented 6 years ago

It fails (with 93 accesses) on my laptop as well, 6 hundred-odd stack frames in.

fortytwo@magrathea:~/src/chpl/ogg$ chpl --gdb output.chpl
Reading symbols from chpl...done.
Breakpoint 1 at 0x6fcd39: file misc.cpp, line 61.
(gdb) r
Starting program: /home/fortytwo/src/chapel/bin/linux64/chpl output.chpl

Program received signal SIGSEGV, Segmentation fault.
0x000000000052138c in CallExpr::codegenPrimitive (
    this=<error reading variable: Cannot access memory at address 0x7fffff7f6330>) at expr.cpp:3490
3490    GenRet CallExpr::codegenPrimitive() {
(gdb) i s
#0  0x000000000052138c in CallExpr::codegenPrimitive (
    this=<error reading variable: Cannot access memory at address 0x7fffff7f6330>) at expr.cpp:3490
#1  0x000000000052077e in CallExpr::codegen (this=0xb10e9a0) at expr.cpp:3376
#2  0x000000000045c9d2 in baseASTCodegen (ast=0xb10e9a0) at baseAST.cpp:677
#3  0x0000000000535782 in GenRet::GenRet (this=0x7fffff805ed0, ast=0xb10e9a0) at ../include/genret.h:145
#4  0x00000000005233f2 in CallExpr::codegenPrimitive (this=0xb02cc0) at expr.cpp:3792
#5  0x000000000052077e in CallExpr::codegen (this=0xb02cc0) at expr.cpp:3376
#6  0x000000000045c9d2 in baseASTCodegen (ast=0xb02cc0) at baseAST.cpp:677
#7  0x0000000000535782 in GenRet::GenRet (this=0x7fffff81c670, ast=0xb02cc0) at ../include/genret.h:145
#8  0x000000000052a8b7 in CallExpr::codegenPrimitive (this=0x2bace80) at expr.cpp:4736
#9  0x000000000052077e in CallExpr::codegen (this=0x2bace80) at expr.cpp:3376
#10 0x000000000045c9d2 in baseASTCodegen (ast=0x2bace80) at baseAST.cpp:677
#11 0x0000000000535782 in GenRet::GenRet (this=0x7fffff81fbb0, ast=0x2bace80) at ../include/genret.h:145
...
#626 0x000000000045c9d2 in baseASTCodegen (ast=0x1889e90) at baseAST.cpp:677
#627 0x0000000000535782 in GenRet::GenRet (this=0x7ffffffe47f0, ast=0x1889e90) at ../include/genret.h:145
#628 0x000000000052341b in CallExpr::codegenPrimitive (this=0xcd89eb0) at expr.cpp:3792
#629 0x000000000052077e in CallExpr::codegen (this=0xcd89eb0) at expr.cpp:3376
#630 0x000000000045c9d2 in baseASTCodegen (ast=0xcd89eb0) at baseAST.cpp:677
#631 0x0000000000535782 in GenRet::GenRet (this=0x7ffffffef7f0, ast=0xcd89eb0) at ../include/genret.h:145
#632 0x0000000000530dd7 in CallExpr::codegenPrimMove (this=0xcd89db0) at expr.cpp:5185
#633 0x0000000000522465 in CallExpr::codegenPrimitive (this=0xcd89db0) at expr.cpp:3618
#634 0x000000000052077e in CallExpr::codegen (this=0xcd89db0) at expr.cpp:3376
#635 0x000000000053975f in AList::codegen (this=0xafc6b30, separator=0x7183c7 "") at alist.cpp:35
#636 0x0000000000539daa in BlockStmt::codegen (this=0xafc6ae0) at stmt.cpp:115
#637 0x000000000053d750 in FnSymbol::codegenDef (this=0xafc63b0) at symbol.cpp:1536
#638 0x000000000053db45 in ModuleSymbol::codegenDef (this=0x253b190) at symbol.cpp:1654
#639 0x0000000000504b2f in codegen () at codegen.cpp:2261
#640 0x00000000005661f8 in runPass (tracker=..., passIndex=40, isChpldoc=false) at runpasses.cpp:202
#641 0x000000000056610c in runPasses (tracker=..., isChpldoc=false) at runpasses.cpp:169
#642 0x000000000056131d in main (argc=2, argv=0x7fffffffde78) at driver.cpp:1350
ian-bertolacci commented 6 years ago

I've changed the code generator, to generate compound assignment statements (s += r0; s+= r1;...) instead of a huge addition expression (s=r0+r1...;). This seems to improve things, and I can now compile with a very large number of accesses. Im not sure what the upper bound is (for reasons that will become obvious) but I was able to compile with 30,000 accesses on the research server.

It did take ~11 hours to find that out. Im not sure how much time is spent in the Chapel compiler vs the C compiler, but they both take a significant portion of time.

Wonder what happens if I add --fast to the mix.....

#!/usr/bin/env python3

import argparse

parser = argparse.ArgumentParser()
parser.add_argument( "--accesses", type=int, default=100 )
parser.add_argument( "--N", type=int, default=1000 )
args = parser.parse_args()

accesses = args.accesses
N = args.N
modifier = "ref"

with open("gen_{}_{}.chpl".format(N,accesses), "w") as output:

  output.write(
    "\n".join(
      [
        "use Time;",
        "// writeln substitute",
        "proc w( v ){",
        "\twriteln( v );",
        "}",
        "proc main(){",
        "\tconst N : int = {};".format(N),
        "\tvar a : [0..#N] int;",
      ]
    )
  )
  for i in range(1):
    output.write( "w(t{}(a));".format( i ) )

  output.write("}// Close main\n")

  for proc_i in range(1):
    output.write("proc t{}(ref a){{var t=new Timer();t.start();".format(proc_i))
    for access_i in range(accesses):
      index = access_i % N
      output.write("{} r{}=a[{}];".format(modifier, access_i, index) )
    output.write("t.stop();var s=r0;")
    for access_i in range(1,accesses):
      output.write("s+=r{};".format(access_i))
    output.write(";return (t.elapsed()/{},s);}}".format(accesses));

(I've also added some cli args, and it generates to a file with a meaningful name)

daviditen commented 6 years ago

This looks like a stack overflow during codegen. I can reproduce it on my MBP laptop with 56 accesses, but not 55.

ian-bertolacci commented 6 years ago

Interesting. So the "machine dependent"-ness of this issue boils down to how much stack space the OS allocates to an application? I wonder if I upgraded my Laptop to 32GB if I would could compile with 154 accesses under the original code scheme.

bradcray commented 6 years ago

@daviditen : Out of curiosity do you know:

ian-bertolacci commented 6 years ago

@bradcray I just tried to compile my original code with --print-commands --print-passes and got the following:

> chpl --print-commands --print-passes ./test.chpl 
                            init :   0.067 seconds
                           parse :   0.158 seconds
                     checkParsed :   0.020 seconds
                            docs :   0.003 seconds
                     readExternC :   0.003 seconds
          expandExternArrayCalls :   0.004 seconds
                         cleanup :   0.032 seconds
                    scopeResolve :   0.264 seconds
                  flattenClasses :   0.003 seconds
                       normalize :   0.370 seconds
                 checkNormalized :   0.011 seconds
           buildDefaultFunctions :   0.200 seconds
             createTaskFunctions :   0.021 seconds
                         resolve :   2.135 seconds
                  resolveIntents :   0.005 seconds
                   checkResolved :   0.012 seconds
replaceArrayAccessesWithRefTemps :   0.007 seconds
           processIteratorYields :   0.013 seconds
                flattenFunctions :   0.016 seconds
              cullOverReferences :   0.012 seconds
              lowerErrorHandling :   0.019 seconds
                 callDestructors :   0.039 seconds
                  lowerIterators :   0.138 seconds
                        parallel :   0.021 seconds
                           prune :   0.055 seconds
                 bulkCopyRecords :   0.005 seconds
  removeUnnecessaryAutoCopyCalls :   0.022 seconds
                 inlineFunctions :   0.119 seconds
                   scalarReplace :   0.025 seconds
                  refPropagation :   0.072 seconds
                 copyPropagation :   0.125 seconds
             deadCodeElimination :   0.051 seconds
               removeWrapRecords :   0.006 seconds
              removeEmptyRecords :   0.005 seconds
                 localizeGlobals :   0.013 seconds
         loopInvariantCodeMotion :   0.048 seconds
                          prune2 :   0.012 seconds
       returnStarTuplesByRefArgs :   0.006 seconds
            insertWideReferences :   0.003 seconds
               optimizeOnClauses :   0.017 seconds
                    addInitCalls :   0.006 seconds
               insertLineNumbers :   0.028 seconds
                     denormalize :   0.052 seconds
Segmentation fault (core dumped)
ian-bertolacci commented 6 years ago

Also, I'm beginning to doubt my ~11 hour build time. I just tried the same compile with --fast and it took 7 minutes, which was very surprising, so I'm trying it again. Maybe my parallel build idea was not so good...

Also, the binary is much smaller, 3.6MB with --fast vs 25MB without.

Also, the original 93 accesses code also doesn't compile under --fast.

ian-bertolacci commented 6 years ago

Now this is really interesting, and could probably be its own issue. As I said in my third comment, I updated my code generator, and can now compile benchmarks with tens of thousands of accesses. In that example, I generated and compiled a program with 30K accesses of a 100K element array.

I also mentioned that it took ~11 hours to compile without --fast. Yesterday I found that with --fast, compilation only takes 7 minutes That's one hell of a difference, and the total opposite of what I expected. The --print-passes-file (below) show that the makeBinary 'pass' contributes the most to the compile time (404 seconds with --fast and 37144 seconds without --fast).

Here is the --print-passes-file output from the compilation with --fast:

                            init :   0.060 seconds
                           parse :   0.347 seconds
                     checkParsed :   0.037 seconds
                            docs :   0.005 seconds
                     readExternC :   0.005 seconds
          expandExternArrayCalls :   0.006 seconds
                         cleanup :   0.038 seconds
                    scopeResolve :   0.380 seconds
                  flattenClasses :   0.006 seconds
                       normalize :   0.606 seconds
                 checkNormalized :   0.014 seconds
           buildDefaultFunctions :   0.185 seconds
             createTaskFunctions :   0.024 seconds
                         resolve :   3.211 seconds
                  resolveIntents :   0.017 seconds
                   checkResolved :   0.034 seconds
replaceArrayAccessesWithRefTemps :   0.012 seconds
           processIteratorYields :   0.026 seconds
                flattenFunctions :   0.029 seconds
              cullOverReferences :   0.167 seconds
              lowerErrorHandling :   0.047 seconds
                 callDestructors :   0.165 seconds
                  lowerIterators :   0.281 seconds
                        parallel :   0.049 seconds
                           prune :   0.108 seconds
                 bulkCopyRecords :   0.022 seconds
  removeUnnecessaryAutoCopyCalls :   0.030 seconds
                 inlineFunctions :   0.594 seconds
                   scalarReplace :   0.329 seconds
                  refPropagation :   0.916 seconds
                 copyPropagation :   1.415 seconds
             deadCodeElimination :   0.455 seconds
               removeWrapRecords :   0.030 seconds
              removeEmptyRecords :   0.030 seconds
                 localizeGlobals :   0.099 seconds
         loopInvariantCodeMotion :   0.129 seconds
                          prune2 :   0.138 seconds
       returnStarTuplesByRefArgs :   0.055 seconds
            insertWideReferences :   0.033 seconds
               optimizeOnClauses :   0.071 seconds
                    addInitCalls :   0.030 seconds
               insertLineNumbers :   0.193 seconds
                     denormalize :   0.720 seconds
                         codegen :   0.859 seconds
                      makeBinary : 404.932 seconds
                   driverCleanup :   0.000 seconds
                      total time : 417.207 seconds

Pass               Name                   Main    Check    Clean     Time    %     Accum    %  
---- ---------------------------------  -------  -------  -------  ------- -----  ------- -----
     startup                              0.000    0.000    0.000    0.000   0.0    0.000   0.0
     init                                 0.060    0.000    0.000    0.060   0.0    0.060   0.0
   1 parse                                0.332    0.000    0.015    0.347   0.1    0.407   0.1
   2 checkParsed                          0.030    0.001    0.006    0.037   0.0    0.444   0.1
   3 docs                                 0.000    0.000    0.005    0.005   0.0    0.450   0.1
   4 readExternC                          0.000    0.000    0.005    0.005   0.0    0.455   0.1
   5 expandExternArrayCalls               0.000    0.000    0.005    0.006   0.0    0.461   0.1
   6 cleanup                              0.032    0.000    0.006    0.038   0.0    0.499   0.1
   7 scopeResolve                         0.368    0.000    0.012    0.380   0.1    0.879   0.2
   8 flattenClasses                       0.000    0.000    0.006    0.006   0.0    0.885   0.2
   9 normalize                            0.591    0.000    0.015    0.606   0.1    1.491   0.4
  10 checkNormalized                      0.003    0.000    0.011    0.014   0.0    1.505   0.4
  11 buildDefaultFunctions                0.174    0.000    0.012    0.185   0.0    1.690   0.4
  12 createTaskFunctions                  0.002    0.011    0.011    0.024   0.0    1.714   0.4
  13 resolve                              2.988    0.017    0.206    3.211   0.8    4.925   1.2
  14 resolveIntents                       0.005    0.000    0.011    0.017   0.0    4.942   1.2
  15 checkResolved                        0.022    0.000    0.011    0.034   0.0    4.976   1.2
  16 replaceArrayAccessesWithRefTemps     0.000    0.000    0.011    0.012   0.0    4.988   1.2
  17 processIteratorYields                0.014    0.000    0.012    0.026   0.0    5.014   1.2
  18 flattenFunctions                     0.017    0.000    0.011    0.029   0.0    5.043   1.2
  19 cullOverReferences                   0.143    0.001    0.024    0.167   0.0    5.210   1.2
  20 lowerErrorHandling                   0.038    0.000    0.009    0.047   0.0    5.257   1.3
  21 callDestructors                      0.152    0.000    0.012    0.165   0.0    5.421   1.3
  22 lowerIterators                       0.254    0.013    0.015    0.281   0.1    5.703   1.4
  23 parallel                             0.022    0.015    0.012    0.049   0.0    5.752   1.4
  24 prune                                0.072    0.012    0.024    0.108   0.0    5.860   1.4
  25 bulkCopyRecords                      0.000    0.010    0.011    0.022   0.0    5.882   1.4
  26 removeUnnecessaryAutoCopyCalls       0.009    0.010    0.011    0.030   0.0    5.912   1.4
  27 inlineFunctions                      0.480    0.026    0.088    0.594   0.1    6.506   1.6
  28 scalarReplace                        0.240    0.028    0.061    0.329   0.1    6.835   1.6
  29 refPropagation                       0.847    0.021    0.048    0.916   0.2    7.751   1.9
  30 copyPropagation                      1.280    0.017    0.118    1.415   0.3    9.167   2.2
  31 deadCodeElimination                  0.422    0.015    0.018    0.455   0.1    9.621   2.3
  32 removeWrapRecords                    0.000    0.015    0.015    0.030   0.0    9.651   2.3
  33 removeEmptyRecords                   0.000    0.015    0.015    0.030   0.0    9.681   2.3
  34 localizeGlobals                      0.070    0.015    0.015    0.099   0.0    9.780   2.3
  35 loopInvariantCodeMotion              0.099    0.015    0.015    0.129   0.0    9.909   2.4
  36 prune2                               0.108    0.015    0.015    0.138   0.0   10.047   2.4
  37 returnStarTuplesByRefArgs            0.025    0.015    0.015    0.055   0.0   10.103   2.4
  38 insertWideReferences                 0.003    0.015    0.015    0.033   0.0   10.136   2.4
  39 optimizeOnClauses                    0.041    0.015    0.015    0.071   0.0   10.207   2.4
  40 addInitCalls                         0.001    0.015    0.015    0.030   0.0   10.237   2.5
  41 insertLineNumbers                    0.156    0.015    0.021    0.193   0.0   10.430   2.5
  42 denormalize                          0.667    0.000    0.053    0.720   0.2   11.150   2.7
  43 codegen                              0.841    0.000    0.018    0.859   0.2   12.009   2.9
  44 makeBinary                         404.912    0.000    0.285  405.198  97.1  417.207 100.0
     driverCleanup                        0.000    0.000    0.000    0.000   0.0  417.207 100.0

     total time                         415.524    0.345    1.337  417.207

Pass               Name                   Main    Check    Clean     Time    %     Accum    %  
---- ---------------------------------  -------  -------  -------  ------- -----  ------- -----
  44 makeBinary                         404.912    0.000    0.285  405.198  97.1  405.198  97.1
  13 resolve                              2.988    0.017    0.206    3.211   0.8  408.409  97.9
  30 copyPropagation                      1.280    0.017    0.118    1.415   0.3  409.824  98.2
  29 refPropagation                       0.847    0.021    0.048    0.916   0.2  410.740  98.5
  43 codegen                              0.841    0.000    0.018    0.859   0.2  411.599  98.7
  42 denormalize                          0.667    0.000    0.053    0.720   0.2  412.320  98.8
   9 normalize                            0.591    0.000    0.015    0.606   0.1  412.926  99.0
  27 inlineFunctions                      0.480    0.026    0.088    0.594   0.1  413.520  99.1
  31 deadCodeElimination                  0.422    0.015    0.018    0.455   0.1  413.975  99.2
   7 scopeResolve                         0.368    0.000    0.012    0.380   0.1  414.354  99.3
   1 parse                                0.332    0.000    0.015    0.347   0.1  414.701  99.4
  28 scalarReplace                        0.240    0.028    0.061    0.329   0.1  415.031  99.5
  22 lowerIterators                       0.254    0.013    0.015    0.281   0.1  415.312  99.5
  41 insertLineNumbers                    0.156    0.015    0.021    0.193   0.0  415.505  99.6
  11 buildDefaultFunctions                0.174    0.000    0.012    0.185   0.0  415.690  99.6
  19 cullOverReferences                   0.143    0.001    0.024    0.167   0.0  415.857  99.7
  21 callDestructors                      0.152    0.000    0.012    0.165   0.0  416.022  99.7
  36 prune2                               0.108    0.015    0.015    0.138   0.0  416.160  99.7
  35 loopInvariantCodeMotion              0.099    0.015    0.015    0.129   0.0  416.289  99.8
  24 prune                                0.072    0.012    0.024    0.108   0.0  416.397  99.8
  34 localizeGlobals                      0.070    0.015    0.015    0.099   0.0  416.496  99.8
  39 optimizeOnClauses                    0.041    0.015    0.015    0.071   0.0  416.567  99.8
     init                                 0.060    0.000    0.000    0.060   0.0  416.628  99.9
  37 returnStarTuplesByRefArgs            0.025    0.015    0.015    0.055   0.0  416.683  99.9
  23 parallel                             0.022    0.015    0.012    0.049   0.0  416.732  99.9
  20 lowerErrorHandling                   0.038    0.000    0.009    0.047   0.0  416.779  99.9
   6 cleanup                              0.032    0.000    0.006    0.038   0.0  416.818  99.9
   2 checkParsed                          0.030    0.001    0.006    0.037   0.0  416.854  99.9
  15 checkResolved                        0.022    0.000    0.011    0.034   0.0  416.888  99.9
  38 insertWideReferences                 0.003    0.015    0.015    0.033   0.0  416.921  99.9
  26 removeUnnecessaryAutoCopyCalls       0.009    0.010    0.011    0.030   0.0  416.952  99.9
  40 addInitCalls                         0.001    0.015    0.015    0.030   0.0  416.982  99.9
  33 removeEmptyRecords                   0.000    0.015    0.015    0.030   0.0  417.012 100.0
  32 removeWrapRecords                    0.000    0.015    0.015    0.030   0.0  417.042 100.0
  18 flattenFunctions                     0.017    0.000    0.011    0.029   0.0  417.070 100.0
  17 processIteratorYields                0.014    0.000    0.012    0.026   0.0  417.096 100.0
  12 createTaskFunctions                  0.002    0.011    0.011    0.024   0.0  417.120 100.0
  25 bulkCopyRecords                      0.000    0.010    0.011    0.022   0.0  417.142 100.0
  14 resolveIntents                       0.005    0.000    0.011    0.017   0.0  417.159 100.0
  10 checkNormalized                      0.003    0.000    0.011    0.014   0.0  417.173 100.0
  16 replaceArrayAccessesWithRefTemps     0.000    0.000    0.011    0.012   0.0  417.184 100.0
   8 flattenClasses                       0.000    0.000    0.006    0.006   0.0  417.190 100.0
   5 expandExternArrayCalls               0.000    0.000    0.005    0.006   0.0  417.196 100.0
   3 docs                                 0.000    0.000    0.005    0.005   0.0  417.202 100.0
   4 readExternC                          0.000    0.000    0.005    0.005   0.0  417.207 100.0
     startup                              0.000    0.000    0.000    0.000   0.0  417.207 100.0
     driverCleanup                        0.000    0.000    0.000    0.000   0.0  417.207 100.0

     total time                         415.524    0.345    1.337  417.207

And from the time command of that compilation:

real    6m57.386s
user    6m55.520s
sys     0m1.752s

Here is the --print-passes-file output from the compilation without --fast:

                            init :   0.059 seconds
                           parse :   0.395 seconds
                     checkParsed :   0.041 seconds
                            docs :   0.006 seconds
                     readExternC :   0.006 seconds
          expandExternArrayCalls :   0.006 seconds
                         cleanup :   0.042 seconds
                    scopeResolve :   0.384 seconds
                  flattenClasses :   0.006 seconds
                       normalize :   0.622 seconds
                 checkNormalized :   0.014 seconds
           buildDefaultFunctions :   0.198 seconds
             createTaskFunctions :   0.024 seconds
                         resolve :   3.349 seconds
                  resolveIntents :   0.017 seconds
                   checkResolved :   0.035 seconds
replaceArrayAccessesWithRefTemps :   0.012 seconds
           processIteratorYields :   0.024 seconds
                flattenFunctions :   0.032 seconds
              cullOverReferences :   0.172 seconds
              lowerErrorHandling :   0.064 seconds
                 callDestructors :   0.153 seconds
                  lowerIterators :   0.290 seconds
                        parallel :   0.047 seconds
                           prune :   0.114 seconds
                 bulkCopyRecords :   0.022 seconds
  removeUnnecessaryAutoCopyCalls :   0.031 seconds
                 inlineFunctions :   1.128 seconds
                   scalarReplace :   0.392 seconds
                  refPropagation :   1.554 seconds
                 copyPropagation : 422.381 seconds
             deadCodeElimination :   1.386 seconds
               removeWrapRecords :   0.085 seconds
              removeEmptyRecords :   0.084 seconds
                 localizeGlobals :   0.248 seconds
         loopInvariantCodeMotion :   0.915 seconds
                          prune2 :   0.358 seconds
       returnStarTuplesByRefArgs :   0.181 seconds
            insertWideReferences :   0.094 seconds
               optimizeOnClauses :   0.223 seconds
                    addInitCalls :   0.085 seconds
               insertLineNumbers :   0.714 seconds
                     denormalize :   1.438 seconds
                         codegen :   3.674 seconds
                      makeBinary :37144.535 seconds
                   driverCleanup :   0.000 seconds
                      total time :37586.354 seconds

Pass               Name                   Main    Check    Clean     Time    %     Accum    %  
---- ---------------------------------  -------  -------  -------  ------- -----  ------- -----
     startup                              0.000    0.000    0.000    0.000   0.0    0.000   0.0
     init                                 0.059    0.000    0.000    0.059   0.0    0.059   0.0
   1 parse                                0.377    0.000    0.018    0.395   0.0    0.454   0.0
   2 checkParsed                          0.034    0.001    0.006    0.041   0.0    0.495   0.0
   3 docs                                 0.000    0.000    0.006    0.006   0.0    0.500   0.0
   4 readExternC                          0.000    0.000    0.006    0.006   0.0    0.506   0.0
   5 expandExternArrayCalls               0.000    0.000    0.006    0.006   0.0    0.512   0.0
   6 cleanup                              0.035    0.000    0.006    0.042   0.0    0.554   0.0
   7 scopeResolve                         0.372    0.000    0.012    0.384   0.0    0.938   0.0
   8 flattenClasses                       0.000    0.000    0.006    0.006   0.0    0.944   0.0
   9 normalize                            0.607    0.000    0.016    0.622   0.0    1.566   0.0
  10 checkNormalized                      0.003    0.000    0.011    0.014   0.0    1.580   0.0
  11 buildDefaultFunctions                0.185    0.000    0.013    0.198   0.0    1.778   0.0
  12 createTaskFunctions                  0.002    0.011    0.011    0.024   0.0    1.802   0.0
  13 resolve                              3.119    0.017    0.213    3.349   0.0    5.151   0.0
  14 resolveIntents                       0.005    0.000    0.011    0.017   0.0    5.168   0.0
  15 checkResolved                        0.023    0.000    0.012    0.035   0.0    5.202   0.0
  16 replaceArrayAccessesWithRefTemps     0.000    0.000    0.012    0.012   0.0    5.214   0.0
  17 processIteratorYields                0.012    0.000    0.012    0.024   0.0    5.238   0.0
  18 flattenFunctions                     0.020    0.000    0.011    0.032   0.0    5.270   0.0
  19 cullOverReferences                   0.146    0.001    0.025    0.172   0.0    5.441   0.0
  20 lowerErrorHandling                   0.055    0.000    0.009    0.064   0.0    5.505   0.0
  21 callDestructors                      0.140    0.000    0.013    0.153   0.0    5.658   0.0
  22 lowerIterators                       0.263    0.011    0.015    0.290   0.0    5.948   0.0
  23 parallel                             0.023    0.011    0.012    0.047   0.0    5.995   0.0
  24 prune                                0.077    0.011    0.025    0.114   0.0    6.108   0.0
  25 bulkCopyRecords                      0.000    0.011    0.011    0.022   0.0    6.130   0.0
  26 removeUnnecessaryAutoCopyCalls       0.010    0.010    0.011    0.031   0.0    6.162   0.0
  27 inlineFunctions                      0.985    0.052    0.091    1.128   0.0    7.290   0.0
  28 scalarReplace                        0.279    0.049    0.064    0.392   0.0    7.681   0.0
  29 refPropagation                       1.436    0.046    0.072    1.554   0.0    9.235   0.0
  30 copyPropagation                    422.261    0.044    0.076  422.381   1.1  431.616   1.1
  31 deadCodeElimination                  1.296    0.042    0.048    1.386   0.0  433.002   1.2
  32 removeWrapRecords                    0.000    0.042    0.042    0.085   0.0  433.087   1.2
  33 removeEmptyRecords                   0.000    0.042    0.042    0.084   0.0  433.171   1.2
  34 localizeGlobals                      0.161    0.042    0.044    0.248   0.0  433.419   1.2
  35 loopInvariantCodeMotion              0.830    0.042    0.043    0.915   0.0  434.334   1.2
  36 prune2                               0.273    0.042    0.043    0.358   0.0  434.691   1.2
  37 returnStarTuplesByRefArgs            0.091    0.042    0.048    0.181   0.0  434.872   1.2
  38 insertWideReferences                 0.010    0.042    0.043    0.094   0.0  434.967   1.2
  39 optimizeOnClauses                    0.139    0.042    0.043    0.223   0.0  435.190   1.2
  40 addInitCalls                         0.001    0.042    0.043    0.085   0.0  435.276   1.2
  41 insertLineNumbers                    0.606    0.048    0.060    0.714   0.0  435.990   1.2
  42 denormalize                          1.351    0.000    0.087    1.438   0.0  437.428   1.2
  43 codegen                              3.617    0.000    0.057    3.674   0.0  441.102   1.2
  44 makeBinary                         37144.184    0.001    1.067  37145.252  98.8  37586.354 100.0
     driverCleanup                        0.000    0.000    0.000    0.000   0.0  37586.354 100.0

     total time                         37583.090    0.746    2.517  37586.354

Pass               Name                   Main    Check    Clean     Time    %     Accum    %  
---- ---------------------------------  -------  -------  -------  ------- -----  ------- -----
  44 makeBinary                         37144.184    0.001    1.067  37145.252  98.8  37145.252  98.8
  30 copyPropagation                    422.261    0.044    0.076  422.381   1.1  37567.633 100.0
  43 codegen                              3.617    0.000    0.057    3.674   0.0  37571.307 100.0
  13 resolve                              3.119    0.017    0.213    3.349   0.0  37574.656 100.0
  29 refPropagation                       1.436    0.046    0.072    1.554   0.0  37576.210 100.0
  42 denormalize                          1.351    0.000    0.087    1.438   0.0  37577.648 100.0
  31 deadCodeElimination                  1.296    0.042    0.048    1.386   0.0  37579.034 100.0
  27 inlineFunctions                      0.985    0.052    0.091    1.128   0.0  37580.162 100.0
  35 loopInvariantCodeMotion              0.830    0.042    0.043    0.915   0.0  37581.077 100.0
  41 insertLineNumbers                    0.606    0.048    0.060    0.714   0.0  37581.791 100.0
   9 normalize                            0.607    0.000    0.016    0.622   0.0  37582.414 100.0
   1 parse                                0.377    0.000    0.018    0.395   0.0  37582.808 100.0
  28 scalarReplace                        0.279    0.049    0.064    0.392   0.0  37583.200 100.0
   7 scopeResolve                         0.372    0.000    0.012    0.384   0.0  37583.584 100.0
  36 prune2                               0.273    0.042    0.043    0.358   0.0  37583.941 100.0
  22 lowerIterators                       0.263    0.011    0.015    0.290   0.0  37584.231 100.0
  34 localizeGlobals                      0.161    0.042    0.044    0.248   0.0  37584.478 100.0
  39 optimizeOnClauses                    0.139    0.042    0.043    0.223   0.0  37584.702 100.0
  11 buildDefaultFunctions                0.185    0.000    0.013    0.198   0.0  37584.900 100.0
  37 returnStarTuplesByRefArgs            0.091    0.042    0.048    0.181   0.0  37585.081 100.0
  19 cullOverReferences                   0.146    0.001    0.025    0.172   0.0  37585.252 100.0
  21 callDestructors                      0.140    0.000    0.013    0.153   0.0  37585.405 100.0
  24 prune                                0.077    0.011    0.025    0.114   0.0  37585.519 100.0
  38 insertWideReferences                 0.010    0.042    0.043    0.094   0.0  37585.614 100.0
  40 addInitCalls                         0.001    0.042    0.043    0.085   0.0  37585.699 100.0
  32 removeWrapRecords                    0.000    0.042    0.042    0.085   0.0  37585.784 100.0
  33 removeEmptyRecords                   0.000    0.042    0.042    0.084   0.0  37585.868 100.0
  20 lowerErrorHandling                   0.055    0.000    0.009    0.064   0.0  37585.932 100.0
     init                                 0.059    0.000    0.000    0.059   0.0  37585.991 100.0
  23 parallel                             0.023    0.011    0.012    0.047   0.0  37586.038 100.0
   6 cleanup                              0.035    0.000    0.006    0.042   0.0  37586.080 100.0
   2 checkParsed                          0.034    0.001    0.006    0.041   0.0  37586.120 100.0
  15 checkResolved                        0.023    0.000    0.012    0.035   0.0  37586.155 100.0
  18 flattenFunctions                     0.020    0.000    0.011    0.032   0.0  37586.187 100.0
  26 removeUnnecessaryAutoCopyCalls       0.010    0.010    0.011    0.031   0.0  37586.218 100.0
  12 createTaskFunctions                  0.002    0.011    0.011    0.024   0.0  37586.242 100.0
  17 processIteratorYields                0.012    0.000    0.012    0.024   0.0  37586.266 100.0
  25 bulkCopyRecords                      0.000    0.011    0.011    0.022   0.0  37586.288 100.0
  14 resolveIntents                       0.005    0.000    0.011    0.017   0.0  37586.305 100.0
  10 checkNormalized                      0.003    0.000    0.011    0.014   0.0  37586.319 100.0
  16 replaceArrayAccessesWithRefTemps     0.000    0.000    0.012    0.012   0.0  37586.331 100.0
   5 expandExternArrayCalls               0.000    0.000    0.006    0.006   0.0  37586.337 100.0
   8 flattenClasses                       0.000    0.000    0.006    0.006   0.0  37586.343 100.0
   3 docs                                 0.000    0.000    0.006    0.006   0.0  37586.348 100.0
   4 readExternC                          0.000    0.000    0.006    0.006   0.0  37586.354 100.0
     driverCleanup                        0.000    0.000    0.000    0.000   0.0  37586.354 100.0
     startup                              0.000    0.000    0.000    0.000   0.0  37586.354 100.0

     total time                         37583.090    0.746    2.517  37586.354

And from the time command of that compilation:

real    626m26.839s
user    626m15.228s
sys     0m10.160s
daviditen commented 6 years ago
  • Is the stack overflow specifically occurring when codegen'ing the large set of + operations? (as opposed to the statements that precede it?)

It's codegen'ing the + operations. There is a gigantic move statement that has a bunch of casts to int(64), dereferences of temps, and calls to +.

A little piece of the big statement is:

cast( type int(64)[10] +( cast( type int(64)[10]
                                +( cast( type int(64)[10]
                                         +( deref( call_tmp[1725427] 112 54 )
                                            deref( call_tmp[1725565] 112 54 ) ) )
                                   deref( call_tmp[1725703] 112 54 ) ) )
                          deref( call_tmp[1725841] 112 54 ) ) )
  • how deep is the recursion?

On my computer it segfaults at 573 frames deep.

  • what's the constant multiplier on the number of stack frames per + operation?

A pattern like this is repeating in the backtrace, so it looks like 5 stack frames per +.

    frame #536: 0x00000001001abfd7 chpl`CallExpr::codegen(this=0x000000011ea1e690) at expr.cpp:3376
    frame #537: 0x0000000100077fb4 chpl`baseASTCodegen(ast=0x000000011ea1e690) at baseAST.cpp:677
    frame #538: 0x00000001001ebe0a chpl`GenRet::GenRet(this=0x00007fff5fba4b68, ast=0x000000011ea1e690) at genret.h:145
    frame #539: 0x00000001001c576d chpl`GenRet::GenRet(this=0x00007fff5fba4b68, ast=0x000000011ea1e690) at genret.h:144
    frame #540: 0x00000001001bf39c chpl`CallExpr::codegenPrimitive(this=0x00000001208c3fc0) at expr.cpp:4736

The compiler option --no-denormalize causes the gigantic statement to stay split up into smaller statements, and can work around this issue.

bradcray commented 6 years ago

The compiler option --no-denormalize causes the gigantic statement to stay split up into smaller statements, and can work around this issue.

Oh, that's interesting. Tagging @mppf and @e-kayrakli on this as it suggests we may want to put a limit on how aggressively we denormalize.

cassella commented 6 years ago

The original test still segfaults on my laptop today.

Would it be possible/reasonable to create a future with enough accesses that it'll fail on all the test systems?

mppf commented 6 years ago

I can reproduce the original bug if I run ulimit -s 1024 first. It looks like limiting denormalize can solve the issue, but I'm also seeing codegenPrimitive using 13KB of stack space per call. I'm looking at a way to solve that problem.

cassella commented 6 years ago

FWIW, this isn't getting in my way or anything. I was just thinking about trying to submit a future for it.

mppf commented 6 years ago

PR #10187 has a fix for this and gets codegenPrimitive stack usage down to 160b. There are some other related calls that occur in the recursion (like CallExpr::codegen) that are still around 1KB but that seems good enough for now.