Closed ian-bertolacci closed 6 years ago
I've broken it on our server as well, when generating with access > 154.
It fails (with 93 accesses) on my laptop as well, 6 hundred-odd stack frames in.
fortytwo@magrathea:~/src/chpl/ogg$ chpl --gdb output.chpl
Reading symbols from chpl...done.
Breakpoint 1 at 0x6fcd39: file misc.cpp, line 61.
(gdb) r
Starting program: /home/fortytwo/src/chapel/bin/linux64/chpl output.chpl
Program received signal SIGSEGV, Segmentation fault.
0x000000000052138c in CallExpr::codegenPrimitive (
this=<error reading variable: Cannot access memory at address 0x7fffff7f6330>) at expr.cpp:3490
3490 GenRet CallExpr::codegenPrimitive() {
(gdb) i s
#0 0x000000000052138c in CallExpr::codegenPrimitive (
this=<error reading variable: Cannot access memory at address 0x7fffff7f6330>) at expr.cpp:3490
#1 0x000000000052077e in CallExpr::codegen (this=0xb10e9a0) at expr.cpp:3376
#2 0x000000000045c9d2 in baseASTCodegen (ast=0xb10e9a0) at baseAST.cpp:677
#3 0x0000000000535782 in GenRet::GenRet (this=0x7fffff805ed0, ast=0xb10e9a0) at ../include/genret.h:145
#4 0x00000000005233f2 in CallExpr::codegenPrimitive (this=0xb02cc0) at expr.cpp:3792
#5 0x000000000052077e in CallExpr::codegen (this=0xb02cc0) at expr.cpp:3376
#6 0x000000000045c9d2 in baseASTCodegen (ast=0xb02cc0) at baseAST.cpp:677
#7 0x0000000000535782 in GenRet::GenRet (this=0x7fffff81c670, ast=0xb02cc0) at ../include/genret.h:145
#8 0x000000000052a8b7 in CallExpr::codegenPrimitive (this=0x2bace80) at expr.cpp:4736
#9 0x000000000052077e in CallExpr::codegen (this=0x2bace80) at expr.cpp:3376
#10 0x000000000045c9d2 in baseASTCodegen (ast=0x2bace80) at baseAST.cpp:677
#11 0x0000000000535782 in GenRet::GenRet (this=0x7fffff81fbb0, ast=0x2bace80) at ../include/genret.h:145
...
#626 0x000000000045c9d2 in baseASTCodegen (ast=0x1889e90) at baseAST.cpp:677
#627 0x0000000000535782 in GenRet::GenRet (this=0x7ffffffe47f0, ast=0x1889e90) at ../include/genret.h:145
#628 0x000000000052341b in CallExpr::codegenPrimitive (this=0xcd89eb0) at expr.cpp:3792
#629 0x000000000052077e in CallExpr::codegen (this=0xcd89eb0) at expr.cpp:3376
#630 0x000000000045c9d2 in baseASTCodegen (ast=0xcd89eb0) at baseAST.cpp:677
#631 0x0000000000535782 in GenRet::GenRet (this=0x7ffffffef7f0, ast=0xcd89eb0) at ../include/genret.h:145
#632 0x0000000000530dd7 in CallExpr::codegenPrimMove (this=0xcd89db0) at expr.cpp:5185
#633 0x0000000000522465 in CallExpr::codegenPrimitive (this=0xcd89db0) at expr.cpp:3618
#634 0x000000000052077e in CallExpr::codegen (this=0xcd89db0) at expr.cpp:3376
#635 0x000000000053975f in AList::codegen (this=0xafc6b30, separator=0x7183c7 "") at alist.cpp:35
#636 0x0000000000539daa in BlockStmt::codegen (this=0xafc6ae0) at stmt.cpp:115
#637 0x000000000053d750 in FnSymbol::codegenDef (this=0xafc63b0) at symbol.cpp:1536
#638 0x000000000053db45 in ModuleSymbol::codegenDef (this=0x253b190) at symbol.cpp:1654
#639 0x0000000000504b2f in codegen () at codegen.cpp:2261
#640 0x00000000005661f8 in runPass (tracker=..., passIndex=40, isChpldoc=false) at runpasses.cpp:202
#641 0x000000000056610c in runPasses (tracker=..., isChpldoc=false) at runpasses.cpp:169
#642 0x000000000056131d in main (argc=2, argv=0x7fffffffde78) at driver.cpp:1350
I've changed the code generator, to generate compound assignment statements (s += r0; s+= r1;...
) instead of a huge addition expression (s=r0+r1...;
).
This seems to improve things, and I can now compile with a very large number of accesses.
Im not sure what the upper bound is (for reasons that will become obvious) but I was able to compile with 30,000 accesses on the research server.
It did take ~11 hours to find that out. Im not sure how much time is spent in the Chapel compiler vs the C compiler, but they both take a significant portion of time.
Wonder what happens if I add --fast
to the mix.....
#!/usr/bin/env python3
import argparse
parser = argparse.ArgumentParser()
parser.add_argument( "--accesses", type=int, default=100 )
parser.add_argument( "--N", type=int, default=1000 )
args = parser.parse_args()
accesses = args.accesses
N = args.N
modifier = "ref"
with open("gen_{}_{}.chpl".format(N,accesses), "w") as output:
output.write(
"\n".join(
[
"use Time;",
"// writeln substitute",
"proc w( v ){",
"\twriteln( v );",
"}",
"proc main(){",
"\tconst N : int = {};".format(N),
"\tvar a : [0..#N] int;",
]
)
)
for i in range(1):
output.write( "w(t{}(a));".format( i ) )
output.write("}// Close main\n")
for proc_i in range(1):
output.write("proc t{}(ref a){{var t=new Timer();t.start();".format(proc_i))
for access_i in range(accesses):
index = access_i % N
output.write("{} r{}=a[{}];".format(modifier, access_i, index) )
output.write("t.stop();var s=r0;")
for access_i in range(1,accesses):
output.write("s+=r{};".format(access_i))
output.write(";return (t.elapsed()/{},s);}}".format(accesses));
(I've also added some cli args, and it generates to a file with a meaningful name)
This looks like a stack overflow during codegen. I can reproduce it on my MBP laptop with 56 accesses, but not 55.
Interesting. So the "machine dependent"-ness of this issue boils down to how much stack space the OS allocates to an application? I wonder if I upgraded my Laptop to 32GB if I would could compile with 154 accesses under the original code scheme.
@daviditen : Out of curiosity do you know:
@bradcray I just tried to compile my original code with --print-commands --print-passes
and got the following:
> chpl --print-commands --print-passes ./test.chpl
init : 0.067 seconds
parse : 0.158 seconds
checkParsed : 0.020 seconds
docs : 0.003 seconds
readExternC : 0.003 seconds
expandExternArrayCalls : 0.004 seconds
cleanup : 0.032 seconds
scopeResolve : 0.264 seconds
flattenClasses : 0.003 seconds
normalize : 0.370 seconds
checkNormalized : 0.011 seconds
buildDefaultFunctions : 0.200 seconds
createTaskFunctions : 0.021 seconds
resolve : 2.135 seconds
resolveIntents : 0.005 seconds
checkResolved : 0.012 seconds
replaceArrayAccessesWithRefTemps : 0.007 seconds
processIteratorYields : 0.013 seconds
flattenFunctions : 0.016 seconds
cullOverReferences : 0.012 seconds
lowerErrorHandling : 0.019 seconds
callDestructors : 0.039 seconds
lowerIterators : 0.138 seconds
parallel : 0.021 seconds
prune : 0.055 seconds
bulkCopyRecords : 0.005 seconds
removeUnnecessaryAutoCopyCalls : 0.022 seconds
inlineFunctions : 0.119 seconds
scalarReplace : 0.025 seconds
refPropagation : 0.072 seconds
copyPropagation : 0.125 seconds
deadCodeElimination : 0.051 seconds
removeWrapRecords : 0.006 seconds
removeEmptyRecords : 0.005 seconds
localizeGlobals : 0.013 seconds
loopInvariantCodeMotion : 0.048 seconds
prune2 : 0.012 seconds
returnStarTuplesByRefArgs : 0.006 seconds
insertWideReferences : 0.003 seconds
optimizeOnClauses : 0.017 seconds
addInitCalls : 0.006 seconds
insertLineNumbers : 0.028 seconds
denormalize : 0.052 seconds
Segmentation fault (core dumped)
Also, I'm beginning to doubt my ~11 hour build time. I just tried the same compile with --fast
and it took 7 minutes, which was very surprising, so I'm trying it again. Maybe my parallel build idea was not so good...
Also, the binary is much smaller, 3.6MB with --fast
vs 25MB without.
Also, the original 93 accesses code also doesn't compile under --fast
.
Now this is really interesting, and could probably be its own issue. As I said in my third comment, I updated my code generator, and can now compile benchmarks with tens of thousands of accesses. In that example, I generated and compiled a program with 30K accesses of a 100K element array.
I also mentioned that it took ~11 hours to compile without --fast
.
Yesterday I found that with --fast
, compilation only takes 7 minutes
That's one hell of a difference, and the total opposite of what I expected.
The --print-passes-file
(below) show that the makeBinary
'pass' contributes the most to the compile time (404 seconds with --fast
and 37144 seconds without --fast
).
Here is the --print-passes-file
output from the compilation with --fast
:
init : 0.060 seconds
parse : 0.347 seconds
checkParsed : 0.037 seconds
docs : 0.005 seconds
readExternC : 0.005 seconds
expandExternArrayCalls : 0.006 seconds
cleanup : 0.038 seconds
scopeResolve : 0.380 seconds
flattenClasses : 0.006 seconds
normalize : 0.606 seconds
checkNormalized : 0.014 seconds
buildDefaultFunctions : 0.185 seconds
createTaskFunctions : 0.024 seconds
resolve : 3.211 seconds
resolveIntents : 0.017 seconds
checkResolved : 0.034 seconds
replaceArrayAccessesWithRefTemps : 0.012 seconds
processIteratorYields : 0.026 seconds
flattenFunctions : 0.029 seconds
cullOverReferences : 0.167 seconds
lowerErrorHandling : 0.047 seconds
callDestructors : 0.165 seconds
lowerIterators : 0.281 seconds
parallel : 0.049 seconds
prune : 0.108 seconds
bulkCopyRecords : 0.022 seconds
removeUnnecessaryAutoCopyCalls : 0.030 seconds
inlineFunctions : 0.594 seconds
scalarReplace : 0.329 seconds
refPropagation : 0.916 seconds
copyPropagation : 1.415 seconds
deadCodeElimination : 0.455 seconds
removeWrapRecords : 0.030 seconds
removeEmptyRecords : 0.030 seconds
localizeGlobals : 0.099 seconds
loopInvariantCodeMotion : 0.129 seconds
prune2 : 0.138 seconds
returnStarTuplesByRefArgs : 0.055 seconds
insertWideReferences : 0.033 seconds
optimizeOnClauses : 0.071 seconds
addInitCalls : 0.030 seconds
insertLineNumbers : 0.193 seconds
denormalize : 0.720 seconds
codegen : 0.859 seconds
makeBinary : 404.932 seconds
driverCleanup : 0.000 seconds
total time : 417.207 seconds
Pass Name Main Check Clean Time % Accum %
---- --------------------------------- ------- ------- ------- ------- ----- ------- -----
startup 0.000 0.000 0.000 0.000 0.0 0.000 0.0
init 0.060 0.000 0.000 0.060 0.0 0.060 0.0
1 parse 0.332 0.000 0.015 0.347 0.1 0.407 0.1
2 checkParsed 0.030 0.001 0.006 0.037 0.0 0.444 0.1
3 docs 0.000 0.000 0.005 0.005 0.0 0.450 0.1
4 readExternC 0.000 0.000 0.005 0.005 0.0 0.455 0.1
5 expandExternArrayCalls 0.000 0.000 0.005 0.006 0.0 0.461 0.1
6 cleanup 0.032 0.000 0.006 0.038 0.0 0.499 0.1
7 scopeResolve 0.368 0.000 0.012 0.380 0.1 0.879 0.2
8 flattenClasses 0.000 0.000 0.006 0.006 0.0 0.885 0.2
9 normalize 0.591 0.000 0.015 0.606 0.1 1.491 0.4
10 checkNormalized 0.003 0.000 0.011 0.014 0.0 1.505 0.4
11 buildDefaultFunctions 0.174 0.000 0.012 0.185 0.0 1.690 0.4
12 createTaskFunctions 0.002 0.011 0.011 0.024 0.0 1.714 0.4
13 resolve 2.988 0.017 0.206 3.211 0.8 4.925 1.2
14 resolveIntents 0.005 0.000 0.011 0.017 0.0 4.942 1.2
15 checkResolved 0.022 0.000 0.011 0.034 0.0 4.976 1.2
16 replaceArrayAccessesWithRefTemps 0.000 0.000 0.011 0.012 0.0 4.988 1.2
17 processIteratorYields 0.014 0.000 0.012 0.026 0.0 5.014 1.2
18 flattenFunctions 0.017 0.000 0.011 0.029 0.0 5.043 1.2
19 cullOverReferences 0.143 0.001 0.024 0.167 0.0 5.210 1.2
20 lowerErrorHandling 0.038 0.000 0.009 0.047 0.0 5.257 1.3
21 callDestructors 0.152 0.000 0.012 0.165 0.0 5.421 1.3
22 lowerIterators 0.254 0.013 0.015 0.281 0.1 5.703 1.4
23 parallel 0.022 0.015 0.012 0.049 0.0 5.752 1.4
24 prune 0.072 0.012 0.024 0.108 0.0 5.860 1.4
25 bulkCopyRecords 0.000 0.010 0.011 0.022 0.0 5.882 1.4
26 removeUnnecessaryAutoCopyCalls 0.009 0.010 0.011 0.030 0.0 5.912 1.4
27 inlineFunctions 0.480 0.026 0.088 0.594 0.1 6.506 1.6
28 scalarReplace 0.240 0.028 0.061 0.329 0.1 6.835 1.6
29 refPropagation 0.847 0.021 0.048 0.916 0.2 7.751 1.9
30 copyPropagation 1.280 0.017 0.118 1.415 0.3 9.167 2.2
31 deadCodeElimination 0.422 0.015 0.018 0.455 0.1 9.621 2.3
32 removeWrapRecords 0.000 0.015 0.015 0.030 0.0 9.651 2.3
33 removeEmptyRecords 0.000 0.015 0.015 0.030 0.0 9.681 2.3
34 localizeGlobals 0.070 0.015 0.015 0.099 0.0 9.780 2.3
35 loopInvariantCodeMotion 0.099 0.015 0.015 0.129 0.0 9.909 2.4
36 prune2 0.108 0.015 0.015 0.138 0.0 10.047 2.4
37 returnStarTuplesByRefArgs 0.025 0.015 0.015 0.055 0.0 10.103 2.4
38 insertWideReferences 0.003 0.015 0.015 0.033 0.0 10.136 2.4
39 optimizeOnClauses 0.041 0.015 0.015 0.071 0.0 10.207 2.4
40 addInitCalls 0.001 0.015 0.015 0.030 0.0 10.237 2.5
41 insertLineNumbers 0.156 0.015 0.021 0.193 0.0 10.430 2.5
42 denormalize 0.667 0.000 0.053 0.720 0.2 11.150 2.7
43 codegen 0.841 0.000 0.018 0.859 0.2 12.009 2.9
44 makeBinary 404.912 0.000 0.285 405.198 97.1 417.207 100.0
driverCleanup 0.000 0.000 0.000 0.000 0.0 417.207 100.0
total time 415.524 0.345 1.337 417.207
Pass Name Main Check Clean Time % Accum %
---- --------------------------------- ------- ------- ------- ------- ----- ------- -----
44 makeBinary 404.912 0.000 0.285 405.198 97.1 405.198 97.1
13 resolve 2.988 0.017 0.206 3.211 0.8 408.409 97.9
30 copyPropagation 1.280 0.017 0.118 1.415 0.3 409.824 98.2
29 refPropagation 0.847 0.021 0.048 0.916 0.2 410.740 98.5
43 codegen 0.841 0.000 0.018 0.859 0.2 411.599 98.7
42 denormalize 0.667 0.000 0.053 0.720 0.2 412.320 98.8
9 normalize 0.591 0.000 0.015 0.606 0.1 412.926 99.0
27 inlineFunctions 0.480 0.026 0.088 0.594 0.1 413.520 99.1
31 deadCodeElimination 0.422 0.015 0.018 0.455 0.1 413.975 99.2
7 scopeResolve 0.368 0.000 0.012 0.380 0.1 414.354 99.3
1 parse 0.332 0.000 0.015 0.347 0.1 414.701 99.4
28 scalarReplace 0.240 0.028 0.061 0.329 0.1 415.031 99.5
22 lowerIterators 0.254 0.013 0.015 0.281 0.1 415.312 99.5
41 insertLineNumbers 0.156 0.015 0.021 0.193 0.0 415.505 99.6
11 buildDefaultFunctions 0.174 0.000 0.012 0.185 0.0 415.690 99.6
19 cullOverReferences 0.143 0.001 0.024 0.167 0.0 415.857 99.7
21 callDestructors 0.152 0.000 0.012 0.165 0.0 416.022 99.7
36 prune2 0.108 0.015 0.015 0.138 0.0 416.160 99.7
35 loopInvariantCodeMotion 0.099 0.015 0.015 0.129 0.0 416.289 99.8
24 prune 0.072 0.012 0.024 0.108 0.0 416.397 99.8
34 localizeGlobals 0.070 0.015 0.015 0.099 0.0 416.496 99.8
39 optimizeOnClauses 0.041 0.015 0.015 0.071 0.0 416.567 99.8
init 0.060 0.000 0.000 0.060 0.0 416.628 99.9
37 returnStarTuplesByRefArgs 0.025 0.015 0.015 0.055 0.0 416.683 99.9
23 parallel 0.022 0.015 0.012 0.049 0.0 416.732 99.9
20 lowerErrorHandling 0.038 0.000 0.009 0.047 0.0 416.779 99.9
6 cleanup 0.032 0.000 0.006 0.038 0.0 416.818 99.9
2 checkParsed 0.030 0.001 0.006 0.037 0.0 416.854 99.9
15 checkResolved 0.022 0.000 0.011 0.034 0.0 416.888 99.9
38 insertWideReferences 0.003 0.015 0.015 0.033 0.0 416.921 99.9
26 removeUnnecessaryAutoCopyCalls 0.009 0.010 0.011 0.030 0.0 416.952 99.9
40 addInitCalls 0.001 0.015 0.015 0.030 0.0 416.982 99.9
33 removeEmptyRecords 0.000 0.015 0.015 0.030 0.0 417.012 100.0
32 removeWrapRecords 0.000 0.015 0.015 0.030 0.0 417.042 100.0
18 flattenFunctions 0.017 0.000 0.011 0.029 0.0 417.070 100.0
17 processIteratorYields 0.014 0.000 0.012 0.026 0.0 417.096 100.0
12 createTaskFunctions 0.002 0.011 0.011 0.024 0.0 417.120 100.0
25 bulkCopyRecords 0.000 0.010 0.011 0.022 0.0 417.142 100.0
14 resolveIntents 0.005 0.000 0.011 0.017 0.0 417.159 100.0
10 checkNormalized 0.003 0.000 0.011 0.014 0.0 417.173 100.0
16 replaceArrayAccessesWithRefTemps 0.000 0.000 0.011 0.012 0.0 417.184 100.0
8 flattenClasses 0.000 0.000 0.006 0.006 0.0 417.190 100.0
5 expandExternArrayCalls 0.000 0.000 0.005 0.006 0.0 417.196 100.0
3 docs 0.000 0.000 0.005 0.005 0.0 417.202 100.0
4 readExternC 0.000 0.000 0.005 0.005 0.0 417.207 100.0
startup 0.000 0.000 0.000 0.000 0.0 417.207 100.0
driverCleanup 0.000 0.000 0.000 0.000 0.0 417.207 100.0
total time 415.524 0.345 1.337 417.207
And from the time command of that compilation:
real 6m57.386s
user 6m55.520s
sys 0m1.752s
Here is the --print-passes-file
output from the compilation without --fast
:
init : 0.059 seconds
parse : 0.395 seconds
checkParsed : 0.041 seconds
docs : 0.006 seconds
readExternC : 0.006 seconds
expandExternArrayCalls : 0.006 seconds
cleanup : 0.042 seconds
scopeResolve : 0.384 seconds
flattenClasses : 0.006 seconds
normalize : 0.622 seconds
checkNormalized : 0.014 seconds
buildDefaultFunctions : 0.198 seconds
createTaskFunctions : 0.024 seconds
resolve : 3.349 seconds
resolveIntents : 0.017 seconds
checkResolved : 0.035 seconds
replaceArrayAccessesWithRefTemps : 0.012 seconds
processIteratorYields : 0.024 seconds
flattenFunctions : 0.032 seconds
cullOverReferences : 0.172 seconds
lowerErrorHandling : 0.064 seconds
callDestructors : 0.153 seconds
lowerIterators : 0.290 seconds
parallel : 0.047 seconds
prune : 0.114 seconds
bulkCopyRecords : 0.022 seconds
removeUnnecessaryAutoCopyCalls : 0.031 seconds
inlineFunctions : 1.128 seconds
scalarReplace : 0.392 seconds
refPropagation : 1.554 seconds
copyPropagation : 422.381 seconds
deadCodeElimination : 1.386 seconds
removeWrapRecords : 0.085 seconds
removeEmptyRecords : 0.084 seconds
localizeGlobals : 0.248 seconds
loopInvariantCodeMotion : 0.915 seconds
prune2 : 0.358 seconds
returnStarTuplesByRefArgs : 0.181 seconds
insertWideReferences : 0.094 seconds
optimizeOnClauses : 0.223 seconds
addInitCalls : 0.085 seconds
insertLineNumbers : 0.714 seconds
denormalize : 1.438 seconds
codegen : 3.674 seconds
makeBinary :37144.535 seconds
driverCleanup : 0.000 seconds
total time :37586.354 seconds
Pass Name Main Check Clean Time % Accum %
---- --------------------------------- ------- ------- ------- ------- ----- ------- -----
startup 0.000 0.000 0.000 0.000 0.0 0.000 0.0
init 0.059 0.000 0.000 0.059 0.0 0.059 0.0
1 parse 0.377 0.000 0.018 0.395 0.0 0.454 0.0
2 checkParsed 0.034 0.001 0.006 0.041 0.0 0.495 0.0
3 docs 0.000 0.000 0.006 0.006 0.0 0.500 0.0
4 readExternC 0.000 0.000 0.006 0.006 0.0 0.506 0.0
5 expandExternArrayCalls 0.000 0.000 0.006 0.006 0.0 0.512 0.0
6 cleanup 0.035 0.000 0.006 0.042 0.0 0.554 0.0
7 scopeResolve 0.372 0.000 0.012 0.384 0.0 0.938 0.0
8 flattenClasses 0.000 0.000 0.006 0.006 0.0 0.944 0.0
9 normalize 0.607 0.000 0.016 0.622 0.0 1.566 0.0
10 checkNormalized 0.003 0.000 0.011 0.014 0.0 1.580 0.0
11 buildDefaultFunctions 0.185 0.000 0.013 0.198 0.0 1.778 0.0
12 createTaskFunctions 0.002 0.011 0.011 0.024 0.0 1.802 0.0
13 resolve 3.119 0.017 0.213 3.349 0.0 5.151 0.0
14 resolveIntents 0.005 0.000 0.011 0.017 0.0 5.168 0.0
15 checkResolved 0.023 0.000 0.012 0.035 0.0 5.202 0.0
16 replaceArrayAccessesWithRefTemps 0.000 0.000 0.012 0.012 0.0 5.214 0.0
17 processIteratorYields 0.012 0.000 0.012 0.024 0.0 5.238 0.0
18 flattenFunctions 0.020 0.000 0.011 0.032 0.0 5.270 0.0
19 cullOverReferences 0.146 0.001 0.025 0.172 0.0 5.441 0.0
20 lowerErrorHandling 0.055 0.000 0.009 0.064 0.0 5.505 0.0
21 callDestructors 0.140 0.000 0.013 0.153 0.0 5.658 0.0
22 lowerIterators 0.263 0.011 0.015 0.290 0.0 5.948 0.0
23 parallel 0.023 0.011 0.012 0.047 0.0 5.995 0.0
24 prune 0.077 0.011 0.025 0.114 0.0 6.108 0.0
25 bulkCopyRecords 0.000 0.011 0.011 0.022 0.0 6.130 0.0
26 removeUnnecessaryAutoCopyCalls 0.010 0.010 0.011 0.031 0.0 6.162 0.0
27 inlineFunctions 0.985 0.052 0.091 1.128 0.0 7.290 0.0
28 scalarReplace 0.279 0.049 0.064 0.392 0.0 7.681 0.0
29 refPropagation 1.436 0.046 0.072 1.554 0.0 9.235 0.0
30 copyPropagation 422.261 0.044 0.076 422.381 1.1 431.616 1.1
31 deadCodeElimination 1.296 0.042 0.048 1.386 0.0 433.002 1.2
32 removeWrapRecords 0.000 0.042 0.042 0.085 0.0 433.087 1.2
33 removeEmptyRecords 0.000 0.042 0.042 0.084 0.0 433.171 1.2
34 localizeGlobals 0.161 0.042 0.044 0.248 0.0 433.419 1.2
35 loopInvariantCodeMotion 0.830 0.042 0.043 0.915 0.0 434.334 1.2
36 prune2 0.273 0.042 0.043 0.358 0.0 434.691 1.2
37 returnStarTuplesByRefArgs 0.091 0.042 0.048 0.181 0.0 434.872 1.2
38 insertWideReferences 0.010 0.042 0.043 0.094 0.0 434.967 1.2
39 optimizeOnClauses 0.139 0.042 0.043 0.223 0.0 435.190 1.2
40 addInitCalls 0.001 0.042 0.043 0.085 0.0 435.276 1.2
41 insertLineNumbers 0.606 0.048 0.060 0.714 0.0 435.990 1.2
42 denormalize 1.351 0.000 0.087 1.438 0.0 437.428 1.2
43 codegen 3.617 0.000 0.057 3.674 0.0 441.102 1.2
44 makeBinary 37144.184 0.001 1.067 37145.252 98.8 37586.354 100.0
driverCleanup 0.000 0.000 0.000 0.000 0.0 37586.354 100.0
total time 37583.090 0.746 2.517 37586.354
Pass Name Main Check Clean Time % Accum %
---- --------------------------------- ------- ------- ------- ------- ----- ------- -----
44 makeBinary 37144.184 0.001 1.067 37145.252 98.8 37145.252 98.8
30 copyPropagation 422.261 0.044 0.076 422.381 1.1 37567.633 100.0
43 codegen 3.617 0.000 0.057 3.674 0.0 37571.307 100.0
13 resolve 3.119 0.017 0.213 3.349 0.0 37574.656 100.0
29 refPropagation 1.436 0.046 0.072 1.554 0.0 37576.210 100.0
42 denormalize 1.351 0.000 0.087 1.438 0.0 37577.648 100.0
31 deadCodeElimination 1.296 0.042 0.048 1.386 0.0 37579.034 100.0
27 inlineFunctions 0.985 0.052 0.091 1.128 0.0 37580.162 100.0
35 loopInvariantCodeMotion 0.830 0.042 0.043 0.915 0.0 37581.077 100.0
41 insertLineNumbers 0.606 0.048 0.060 0.714 0.0 37581.791 100.0
9 normalize 0.607 0.000 0.016 0.622 0.0 37582.414 100.0
1 parse 0.377 0.000 0.018 0.395 0.0 37582.808 100.0
28 scalarReplace 0.279 0.049 0.064 0.392 0.0 37583.200 100.0
7 scopeResolve 0.372 0.000 0.012 0.384 0.0 37583.584 100.0
36 prune2 0.273 0.042 0.043 0.358 0.0 37583.941 100.0
22 lowerIterators 0.263 0.011 0.015 0.290 0.0 37584.231 100.0
34 localizeGlobals 0.161 0.042 0.044 0.248 0.0 37584.478 100.0
39 optimizeOnClauses 0.139 0.042 0.043 0.223 0.0 37584.702 100.0
11 buildDefaultFunctions 0.185 0.000 0.013 0.198 0.0 37584.900 100.0
37 returnStarTuplesByRefArgs 0.091 0.042 0.048 0.181 0.0 37585.081 100.0
19 cullOverReferences 0.146 0.001 0.025 0.172 0.0 37585.252 100.0
21 callDestructors 0.140 0.000 0.013 0.153 0.0 37585.405 100.0
24 prune 0.077 0.011 0.025 0.114 0.0 37585.519 100.0
38 insertWideReferences 0.010 0.042 0.043 0.094 0.0 37585.614 100.0
40 addInitCalls 0.001 0.042 0.043 0.085 0.0 37585.699 100.0
32 removeWrapRecords 0.000 0.042 0.042 0.085 0.0 37585.784 100.0
33 removeEmptyRecords 0.000 0.042 0.042 0.084 0.0 37585.868 100.0
20 lowerErrorHandling 0.055 0.000 0.009 0.064 0.0 37585.932 100.0
init 0.059 0.000 0.000 0.059 0.0 37585.991 100.0
23 parallel 0.023 0.011 0.012 0.047 0.0 37586.038 100.0
6 cleanup 0.035 0.000 0.006 0.042 0.0 37586.080 100.0
2 checkParsed 0.034 0.001 0.006 0.041 0.0 37586.120 100.0
15 checkResolved 0.023 0.000 0.012 0.035 0.0 37586.155 100.0
18 flattenFunctions 0.020 0.000 0.011 0.032 0.0 37586.187 100.0
26 removeUnnecessaryAutoCopyCalls 0.010 0.010 0.011 0.031 0.0 37586.218 100.0
12 createTaskFunctions 0.002 0.011 0.011 0.024 0.0 37586.242 100.0
17 processIteratorYields 0.012 0.000 0.012 0.024 0.0 37586.266 100.0
25 bulkCopyRecords 0.000 0.011 0.011 0.022 0.0 37586.288 100.0
14 resolveIntents 0.005 0.000 0.011 0.017 0.0 37586.305 100.0
10 checkNormalized 0.003 0.000 0.011 0.014 0.0 37586.319 100.0
16 replaceArrayAccessesWithRefTemps 0.000 0.000 0.012 0.012 0.0 37586.331 100.0
5 expandExternArrayCalls 0.000 0.000 0.006 0.006 0.0 37586.337 100.0
8 flattenClasses 0.000 0.000 0.006 0.006 0.0 37586.343 100.0
3 docs 0.000 0.000 0.006 0.006 0.0 37586.348 100.0
4 readExternC 0.000 0.000 0.006 0.006 0.0 37586.354 100.0
driverCleanup 0.000 0.000 0.000 0.000 0.0 37586.354 100.0
startup 0.000 0.000 0.000 0.000 0.0 37586.354 100.0
total time 37583.090 0.746 2.517 37586.354
And from the time command of that compilation:
real 626m26.839s
user 626m15.228s
sys 0m10.160s
- Is the stack overflow specifically occurring when codegen'ing the large set of + operations? (as opposed to the statements that precede it?)
It's codegen'ing the + operations. There is a gigantic move statement that has a bunch of casts to int(64), dereferences of temps, and calls to +.
A little piece of the big statement is:
cast( type int(64)[10] +( cast( type int(64)[10]
+( cast( type int(64)[10]
+( deref( call_tmp[1725427] 112 54 )
deref( call_tmp[1725565] 112 54 ) ) )
deref( call_tmp[1725703] 112 54 ) ) )
deref( call_tmp[1725841] 112 54 ) ) )
- how deep is the recursion?
On my computer it segfaults at 573 frames deep.
- what's the constant multiplier on the number of stack frames per + operation?
A pattern like this is repeating in the backtrace, so it looks like 5 stack frames per +.
frame #536: 0x00000001001abfd7 chpl`CallExpr::codegen(this=0x000000011ea1e690) at expr.cpp:3376
frame #537: 0x0000000100077fb4 chpl`baseASTCodegen(ast=0x000000011ea1e690) at baseAST.cpp:677
frame #538: 0x00000001001ebe0a chpl`GenRet::GenRet(this=0x00007fff5fba4b68, ast=0x000000011ea1e690) at genret.h:145
frame #539: 0x00000001001c576d chpl`GenRet::GenRet(this=0x00007fff5fba4b68, ast=0x000000011ea1e690) at genret.h:144
frame #540: 0x00000001001bf39c chpl`CallExpr::codegenPrimitive(this=0x00000001208c3fc0) at expr.cpp:4736
The compiler option --no-denormalize
causes the gigantic statement to stay split up into smaller statements, and can work around this issue.
The compiler option --no-denormalize causes the gigantic statement to stay split up into smaller statements, and can work around this issue.
Oh, that's interesting. Tagging @mppf and @e-kayrakli on this as it suggests we may want to put a limit on how aggressively we denormalize.
The original test still segfaults on my laptop today.
Would it be possible/reasonable to create a future with enough accesses that it'll fail on all the test systems?
I can reproduce the original bug if I run ulimit -s 1024
first.
It looks like limiting denormalize can solve the issue, but I'm also seeing codegenPrimitive using 13KB of stack space per call. I'm looking at a way to solve that problem.
FWIW, this isn't getting in my way or anything. I was just thinking about trying to submit a future for it.
PR #10187 has a fix for this and gets codegenPrimitive stack usage down to 160b. There are some other related calls that occur in the recursion (like CallExpr::codegen) that are still around 1KB but that seems good enough for now.
Summary of Problem
The below program does not compile on some systems (my laptop) but does on others (our high performance
gaming pcresearch test server).Attached is the information about both systems. They have identical chapel versions, linux kernels, gcc versions, so cpu, current memory, and PCI information is also included, since my hunch is that this is a really weird hardware issue
The code is automatically generated for the purpose of timing how long accesses of an array take. The code in question was generated to time 93 total accesses into an array. Compiling this code (or outputting C code) produces a segmentation fault on the laptop, but successful builds on the research server. However, code generated to time 92 total accesses successfully builds on both the laptop and research machine.
Steps to Reproduce
Source Code:
This code was generate with the following python script:
Compile command: Either
chpl output.chpl
chpl --savec c_src output.chpl
Both fail which indicates (to me) achpl
issue and not agcc
issue.The python script can be run as a script. If you want to change the number of accesses generated, set the variable
accesses
to the desired number. The parameterN
is the size of the array, but changing it does not seem to impact whether or not it compiles.Configuration Information
Failing System (Laptop):
chpl --version
: chpl Version 1.16.0.e43acc7$CHPL_HOME/util/printchplenv --anonymize
:gcc --version
: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609uname -a
: Linux REDACTEDHOSTNAME 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linuxlscpu
:cat /proc/meminfo
lspci
Succeeding System (Research Server)
chpl --version
: chpl Version 1.16.0.e43acc7$CHPL_HOME/util/printchplenv --anonymize
:gcc --version
: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609uname -a
: Linux REDACTEDHOSTNAME 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linuxlscpu
:cat /proc/meminfo
lspci