JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.46k stars 5.46k forks source link

StackOverflowError in summarysize during misc test #19707

Closed nalimilan closed 7 years ago

nalimilan commented 7 years ago

I see new failures when building Fedora/RHEL RPM nightlies. These have bee, introduced since d47f24b638e8b0dfc1ec73d1406d7a66ae4b4a50 (from Dec. 22).

Error in testset misc:
Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(Core.Inference)))) $(Expr(:escape, :>)) $(Expr(:escape, :(Core.sizeof(Core))))
  StackOverflowError:
Error in testset misc:
Error During Test
  Test threw an exception of type StackOverflowError
  Expression: summarysize(Base) > 10000 * sizeof(Int)
  StackOverflowError:
Error in testset ambiguous:
Error During Test
  Expression evaluated to non-Boolean
  Expression: ((x->begin 
        isempty(x) || println(x)
    end))(detect_ambiguities(Core,Base; imported=true))
       Value: nothing

https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-25-x86_64/00492486-julia/build.log.gz

colbec commented 7 years ago

Also seen in openSUSE Leap 42.2.

tkelman commented 7 years ago

Does this happen in a default source build, or only when you're building against some system dependencies? Scroll up to where the ambiguity test ran and it should be printing the list of ambiguous methods.

colbec commented 7 years ago

Not sure what you mean by "list of ambiguous methods." In:

Julia Version 0.6.0-dev.1698
Commit 4793e88* (2016-12-25 07:38 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

terminal output that mentions "ambiguous" (selectively) is:

threads (2)               |   38.54  |  1.34  |  3.5 | 83.29      | 2158.79 
error (2)                 |    0.64  |  0.00  |  0.0 |  2.63      | 2158.79 
WARNING: Method definition ambig(Any, Integer) in module TestMain_ambiguous at /home/colin/Downloads/julia6/test/ambiguous.jl:7 overwritten at /home/colin/Downloads/julia6/test/ambiguous.jl:84.
boundscheck (5)           |    2.45  |  0.00  |  0.0 |  4.55      | 1654.04 
cartesian (5)             |    0.04  |  0.00  |  0.0 |  0.10      | 1656.18 
    From worker 2:  Skipping Base.<|
    From worker 2:  Skipping Base.>:
    From worker 2:  Skipping Base.active_repl
    From worker 2:  Skipping Base.active_repl_backend
asmvariant (5)            |    0.05  |  0.00  |  0.0 |  1.33      | 1658.32 
osutils (5)               |    0.02  |  0.00  |  0.0 |  0.24      | 1660.92 
ambiguous (2)             |    1.01  |  0.05  |  5.1 | 33.64      | 2158.79 
channels (5)              |    8.77  |  0.75  |  8.6 | 446.92     | 1877.34 
dsp (5)                   |    6.67  |  0.00  |  0.0 | 319.38     | 1877.34 
fft (2)                   |   16.24  |  0.79  |  4.9 | 767.73     | 2158.79 
examples (5)              |   12.14  |  0.00  |  0.0 | 679.32     | 2485.93 
parallel (3)              |   62.43  |  0.00  |  0.0 |  1.43      | 1100.48 
cmdlineargs (4)           |  102.76  |  0.05  |  0.1 | 50.24      | 1121.77 
    From worker 1:  Worker 3 failed running test misc:
Some tests did not pass: 1279452 passed, 0 failed, 2 errored, 0 broken.
 in finish at ./test.jl:524
 in macro expansion; at ./util.jl:288 [inlined]
 in macro expansion; at /home/colin/Downloads/julia6/test/testdefs.jl:0 [inlined]
 in anonymous at ./<missing>:?
 in eval at ./boot.jl:236
 in runtests at /home/colin/Downloads/julia6/test/testdefs.jl:16
 in runtests at /home/colin/Downloads/julia6/test/testdefs.jl:4
 in #586 at ./multi.jl:1455
 in run_work_thunk at ./multi.jl:1004
 in macro expansion at ./multi.jl:1455 [inlined]
 in #585 at ./event.jl:66misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(Core.Inference)))) $(Expr(:escape, :>)) $(Expr(:escape, :(Core.sizeof(Core))))
  StackOverflowError:
misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: summarysize(Base) > 10000 * sizeof(Int)
  StackOverflowError:

Test Summary:           |     Pass  Error  Broken     Total
  Overall               | 20030733      2      29  20030764

followed by:

osutils               |       21                       21
  ambiguous             |       42                       42
  channels              |       32                       32

Let me know where I should be looking.

tkelman commented 7 years ago

Looks like you only get the stack overflow then, not the Expression evaluated to non-Boolean from the ambiguity test. Does the stack overflow happen if you just run make test-misc on its own?

colbec commented 7 years ago

I have seen the non-Boolean error as well. Both make test-misc and make test-ambiguous run successfully without error when run by themselves.

tkelman commented 7 years ago

Ah, the ambiguity is reproducible when running math and ambiguous on the same worker, it was caused by https://github.com/JuliaLang/julia/pull/19627 and probably just needs a *(::MockQuantity, ::Bool) method to be defined. I can't reproduce the misc failure on ubuntu with make testall1 so we'll need to keep debugging that. What does Base.summarysize(Base) return for you from a normal REPL? If you add @show summarysize(Base) around line 189 of test/misc.jl, what value does it output for test-misc by itself vs a failing run?

colbec commented 7 years ago
julia> Base.summarysize(Base)
41035484

I inserted the @show statement after line with import Base.summarysize

> make test-misc
    JULIA test/misc
Test (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
summarysize(Base) = 41465530
misc (1)      |    8.42  |  0.86  | 10.3 | 1177.03    | 363.44  

Test Summary: |    Pass    Total
  Overall     | 1279454  1279454
    SUCCESS

Strangely, I can't see the string "summarysize" anywhere in the output from make testall which consistently fails with the same output.

tkelman commented 7 years ago

probably because the show statement is triggering the stack overflow too, but only if some subset of tests have run previously on the same worker. running JULIA_CPU_CORES=1 ./julia test/runtests.jl foo bar etc with the list of tests that ran previously on the same worker as misc should reproduce it too? And possibly allow some reducing of the test case to a smaller subset?

colbec commented 7 years ago

OK here is my progress: thinking that the error might have been triggered by something recent in the misc worker I ran

> JULIA_CPU_CORES=1 julia6 test/runtests.jl cmdlineargs enums markdown nullable grisu llvmcall repl show broadcast goto reflection mpfr pollfd resolve version read spawn math copy blas vecelement simdloop intfuncs random offsetarray staged iobuffer hashing dict triplequote char printf sparse/sparse

which, disappointingly, run successfully. I may be shooting myself in the foot by testing in roughly the reverse order of testall, I don't know. I'll keep expanding the list.

tkelman commented 7 years ago

it should include misc at the end to reproduce the stack overflow, and may be order dependent

colbec commented 7 years ago

Right, thanks for noticing the missing misc - rather important. Adding the full complement of tests which ran on the same worker and misc at the end triggers the failure. Order does not seem to be important since my tests ran in reverse order. I will now go back through previous tests and report back.

colbec commented 7 years ago

Ok this is getting complicated. > JULIA_CPU_CORES=1 julia6 test/runtests.jl math copy blas vecelement simdloop intfuncs random offsetarray staged iobuffer hashing dict triplequote char printf sparse/sparse linalg/matmul linalg/svd linalg/tridiag linalg/pinv linalg/givens linalg/arnoldi linalg/cholesky linalg/hessenberg misc produces the error but > JULIA_CPU_CORES=1 julia6 test/runtests.jl cmdlineargs enums markdown nullable grisu llvmcall repl show broadcast goto reflection mpfr pollfd resolve version read spawn math misc does not. So clearly I have to split the first (which failed), and I did this on the sparse test. Then ran both sides and both ran to success. Please correct my thinking, which is that maybe there are two or more tests, at least one from each of both sides of the sparse test, which in combination with misc triggers the failure.

tkelman commented 7 years ago

What I usually do with this kind of thing is (write a script to) remove one entry from the list at a time, if reproducing the failure happens without it, leave it off and try another. If there's no failure without one of the entries, mark it as required and don't try removing it again. Stop when removing any one more entry doesn't reproduce the failure any more.

colbec commented 7 years ago

Just in case it is meaningful, while running these tests there is a quantity of output which does not occur when running make testall, here is a sample (this appears to be repeated over and over and fills my terminal buffer scrolling other information out of view):

in summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:741
   in summarysize(::Array{Any,1}, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:768
   in _summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:752
   in summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.
colbec commented 7 years ago

There's something odd about triplequote. > JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/pinv linalg/matmul offsetarray linalg/tridiag linalg/givens linalg/arnoldi triplequote math linalg/cholesky sparse/sparse linalg/svd misc fails, but if I remove triplequote then it succeeds. Mean anything?

colbec commented 7 years ago

After an update to Version 0.6.0-dev.1701 (2016-12-26 06:25 UTC), make testall still fails but no longer mentions errors related to stack overflow.

parallel (5)              |   62.45  |  0.00  |  0.0 |  5.32      | 1579.21 
cmdlineargs (2)           |  102.13  |  0.10  |  0.1 | 53.52      | 3090.21 
    From worker 1:  Worker 3 failed running test ambiguous:
Some tests did not pass: 41 passed, 0 failed, 1 errored, 0 broken.
 in finish at ./test.jl:524
 in macro expansion; at ./util.jl:288 [inlined]
 in macro expansion; at /home/colin/Downloads/julia6/test/testdefs.jl:0 [inlined]
 in anonymous at ./<missing>:?
 in eval at ./boot.jl:236
 in runtests at /home/colin/Downloads/julia6/test/testdefs.jl:16
 in runtests at /home/colin/Downloads/julia6/test/testdefs.jl:4
 in #586 at ./multi.jl:1455
 in run_work_thunk at ./multi.jl:1004
 in macro expansion at ./multi.jl:1455 [inlined]
 in #585 at ./event.jl:66ambiguous: Error During Test
  Expression evaluated to non-Boolean
  Expression: ((x->begin 
        isempty(x) || println(x)
    end))(detect_ambiguities(Core,Base; imported=true))
       Value: nothing

Test Summary:           |     Pass  Error  Broken     Total
  Overall               | 20027342      1      29  20027372
...
  cmdlineargs           |      121                      121
  compile               |       61                       61
    FAILURE
Error in testset ambiguous:
Error During Test
  Expression evaluated to non-Boolean
  Expression: ((x->begin 
        isempty(x) || println(x)
    end))(detect_ambiguities(Core,Base; imported=true))
       Value: nothing
ERROR: LoadError: 
 in (::##33#41)() at /home/colin/Downloads/julia6/test/runtests.jl:195
...

FYI here is the new set of tests that produce the failure: > JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/qr linalg/schur linalg/special linalg/bunchkaufman linalg/bidiag linalg/lu linalg/lq sparse/sparsevector dates/rounding dates/types dates/io dates/arithmetic core arrayops copy math spawn read resolve broadcast lineedit replcompletions replutil sets goto grisu meta stacktraces profile docs markdown base64 serialize threads libdl int intset floatfuncs inline boundscheck error cartesian asmvariant osutils channels fft ambiguous

colbec commented 7 years ago

FWIW for me the above test set reduces to JULIA_CPU_CORES=1 julia6 test/runtests.jl math ambiguous which fails. Remove math and it runs successfully, make test-math is also successful.

tkelman commented 7 years ago

Ah, the ambiguity is reproducible when running math and ambiguous on the same worker, it was caused by #19627 and probably just needs a *(::MockQuantity, ::Bool) method to be defined.

That one should be simple to fix. The stack overflow is the more interesting part of this report. make testall won't be exactly deterministic in terms of which tests run on which worker, since it's timing-dependent. If you run JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/pinv linalg/matmul offsetarray linalg/tridiag linalg/givens linalg/arnoldi triplequote math linalg/cholesky sparse/sparse linalg/svd misc again on latest master does the stack overflow still happen? Can anything else on that list be removed and still trigger the stack overflow? I'm guessing maybe something is interfering with inference of an element type somewhere and resulting in an Array{Any,1} that's actually supposed to have a more specific element type, but I don't know what would cause that.

colbec commented 7 years ago

The command: JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/pinv linalg/matmul offsetarray linalg/tridiag linalg/givens linalg/arnoldi triplequote math linalg/cholesky sparse/sparse linalg/svd misc now runs successfully on v. 1701.

tkelman commented 7 years ago

What version did it last fail at? Did it happen reliably every time, or is it maybe non-deterministic?

colbec commented 7 years ago

Last seen in v 1698 as reported above. Failed every time that I could see, but of course the sample size is very small.

tkelman commented 7 years ago

So that was 4793e88, and there's only been one commit to master since then, changing gmp and curl versions.

nalimilan commented 7 years ago

FWIW, I still get the stack overflow with latest master. Thanks for investigating this @colbec!

colbec commented 7 years ago

@nalimilan Yes the latest master still fails with overflow when using the full test set. Here is my latest single core set which causes fail: > JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/matmul linalg/bunchkaufman linalg/bidiag linalg/symmetric linalg/lq sparse/sparsevector dates/ranges dates/arithmetic inference worlds keywordargs printf char triplequote dict hashing iobuffer staged offsetarray tuple random intfuncs simdloop blas copy math loading bigint spawn read mmap broadcast mod2pi show sets llvmcall grisu meta stacktraces profile docs markdown threads inline error cartesian asmvariant osutils channels dsp examples misc The voluminous output from interactiveutil.jl is inconvenient in testing since it scrolls off useful information regarding testing times. I am working on the idea that if the testing reports that x test took the most time then I target that one for next removal, which in theory should speed up the next test.

colbec commented 7 years ago

Here's another observation: if I select the linalg tests from the above set and run them followed by misc then it succeeds (this is consistently repeatable). The set of not(linalg/*) entries followed by misc also succeeds (this is consistently repeatable). Now taking the not(linalg/*) set and adding back linalg/matmul causes fail (repeatable).

Edit: but adding back linalg/bunchkaufman but not matmul sometimes succeeds, sometimes fails, about 50-50 from a small sample. Edit 2: when adding back a linalg test which subsequently fails, the testing process does not end with "FAILURE" message, but continues to flood output and stops at

ERROR: LoadError: 
 in (::##33#41)() at /home/colin/Downloads/julia6/test/runtests.jl:195
 in cd(::##33#41, ::String) at ./file.jl:69
 in include_from_node1(::String) at ./loading.jl:532
 in include(::String) at ./sysimg.jl:14
 in process_options(::Base.JLOptions) at ./client.jl:293
 in _start() at ./client.jl:357
while loading /home/colin/Downloads/julia6/test/runtests.jl, in expression starting on line 38

after the usual flood of interactiveutil.jl stuff.

colbec commented 7 years ago

Update: I am now on version 1781 after significant update overnight and make testall still fails. Excerpts:

    From worker 1:  Worker 1 failed running test broadcast:
UndefVarError: TestMain_broadcast not defined
 in deserialize(::Base.ClusterSerializer{TCPSocket}, ::Type{Module}) at ./serialize.jl:601
 in handle_deserialize(::Base.ClusterSerializer{TCPSocket}, ::Int32) at ./serialize.jl:580
... and repeats ... ...

then

 in #555 at ./event.jl:66misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(Core.Inference)))) $(Expr(:escape, :>)) $(Expr(:escape, :(Core.sizeof(Core))))
  StackOverflowError:
misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: summarysize(Base) > 10000 * sizeof(Int)
tkelman commented 7 years ago

the broadcast issue is unrelated and will be fixed by https://github.com/JuliaLang/julia/pull/19745

colbec commented 7 years ago

@nalimilan @tkelman make testall succeeds in v 0.6 1838 on openSUSE. Close issue now?

tkelman commented 7 years ago

I saw this a couple of times on win64 yesterday. It's intermittent, but I think it's still there.

colbec commented 7 years ago

Yes, it is, I just ran the same version again and it failed. Pity I did not get a copy of the output, it might have been useful for comparison. I'll try again.

colbec commented 7 years ago

I ran three more make testall with the same version of v 0.6, there were 2 fails and the last succeeded. FWIW here is the list of tests that ran on the same worker as misc on the successful run: linalg/triangular sparse/higherorderfns sparse/spqr strings/io unicode/utf8 dates/query dates/periods inference worlds numbers abstractarray operators ccall bigfloat sorting version resolve pollfd mpfr broadcast combinatorics rounding euler lineedit replcompletions repl goto llvmcall grisu nullable base64 misc inline boundscheck cartesian asmvariant osutils channels dsp I'm open to suggestions how this info might be useful in comparison with previous observed fails.

tkelman commented 7 years ago

the dispatch loop you posted in https://github.com/JuliaLang/julia/issues/19707#issuecomment-269121746 is the best lead so far. Maybe we should try to find what inputs are causing that loop?

maleadt commented 7 years ago

I'm also seeing this issue on a bot running testall1, but cannot reproduce with the sets of tests reported by @colbec in here. I'm running a bisect script now, lets hope it yields something useful over the weekend.

colbec commented 7 years ago

I too am having troubles repeating some of my own tests, including the short test that generated the loop mentioned by @tkelman . One that does fail pretty consistently including the loop is:

JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/matmul linalg/bunchkaufman linalg/bidiag linalg/symmetric linalg/lq sparse/sparsevector dates/ranges dates/arithmetic inference worlds keywordargs printf char triplequote dict hashing iobuffer staged offsetarray tuple random intfuncs simdloop blas copy math loading bigint spawn read mmap broadcast mod2pi show sets llvmcall grisu meta stacktraces profile docs markdown threads inline error cartesian asmvariant osutils channels dsp examples misc > miscfail.txt
WARNING: Method definition f265a(Any) in module TestMain_worlds at /home/colin/Downloads/julia6/test/worlds.jl:12 overwritten at /home/colin/Downloads/julia6/test/worlds.jl:17.
Warning: threaded loop executed in order
ERROR: LoadError: 
Stacktrace:
 [1] (::##33#41)() at /home/colin/Downloads/julia6/test/runtests.jl:195
 [2] cd(::##33#41, ::String) at ./file.jl:69
 [3] include_from_node1(::String) at ./loading.jl:532
 [4] include(::String) at ./sysimg.jl:14
 [5] process_options(::Base.JLOptions) at ./client.jl:308
 [6] _start() at ./client.jl:374
while loading /home/colin/Downloads/julia6/test/runtests.jl, in expression starting on line 38

I have saved the entire output of the fail to a file (size 12 MB) and can upload this if it is of use. It is extremely repetitive. One new detail I had not seen before is

...
dsp (1)                 |    6.38  |  0.55  |  8.6 | 302.97     | 1006.70 
examples (1)            |   11.88  |  0.51  |  4.3 | 672.37     | 1006.70 
misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(
  StackOverflowError:
  Stacktrace:
   [1] setindex!(::ObjectIdDict, ::Any, ::Any) at ./associative.jl:310    
   [2] _summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:745
   [3] summarysize(::Any, ::ObjectIdDict, ::Type{T}) at ./interactiveutil.jl:741
...

with the reference to associative.jl.

maleadt commented 7 years ago

Oh yeah that set reproduces it. I'll use it as a starting point for my bisect, let's hope I can shave some more tests of.

barche commented 7 years ago

I get this on Arch linux too, using the following Make.user:

USE_SYSTEM_LLVM:=1
USE_SYSTEM_LIBUNWIND:=1
USE_SYSTEM_PCRE:=1
USE_SYSTEM_LIBM:=1
USE_SYSTEM_OPENLIBM:=1
USE_SYSTEM_OPENSPECFUN:=1
USE_SYSTEM_DSFMT:=0
USE_SYSTEM_BLAS:=0
USE_SYSTEM_LAPACK:=0
USE_SYSTEM_FFTW:=1
USE_SYSTEM_GMP:=1
USE_SYSTEM_MPFR:=1
USE_SYSTEM_ARPACK:=1
USE_SYSTEM_SUITESPARSE:=1
USE_SYSTEM_LIBUV:=0
USE_SYSTEM_UTF8PROC:=0
USE_SYSTEM_LIBGIT2:=1
USE_SYSTEM_PATCHELF:=1
USE_INTEL_MKL:=0
USE_BLAS64:=0
USE_LLVM_SHLIB:=0
timholy commented 7 years ago

StackOverflowErrors are often annoying. If it's easier than trapping a difficult-to-catch error...I have noticed that many arise from constructs like this:

foo{T}(x::T, y::T) = # the real implementation of foo
foo(x, y) = foo(promote(x, y)...)

The problem arises because promote(x, y) has a default of returning (x, y) when they can't be promoted to a common type, leading that second line to call itself recursively. I've recently become a big fan of "defensive programming" like this:

foo(x, y) = _foo(promote(x, y)...) # note the underscore
_foo{T}(x::T, y::T) = foo(x, y)
_foo(x, y) = throw_promoteerror(x, y)

@noinline function throw_promoteerror(x, y)
    throw(ArgumentError("$x::$(typeof(x)) and $y::$(typeof(y)) cannot be promoted to a common type"))
end

I'd argue we should do this systematically throughout Base. This may be one of those cases where it would be faster to find the bug by changing any possibly-problematic lines to the more defensive form, since the error message will tell you a lot about what's going on.

maleadt commented 7 years ago

It seems to be nondeterministic even with 1 worker... The set below definitely does make it crash, but takes many attempts to reproduce (I'm testing on cbc6670dd612ca5893f1f64e5fc5ef88b5727587):

$ JULIA_CPU_CORES=1 ./julia test/runtests.jl linalg/matmul linalg/bidiag linalg/symmetric sparse/sparsevector inference worlds printf dict offsetarray random math spawn broadcast stacktraces profile markdown dsp misc
Test (Worker)           | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
linalg/matmul (1)       |   67.16  |  0.81  |  1.2 | 2154.20    | 293.20
linalg/bidiag (1)       |   37.77  |  0.73  |  1.9 | 1197.66    | 362.35
linalg/symmetric (1)    |   15.78  |  0.25  |  1.6 | 469.39     | 374.47
sparse/sparsevector (1) |   81.53  |  1.19  |  1.5 | 2017.42    | 486.43
inference (1)           |    1.16  |  0.01  |  1.3 | 32.51      | 490.12
WARNING: Method definition f265a(Any) in module TestMain_worlds at /tmp/julia/test/worlds.jl:12 overwritten at /tmp/julia/test/worlds.jl:17.
worlds (1)              |    0.91  |  0.02  |  1.7 | 19.14      | 495.35
printf (1)              |   15.55  |  0.32  |  2.1 | 248.24     | 497.46
dict (1)                |    9.33  |  0.23  |  2.5 | 343.33     | 512.73
offsetarray (1)         |   39.76  |  0.96  |  2.4 | 1079.01    | 668.10
random (1)              |    9.13  |  0.20  |  2.2 | 273.29     | 685.34
math (1)                |   21.47  |  0.53  |  2.5 | 374.49     | 690.46
       [stdio passthrough ok]
spawn (1)               |   19.43  |  0.19  |  1.0 | 206.89     | 709.43
broadcast (1)           |   21.40  |  1.18  |  5.5 | 683.05     | 717.90
stacktraces (1)         |    1.42  |  0.04  |  2.9 | 31.47      | 721.82
profile (1)             |    3.08  |  0.06  |  2.0 | 59.57      | 723.25
markdown (1)            |    3.63  |  0.08  |  2.3 | 87.71      | 724.30
dsp (1)                 |    7.04  |  0.28  |  4.0 | 313.35     | 728.03
misc: Error During Test
  Test threw an exception of type StackOverflowError

There also seem to be 2 slightly different errors:

misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: $(Expr(:escape, :(summarysize(Core)))) $(Expr(:escape, :>)) $(Expr(:escape, :(summarysize(Core.Inference)))) $(Expr(:escape, :>)) $(Expr(:escape, :(Core.sizeof(Core))))

and

misc: Error During Test
  Test threw an exception of type StackOverflowError
  Expression: summarysize(Base) > 10000 * sizeof(Int)

both when executing the misc test.

I guess I'll try and catch one with rr to see if I can gather some more details.

timholy commented 7 years ago

I played with this too. You were much more successful than I in trimming it down. (I think I was able to get rid of printf but I had many others, perhaps it's not a specific combination.) My one other observation is that it's coupled to the summarysize method for TypeMapEntry, and inserting

@show typeof(m)
flush(STDOUT)

sometimes results in

typeof(m) = ERROR...

so it seems to be some kind of object corruption.

colbec commented 7 years ago

Fools rush in: to try to force an early error I tried multiple testing misc in the following way (edit3 this is a red herring, removing the references to warnonce and testonce from misc leads to success, but the overall failure persists):

JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/matmul misc linalg/bunchkaufman misc linalg/bidiag linalg/symmetric linalg/lq misc

Edit: in fact to get this I just have to say JULIA_CPU_CORES=1 julia6 test/runtests.jl misc misc Edit2: JULIA_CPU_CORES=1 julia6 test/runtests.jl dsp dsp is successful, as is doubling linalg/matmul

I switch off line 5 in runtests.jl (which ensures that the tests array set is unique), but on second run of misc it is very unhappy and outputs:

Test (Worker)           | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB)
linalg/matmul (1)       |   59.26  |  0.68  |  1.1 | 2193.33    | 296.49  
OK to here
misc (1)                |    9.81  |  0.82  |  8.4 | 1219.46    | 406.40  
linalg/bunchkaufman (1) |   20.26  |  0.22  |  1.1 | 699.76     | 429.01  
WARNING: replacing module TestMain_misc
OK to here
misc: Test Failed
  Expression: contains(sprint((io->begin 
                Base.warn_once(io,"test")
            end)),"WARNING: test")
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:532
 [2] include(::String) at ./sysimg.jl:14
 [3] macro expansion; at /home/colin/Downloads/julia6/test/testdefs.jl:13 [inlined]
...
Test Summary:         |    Pass  Fail    Total
  Overall             | 6403639    12  6403651
  linalg/matmul       |     527            527
  misc                | 1279505        1279505
  linalg/bunchkaufman |    2219           2219
  misc                | 1279502     3  1279505
  linalg/bidiag       |    1232           1232
  misc                | 1279502     3  1279505
  linalg/symmetric    |    1328           1328
  misc                | 1279502     3  1279505
  linalg/lq           |     820            820
  misc                | 1279502     3  1279505
    FAILURE
colbec commented 7 years ago

In Version 0.6.0-dev.1997 I believe I see a consistent pattern of success/fail with the following set:

JULIA_CPU_CORES=1 julia6 test/runtests.jl linalg/matmul linalg/bunchkaufman linalg/bidiag linalg/symmetric linalg/lq sparse/sparsevector dates/ranges dates/arithmetic inference worlds keywordargs printf char triplequote dict hashing iobuffer staged offsetarray tuple random intfuncs simdloop blas copy math loading bigint spawn misc

This set fails, but if misc runs before spawn then it succeeds. I will try digging into spawn to see if I can isolate a section which will allow a succeed.

colbec commented 7 years ago

In my previous comment I noted an issue with the order of spawn and misc. Here is my testing process for examining spawn: 1. place an end statement at the end of the file and then place an if false statement above it in a convenient location to wrap the end of the file in an if block to ensure it will not be executed, and 2. slide the if false up and down to try to find a sensitive spot. As we know this is an intermittent error which maybe explains my bizarre results. At about line 80 in the spawn tests there is a section which begins @test success(truecmd); I can place the sliding if false before this line and the tests mostly pass. If I place the if false before the STDIN redirection a few lines later it mostly fails with the stackoverflow error. This is about as close as I can come so far.

I tried commenting out the @test_broken lines and it still failed. However I remain suspicious that there is maybe a problem in the first 80+ lines of this file after the cmd definitions. The individual commands all seem to pass in the REPL. Of course it could be something else entirely.

vtjnash commented 7 years ago

I wonder if it could be related to / fixed by https://github.com/JuliaLang/julia/pull/19590

nalimilan commented 7 years ago

@vtjnash I've just tried, and I got the same error.

nalimilan commented 7 years ago

Confirmed it's fixed on latest master.