JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.72k stars 5.48k forks source link

truncated backtraces #3469

Closed timholy closed 10 years ago

timholy commented 11 years ago

In https://github.com/timholy/Profile.jl/issues/19 I documented that sometimes libunwind returns truncated ("unrooted") backtraces. Now I believe I have traced this to inlining, and hence I wonder if it could be a Julia issue rather than a libunwind issue.

Using the same code as in the Profile.jl issue, I tried this:

julia> using Profile

julia> x = linspace(0,10,10^8);

julia> Base.inlineable(f, e::Expr, sv, enclosing_ast) = Base.NF
Warning: Method definition inlineable(Any,Expr,Any,Any) in module Base at inference.jl:1621 overwritten in module Main at none:1.
# methods for generic function inlineable
inlineable(f,e::Expr,sv,enclosing_ast) at none:1

julia> include("/tmp/unrooted.jl")

julia> doiceil(x)

julia> @sprofile doiceil(x)

julia> sprofile_tree(true)
260  ???; ???; line: 0
 34 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 0
 38 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 11
 32 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 19
 2  /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 38
 30 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 42
88   array.jl; getindex; line: 259
28   base.jl; convert; line: 9
32   float.jl; <; line: 162
34   float.jl; convert; line: 22
32   float.jl; itrunc; line: 67
108  int.jl; +; line: 36
49   int.jl; <=; line: 235
9301 julia; ???; line: 0
 9301 /lib/x86_64-linux-gnu/libc.so.6; __libc_start_main; line: 237
  9301 julia; ???; line: 0
   9301 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; julia_trampoline; line: 131
    9301 julia; ???; line: 0
     9301 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
      9301 ???; ???; line: 0
       9301 client.jl; _start; line: 341
        9301 client.jl; run_repl; line: 143
         9301 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
          9301 ???; ???; line: 0
           9301 client.jl; eval_user_input; line: 68
            9301 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
             9301 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
              9299 /home/tim/.julia/Profile/src/sprofile.jl; anonymous; line: 309
               9299 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                63   /tmp/unrooted.jl; doiceil; line: 11
                8647 /tmp/unrooted.jl; doiceil; line: 12
                 66   /tmp/unrooted.jl; unsafe_iceil; line: 4
                 73   /tmp/unrooted.jl; unsafe_iceil; line: 5
                 7721 /tmp/unrooted.jl; unsafe_iceil; line: 6
                  106  ???; ???; line: 0
                  39   float.jl; <; line: 162
                  7387 promotion.jl; +; line: 135
                   2518 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                    1740 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     1157 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     35   ???; ???; line: 0
                     35   int.jl; +; line: 36
                    45   ???; ???; line: 0
                   4267 promotion.jl; promote; line: 118
                    3301 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     1    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 335
                     9    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 389
                      9 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                       5 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     1237 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 533
                      1237 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                       84 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     3    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 829
                     10   /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 911
                     1039 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 1020
                      1039 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                       1  /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                       15 /lib/x86_64-linux-gnu/libc.so.6; ???; line: 0
                       4  /lib/x86_64-linux-gnu/libc.so.6; cfree; line: 32
                     27   /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 1029
                      27 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     42   /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 1040
                      42 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                       23 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     1    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_gc_collect; line: 1274
                      1 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     1    /lib/x86_64-linux-gnu/libc.so.6; __libc_malloc; line: 0
                     1    /lib/x86_64-linux-gnu/libc.so.6; __libc_malloc; line: 5
                     1    /lib/x86_64-linux-gnu/libc.so.6; __libc_malloc; line: 46
                     1    /lib/x86_64-linux-gnu/libc.so.6; __libc_malloc; line: 53
                     1    /lib/x86_64-linux-gnu/libc.so.6; __libc_malloc; line: 86
                     27   /lib/x86_64-linux-gnu/libc.so.6; __libc_malloc; line: 117
                      27 /lib/x86_64-linux-gnu/libc.so.6; ???; line: 0
                     7    /lib/x86_64-linux-gnu/libc.so.6; __libc_malloc; line: 137
                     1    /lib/x86_64-linux-gnu/libc.so.6; __libc_malloc; line: 208
                    123  /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 0
                    1    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 1
                    71   /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 11
                    2    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 12
                    71   /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 19
                    2    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 27
                    2    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 38
                    64   /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 42
                    3    /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_box_int64; line: 43
                    3    ???; ???; line: 0
                     3 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                  73   promotion.jl; promote; line: 118
                 229  ???; ???; line: 0
                 29   float.jl; convert; line: 22
                 35   float.jl; itrunc; line: 67
                 36   operators.jl; >; line: 15
                 34   promotion.jl; +; line: 135
                54   /tmp/unrooted.jl; unsafe_iceil; line: 4
                38   /tmp/unrooted.jl; unsafe_iceil; line: 6
                267  ???; ???; line: 0
                64   array.jl; getindex; line: 259
                35   base.jl; convert; line: 9
                97   int.jl; +; line: 36
                34   int.jl; <=; line: 235
              2    /home/tim/.julia/Profile/src/sprofile.jl; anonymous; line: 311
               2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                 2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                  2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                   2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                    2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                     2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                      2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                       2 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                        1 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                         1 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                          1 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                        1 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; jl_dlsym_e; line: 17
                         1 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; uv_dlsym; line: 25
                          1 /lib/x86_64-linux-gnu/libdl.so.2; dlerror; line: 160
                           1 /lib/x86_64-linux-gnu/libc.so.6; ???; line: 0
                            1 /lib/x86_64-linux-gnu/libc.so.6; ???; line: 0
                             1 /lib/x86_64-linux-gnu/libc.so.6; ???; line: 0
                              1 /lib/x86_64-linux-gnu/libc.so.6; ???; line: 0
                               1 /lib/x86_64-linux-gnu/libc.so.6; ???; line: 0
                                1 /lib/x86_64-linux-gnu/libc.so.6; ???; line: 0
84   operators.jl; >; line: 15
21   promotion.jl; promote; line: 118

The thing that caught my eye is that there are no longer any unrooted backtraces in unsafe_iceil.

JeffBezanson commented 11 years ago

Wow, you figured out how to disable inlining forcibly by overwriting my method. @timholy your deviousness knows no bounds!

JeffBezanson commented 11 years ago

More to the point, let me see if I fully understand the issue. Is it that inlining squashes the entire backtrace, as opposed to just the frame for the inlined function? Not sure how that might happen.

timholy commented 11 years ago

Is it that inlining squashes the entire backtrace, as opposed to just the frame for the inlined function? Not sure how that might happen.

Yes, that's it. Whatever is happening, it's super-weird. Sometimes you get a full backtrace, and sometimes you get a backtrace where the inlined function appears to be the "root" from which everything is called. The mechanism is that on the next iteration of the loop in rec_backtrace, unw_step returns 0, which is why the backtrace is truncated.

The really weird part is that you get both behaviors during a run of the function (with no intermediate recompiling). The pointer addresses for the truncated backtraces are within a few bytes of the addresses for the ones that yield full backtraces. I haven't done a careful analysis of whether there's any overlap in the exact addresses; if that would help figure this out, I'd be glad to do it.

Seems like it could easily be a libunwind bug, but the fact that this seems restricted to inlined functions made me think it's possible there's more going on.

timholy commented 11 years ago

Before getting to more substantive issues...

Wow, you figured out how to disable inlining forcibly by overwriting my method. @timholy your deviousness knows no bounds!

It doesn't reflect any deep understanding of inference.jl. Had you named that function might_be_a_toaster(), your secret would have stayed safe from me forever :-).

timholy commented 11 years ago

OK, so it was pretty obvious that analyzing the instruction pointers would be a good thing to do. I accumulated 80 runs of doiceil with x of length 10^8, resulting in 28,255 backtraces. There were only 2 lengths (3 and 19), of which 8,091 were of length 3. For those of length 3, there were only 6 unique third (meaning, the point of the code that was executing when the signal was triggered) pointers:

julia> sort(unique(pshort))
6-element Uint64 Array:
 0x00007f3788531d99
 0x00007f3788531d9e
 0x00007f3788531da2
 0x00007f3788531da5
 0x00007f3788531da8
 0x00007f3788531dab

For reference, I'll note the disassembly for unsafe_iceil is:

Filename: /tmp/unrooted.jl
Source line: 4
        push    RBP
        mov     RBP, RSP
Source line: 4
        cvttsd2si       RCX, XMM0
Source line: 5
        cvtsi2sd        XMM1, RCX
        ucomisd XMM0, XMM1
Source line: 6
        seta    AL
        movzx   EAX, AL
        add     RAX, RCX
        pop     RBP
        ret

For the full-length backtraces, there were 14 distinct instruction pointers:

julia> sort(unique(pfull))
14-element Uint64 Array:
 0x00007f3788531d28
 0x00007f3788531d2d
 0x00007f3788531d33
 0x00007f3788531d38
 0x00007f3788531d3d
 0x00007f3788531d40
 0x00007f3788531d44
 0x00007f3788531d48
 0x00007f3788531d4e
 0x00007f3788531d90   # this is probably where they start to get relevant?
 0x00007f3788531d91
 0x00007f3788531dac
 0x00007f378afc1bb0
 0x00007f378afc1bba

And the interesting part:

julia> intersect(pshort, pfull)
0-element Uint64 Array

With statistics of this size, this cannot be a coincidence.

I'm back to thinking this is a libunwind issue. I've only just started to take baby steps in learning to read the disassembly, and not yet able to make inferences about how the instruction pointers relate to particular points in the disassembly (e.g., I don't know how many bytes to expect in each line). Maybe the pattern is meaningful to someone else?

All this is x86_64.

JeffBezanson commented 11 years ago

Nice investigation! It seems consistent with the idea that the state is slightly different at different points in the function, some of which leave libunwind confused. You might be able to use cfunction to get the start address of the function, then maybe we can match these addresses to the disassembly. It seems the addresses from the long backtraces include both the smallest (0x...90) and the largest (0x...ac), while the addresses from the short traces are in between.

timholy commented 11 years ago

Thanks for the tip re cfunction. Fortunately I still have that session up and running:

julia> cfunction(unsafe_iceil, Int, (Float64,))
Ptr{Void} @0x00007f3788531d90

So the first "interesting" one from the long backtrace seems to be the entrypoint.

timholy commented 11 years ago

Blake has just done some really nice work revealing that these truncated backtraces are not Julia-specific. A Julia-specific problem, however, seems to be segfaults on OSX. https://github.com/timholy/Profile.jl/issues/14#issuecomment-19818652

So now that begs the question, should this be closed? There's useful data here for debugging a problem, and there's an even more serious issue, but it's not obviously relevant to the original issue.

pao commented 11 years ago

I take it that the conclusion is that this is a libunwind problem? Marking "upstream"; their tracker is http://savannah.nongnu.org/bugs/?group=libunwind

timholy commented 11 years ago

As testing has progressed, it's not entirely clear what the situation is; it may not be purely upstream.

timholy commented 11 years ago

I'm in the process of filing a bug report to libunwind. I'm guessing that LLVM must be running with options in much the same way that we pass a bunch of options to gcc when we compile Julia. I've found references to LLVM_CONFIG in src/, and I've also found a binary llvm-config in usr/bin, but I don't know how to find out what settings Julia is using to compile code. (I figure libunwind developers may want to know that information.)

ViralBShah commented 11 years ago

LLVM has its own build time options and also options that are set at runtime for the generated code.

JeffBezanson commented 11 years ago

We set some options in jl_init_codegen at the bottom of codegen.cpp. But, it's likely the libunwind folks will want to see a C-only test case.

vtjnash commented 11 years ago

You could try setting options.JITExceptionHandling = 1; in codegen.cpp and see if that helps on linux. On Apple, without that line, unw_step would return 0 for every JIT frame, although it would still continue to step upwards to the next frame if that return value was ignored.

JeffBezanson commented 11 years ago

We might want to try -fasynchronous-unwind-tables. libunwind also has a function _U_dyn_register for registering unwind info for dynamically generated code.

timholy commented 11 years ago

I'm very grateful for the help here, folks.

@blakejohnson has put together a Julia-free testcase (C or C+fortran) here: https://github.com/timholy/Profile.jl/issues/14. The problem is that on OSX it seems to only happen for OpenBLAS calls, and on Linux (gcc) I can't yet get it to happen at all even when I force C inlining.

Thanks for the tips re the compile options. options.JITExceptionHandling = 1; did not fix the problem. (@vtjnash, I too have been wondering about seeing what happens if we skip over return values of 0, but haven't gone there yet.)

I added -fasynchronous-unwind-tables to SHIPFLAGS in Make.inc, and with a make -n verified that, after a make clean, files in src/ were getting compiled with it. No luck there. I didn't expect this to help the inlining problem, since in my testcase it only happens to JIT-compiled code, but I was curious whether some of the unresolvable lines (that get printed as julia; ???; line: 0 for example) would have been fixed. No luck there, either.

I had also already checked whether eliminating -momit-leaf-frame-pointer from that Make.inc solved the problem; that didn't, either.

To me _U_dyn_register seems rather promising. http://www.nongnu.org/libunwind/man/libunwind(3).html#section_6 seems to imply it's necessary, although clearly we've been getting away without it for most purposes. I've also found several references suggesting that missing unwind info can lead to truncated backtraces.

From what I can tell, _U_dyn_register needs to be called after compilation of every function. I was guessing this should go into codegen:emit_function or codegen:to_function? I've also found that deps/libunwind-1.1/tests/Gtest-dyn1.c contains an example of how to use it. All this involves stuff that's somewhat above my head right now, but maybe not so far that I can't stretch.

timholy commented 11 years ago

The understanding we've gained so far has also allowed me to finally generate a small test case that isolates a related problem I've seen before:

nest1(x::Vector) = x[1] + x[2]

function nest2{T}(x::Vector{T}, n)
    s = zero(T)
    for i = 1:n
        s += nest1(x)
    end
    s
end

function nest3(x, y, n, usefirst::Bool)
    if usefirst
        return nest2(x, n)
    else
        return nest2(y, n)
    end
end

x = randn(5)
y = [0.8]

Now do this:

julia> nest3(x, y, 10, true)
0.6220158265705533

julia> nest3(x, y, 10, false)
ERROR: BoundsError()
 in nest2 at /tmp/testbt.jl:6

Note that this backtrace is truncated; there is no information about who called nest2. This demonstrates that this issue is not restricted to Profile, nor is it a consequence of using signals.

vtjnash commented 11 years ago

For the record, I get the full backtrace on Mac64, but not on Linux64 when using your example code.

blakejohnson commented 11 years ago

_U_dyn_register also caught my attention while reading the libunwind docs. Unfortunately, it is not implemented in Apple's libunwind: http://www.opensource.apple.com/source/libunwind/libunwind-35.1/include/libunwind.h

So, it might only be relevant on Linux.

timholy commented 11 years ago

One more observation: I started thinking that the best short-term solution might be to see what happens if we ignore the return value of unw_step. The reasoning is that certain routines in OpenBLAS also seem to generate truncated backtraces, so even if we fix the Julia inlining issue we're still left with the OpenBLAS one. CC @xianyi.

However, testing on this "nest" example showed that this strategy doesn't work; I get the message that the backtrace repeats 79998 times. This indicates that unw_step doesn't advance the ip any further upon successive calls.

So it seems that to fix this we may need to generate the missing unwind information.

timholy commented 11 years ago

Update: while I've (slowly) been teaching myself a bit about LLVM and related topics in order to try to address this issue, the behavior has changed:

In addition, it's worth showing the latest SProfile results on doiceil, because there are a few interesting/odd results:

1   /home/tim/tmp/backtrace_cg.jl; unsafe_iceil; line: 4
446 julia; ???; line: 0
 446 /lib/x86_64-linux-gnu/libc.so.6; __libc_start_main; line: 237
  446 julia; ???; line: 0
   446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; julia_trampoline; line: 215
    446 julia; ???; line: 0
     446 julia; ???; line: 0
      446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
       446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
        446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
         446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
          446 ???; ???; line: 0
           446 client.jl; _start; line: 370
            446 client.jl; run_repl; line: 166
             446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
              446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
               446 ???; ???; line: 0
                446 client.jl; eval_user_input; line: 91
                 446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
                  446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; jl_toplevel_eval; line: 44
                   446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
                    446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
                     446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
                      446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
                       446 /home/tim/.julia/Profile/src/sprofile.jl; anonymous; line: 309
                        446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
                         446 /home/tim/src/julia/usr/bin/../lib/libjulia-debug.so; ???; line: 0
                          446 ???; ???; line: 0
                           45  /home/tim/tmp/backtrace_cg.jl; doiceil; line: 9
                           317 /home/tim/tmp/backtrace_cg.jl; doiceil; line: 10
                            31 /home/tim/tmp/backtrace_cg.jl; unsafe_iceil; line: 2
                            62 /home/tim/tmp/backtrace_cg.jl; unsafe_iceil; line: 3
                            86 /home/tim/tmp/backtrace_cg.jl; unsafe_iceil; line: 4
                           43  /home/tim/tmp/backtrace_cg.jl; unsafe_iceil; line: 2
                           11  /home/tim/tmp/backtrace_cg.jl; unsafe_iceil; line: 4
                           30  ???; ???; line: 0

The oddity: unsafe_iceil now appears at two levels of the hierarchy (note the indentation). The good news: there is only one truncated backtrace, about 0.2% of the total backtraces. Likewise, profiling nest3 reveals approximately the same fraction of backtraces truncated. As a consequence, this problem no longer negatively affects the quality of SProfile's reports in any substantive way (of course, it's possible this could change if unsafe_iceil can be inlined again).

I've skimmed recent changes to codegen, but there have been a lot of them (thanks, Jeff!) and I don't have any good guesses about what caused these changes. If it would be useful to bisect any of these changed behaviors (particularly the loss of inlining unsafe_iceil), I can do that.

CC @blakejohnson

JeffBezanson commented 11 years ago

Are you sure unsafe_iceil was ever inlined?

timholy commented 11 years ago

It's a fair question; at the time I wasn't in the habit of looking at the disassembly because I'd never learned how to read it. So I went back and checked out f724a02d9a6beb7ba9a043f95cba33bee75ddfb6 from June 20th (the date I filed this) and rebuilt.

Here's what I suspect is the crucial part of the disassembly of doiceil:

idxend:                                           ; preds = %if
  %15 = load %jl_value_t** %9, !dbg !4778
  %16 = bitcast %jl_value_t* %15 to double*, !dbg !4778
  %17 = getelementptr double* %16, i64 %10, !dbg !4778
  %18 = load double* %17, !dbg !4778
  %19 = call i64 @julia_unsafe_iceil(double %18), !dbg !4778
  %20 = add i64 %"#s2.03", 1, !dbg !4778
  %21 = icmp sgt i64 %20, %7, !dbg !4777
  br i1 %21, label %L2, label %if, !dbg !4777

So, looks like no. Yet I was able to reproduce the truncated backtraces just fine. So, turning off inlining got rid of them, but in fact this function wasn't inlined.

When I updated to current (d3be2558dbf5666aedff51d050570edce651c3d0), now I'm not seeing the "new" behavior, I'm seeing the original behavior as reported in this issue. In other words, there's still something to fix.

I swear I'm not imagining these things, but this is really weird. I wonder if it might depend on what had been run previously within the same REPL session?

timholy commented 11 years ago

I've made a bit more progress (and can't really invest a lot more in this for today, so reporting in...).

By reading the libunwind source, I discovered that one can turn on debugging output. Below I've indicated the build & code changes needed to turn this on, how I ran it, and the results for one truncated backtrace and one full backtrace, both starting from within unsafe_iceil. The ## is annotation added by me, from running sprofile_lookup (which ccalls jl_lookup_code_address).

Bottom line: it looks like none of the Julia-compiled code has DWARF information, so libunwind relies on "frame-chain". Sometimes this works and sometimes it doesn't.

In searches I've found precious little about the combination llvm + dwarf + jit online; this link is perhaps the most interesting one.

Build & code changes:
--- a/deps/Makefile
+++ b/deps/Makefile
@@ -1213,7 +1213,7 @@ libunwind-$(UNWIND_VER)/configure: libunwind-$(UNWIND_VER).tar.gz
        touch -c $@
 libunwind-$(UNWIND_VER)/config.status: libunwind-$(UNWIND_VER)/configure
        cd libunwind-$(UNWIND_VER) && \
-       ./configure $(CONFIGURE_COMMON) CFLAGS="$(LIBUNWIND_CFLAGS)" --disable-shared --disable-minidebuginfo
+       ./configure $(CONFIGURE_COMMON) CFLAGS="$(LIBUNWIND_CFLAGS)" --disable-shared --disable-minidebuginfo --enable-debug --enable-debug-frame
        touch -c $@
 $(LIBUNWIND_TARGET_SOURCE): libunwind-$(UNWIND_VER)/config.status
        $(MAKE) -C libunwind-$(UNWIND_VER)
diff --git a/src/task.c b/src/task.c
index 9803d9f..660fdcf 100644
--- a/src/task.c
+++ b/src/task.c
@@ -532,6 +532,7 @@ DLLEXPORT size_t rec_backtrace(ptrint_t *data, size_t maxsize)
         }
         data[n++] = ip;
     }
+    jl_printf(JL_STDOUT, "%d unwind steps\n", n);
     return n;
 }
 #endif

Delete the libunwind-1.1 directory, and delete libjulia* and libunwind* from $(JULIADIR)/usr/lib
make

Ran julia as:
UNW_DEBUG_LEVEL=13 julia

Note:
julia> cfunction(unsafe_iceil, Int, (Float64,))
Ptr{Void} @0x00007fcd2f155010

julia> cfunction(doiceil, Nothing, (Vector{Float64},))
Ptr{Void} @0x00007fcd2f154f80

 >_ULx86_64_init_local: (cursor=0x7ffff87cc480)
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd3080c884, cfa=0x00007ffff87cc480)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd3080c884 cfa=0x7ffff87cc480 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2c91598a, cfa=0x00007ffff87ccc70)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd2c91598a cfa=0x7ffff87ccc70 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd303a84a0, cfa=0x00007ffff87ccc80)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd303a84a0 cfa=0x7ffff87ccc80 format=1 addr=0x7ffff87ccc80 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f155025, cfa=0x00007ffff87cd0d0)   ## ("unsafe_iceil","none",4)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cccf8] = 0x7ffff87cd0d0 (cfa = 0x7ffff87cd0d0) -> 0x7ffff87cd100
  >_ULx86_64_step: NULL %rbp loc, returning 0
3 unwind steps

 >_ULx86_64_init_local: (cursor=0x7ffff87cc480)
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd3080c884, cfa=0x00007ffff87cc480)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd3080c884 cfa=0x7ffff87cc480 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2c91598a, cfa=0x00007ffff87ccc70)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd2c91598a cfa=0x7ffff87ccc70 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd303a84a0, cfa=0x00007ffff87ccc80)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd303a84a0 cfa=0x7ffff87ccc80 format=1 addr=0x7ffff87ccc80 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f155010, cfa=0x00007ffff87cd0d8)   ## ("unsafe_iceil","none",2)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cccf8] = 0x7ffff87cd100 (cfa = 0x7ffff87cd0d8) -> 0x7ffff87cd110
 >_ULx86_64_step: Frame Chain [RIP=0xf87cd108] = 0x2f154f63
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f154f63, cfa=0x00007ffff87cd0e8)   ## ("???","???",0)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cd100] = 0x7ffff87cd110 (cfa = 0x7ffff87cd0e8) -> 0x7ffff87cd190
 >_ULx86_64_step: Frame Chain [RIP=0xf87cd118] = 0x307d4b8e
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd307d4b8e, cfa=0x00007ffff87cd0f8)   ## ("???","/home/tim/src/julia/usr/bin/../lib/libjulia-release.so",0)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd307d4b8e cfa=0x7ffff87cd0f8 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f16b0c5, cfa=0x00007ffff87cd1a0)   ## ("anonymous","/home/tim/.julia/Profile/src/sprofile.jl",309)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cd190] = 0x7ffff87cd340 (cfa = 0x7ffff87cd1a0) -> 0x7ffff87cd3d0
 >_ULx86_64_step: Frame Chain [RIP=0xf87cd348] = 0x30812de1
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd30812de1, cfa=0x00007ffff87cd1b0)   ## ("???","/home/tim/src/julia/usr/bin/../lib/libjulia-release.so",0)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd30812de1 cfa=0x7ffff87cd1b0 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd307d92b4, cfa=0x00007ffff87cd3e0)   ## ("???","/home/tim/src/julia/usr/bin/../lib/libjulia-release.so",0)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd307d92b4 cfa=0x7ffff87cd3e0 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f150446, cfa=0x00007ffff87cd500)   ## ("eval_user_input","client.jl",91)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cd4f0] = 0x7ffff87cd850 (cfa = 0x7ffff87cd500) -> 0x7ffff87cd860
 >_ULx86_64_step: Frame Chain [RIP=0xf87cd858] = 0x2f15008b
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f15008b, cfa=0x00007ffff87cd510)   ## ("???","???",0)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cd850] = 0x7ffff87cd860 (cfa = 0x7ffff87cd510) -> 0x7ffff87cd8e0
 >_ULx86_64_step: Frame Chain [RIP=0xf87cd868] = 0x307d4b8e
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd307d4b8e, cfa=0x00007ffff87cd520)   ## ("???","/home/tim/src/julia/usr/bin/../lib/libjulia-release.so",0)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd307d4b8e cfa=0x7ffff87cd520 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f109f7d, cfa=0x00007ffff87cd8f0)   ## ("run_repl","client.jl",166)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cd8e0] = 0x7ffff87cd970 (cfa = 0x7ffff87cd8f0) -> 0x7ffff87cdc00
 >_ULx86_64_step: Frame Chain [RIP=0xf87cd978] = 0x2f1099df
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f1099df, cfa=0x00007ffff87cd900)   ## ("_start","client.jl",370)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cd970] = 0x7ffff87cdc00 (cfa = 0x7ffff87cd900) -> 0x7ffff87cdc10
 >_ULx86_64_step: Frame Chain [RIP=0xf87cdc08] = 0x2f109260
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd2f109260, cfa=0x00007ffff87cd910)   ## ("???","???",0)
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
 >_ULx86_64_step: [RBP=0x7ffff87cdc00] = 0x7ffff87cdc10 (cfa = 0x7ffff87cd910) -> 0x7ffff87cdc90
 >_ULx86_64_step: Frame Chain [RIP=0xf87cdc18] = 0x307d4b8e
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd307d4b8e, cfa=0x00007ffff87cd920)   ## ("???","/home/tim/src/julia/usr/bin/../lib/libjulia-release.so",0)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd307d4b8e cfa=0x7ffff87cd920 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x0000000000403eda, cfa=0x00007ffff87cdca0)   ## ("???","julia",0)
     >_ULx86_64_reuse_frame: reuse frame ip=0x403eda cfa=0x7ffff87cdca0 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd3080b833, cfa=0x00007ffff87cddc0)   ## ("julia_trampoline","/home/tim/src/julia/usr/bin/../lib/libjulia-release.so",131)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd3080b833 cfa=0x7ffff87cddc0 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x0000000000403a5d, cfa=0x00007ffff87cddf0)   ## ("???","julia",0)
     >_ULx86_64_reuse_frame: reuse frame ip=0x403a5d cfa=0x7ffff87cddf0 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x00007fcd3039376d, cfa=0x00007ffff87cde10)   ## ("__libc_start_main","/lib/x86_64-linux-gnu/libc.so.6",237)
     >_ULx86_64_reuse_frame: reuse frame ip=0x7fcd3039376d cfa=0x7ffff87cde10 format=0 addr=0x0 offset=+0
  >_ULx86_64_step: returning 1
 >_ULx86_64_step: (cursor=0x7ffff87cc480, ip=0x0000000000403a99, cfa=0x00007ffff87cded0)   ## ("???","julia",0)
    >locate_debug_info: checking 0x3c671b0: 400000-631fe8
 >_ULx86_64_dwarf_search_unwind_table: e->fde_offset = 1380, segbase = 423b88, debug_frame_base = 0, fde_addr = 424f08
            >_ULx86_64_dwarf_extract_proc_info_from_fde: FDE @ 0x424f08
             >_ULx86_64_step: dwarf_step() failed (ret=-10), trying frame-chain
  >_ULx86_64_step: NULL %rbp loc, returning 0
20 unwind steps
StefanKarpinski commented 11 years ago

Great detective work, Tim. That explains why our backtraces are so generally flaky. The "old" LLVM JIT infrastructure does not emit DWARF info, whereas I believe that the new MCJIT does support that. So maybe this is one of the many things that would be fixed by making the switch. Big undertaking though. The sad thing about this is that the stack frame, if they were following the x86_64 spec, is a super simple linked list. Even when you ask for this spec to be followed, however, it isn't.

timholy commented 11 years ago

I did find an older version of that page (dating to LLVM 2.7) that seems to suggest that there was some way to get DWARF information even in earlier versions (I'm guessing this must be something they've been able to do for a while, since my limited understanding is that this should be used anytime you run a debugger). In fact, in 2.7 there seemed to have been a RegisterFunction LLVM function, but that doesn't seem to be available in the latest. Instead there's JITRegistrar, but that seems to want an object file or something (I haven't really wrapped my head around it yet).

Even presuming LLVM can emit this information, there's also the question of getting libunwind to use it.

timholy commented 11 years ago

I've read a bit more. I'm now guessing that the "right" way to handle all of these issues will be to switch to LLDB, which in addition to solving this immediate issue might also enable a whole host of other useful things (like a full debugger).

JIT support seems like it's on the road map, but not ready yet: http://blog.llvm.org/2013/06/lldb-33-and-beyond.html. Unless someone knows differently, I'm therefore planning on putting this issue on ice for the time being and see how that support evolves.

ablewhiskey commented 10 years ago

I'm running into inconsistent backtrace behavior and think it may be related to this issue. I see truncated backtraces when inducing BoundsErrors, but the extent of the truncation seems to vary depending on the array dimensionality and it seems like block depth?

I'm running on Version 0.3.0-prerelease+1564 Commit 7ddb311 on Mavericks. Is the following caused by the issue @timholy found, and therefore the expected behavior for now?

Out-of-bounds index into a 3-d array:

array_3d = zeros(Int32, (2, 2, 2))
function inner()
    array_3d[3, 3, 3]
end
function outer()
    inner()
end
outer()

produces:

ERROR: BoundsError()
 in getindex at array.jl:279
while loading /home/alex/test.jl, in expression starting on line 13

Here I lose both inner() and outer() in the backtrace.

If I do this instead with a 1-d array:

array_1d = [1,2]
function inner()
    array_1d[3]
end
function outer()
    inner()
end
outer()

produces:

ERROR: BoundsError()
 in outer at /home/alex/test2.jl:8
 in include_from_node1 at loading.jl:120
while loading /home/alex/test2.jl, in expression starting on line 10

Here I lose inner() but get outer() in the backtrace.

Finally, if I wrap the array index inside a for loop, I get the full backtrace:

array_1d = [1,2]
function inner()
    for i in 1:1000
        array_1d[3]
    end 
end
function outer()
    inner()
end
outer()

produces:

ERROR: BoundsError()
 in inner at /home/alex/test3.jl:6
 in outer at /home/alex/test3.jl:10
 in include_from_node1 at loading.jl:120
while loading /home/alex/test3.jl, in expression starting on line 12

o_O

blakejohnson commented 10 years ago

@ablewhiskey I think you are just observing different inlining behavior in your various examples. Certainly, some metadata about the line number where the error occurs is getting lost, but each example is displaying an appropriate parent method.

ihnorton commented 10 years ago

possibly useful, work-around another JIT user posted on llvm bug tracker: http://llvm.org/bugs/show_bug.cgi?id=13171

timholy commented 10 years ago

Very interesting find, Isaiah.

I noticed the date is 2012, so it seems a little unlikely this will be fixed in LLVM proper. I presume the best plan for us now is to migrate to MCJIT.

kmod commented 10 years ago

Hi all, I was having a similar-looking issue in my project, and using the debugging steps from this thread (thanks!) I managed to isolate it to what I think is a bug in libunwind. I can't tell if it's the same issue that you guys are having, but what I was running into is that libunwind will judge stack frames with no stack variables (ie rbp=rsp) as "invalid" and terminate the unwind. Here's a simple patch that changes the behavior: https://github.com/dropbox/pyston/blob/master/libunwind_patches/0001-Change-the-RBP-validation-heuristic-to-allow-size-0-.patch

Again, it works for me and it seems reasonable, but I haven't heard anything back from the libunwind list, and I don't know if it's the problem you're having so YMMV.

Also I'm running on MCJIT and, at least out of the box, I don't think it has any automatic libunwind integration. I think it should be fairly easy to add, though, since MCJIT will already emit the relevant DWARF info and it's should theoretically just be a matter of finding the right address and calling _U_dyn_register on it; I haven't gotten around to it, though, since (other than the rbp=rsp issue) I haven't seen any issues with the rbp-based unwinding.

timholy commented 10 years ago

Today is Christmas. I haven't tested extensively, but cursory examination suggests you've fixed my personal most-hated-bug-ever. Can't thank you enough.

To the rest: I'm not exactly sure what we need to do to apply a patch in one of the libraries, but we should do this!

tkelman commented 10 years ago

@timholy Should be as simple as this https://github.com/tkelman/julia/compare/patch-libunwind I took out the new header define, since all the other patches in deps get applied before calling configure, so libunwind.h doesn't exist yet (libunwind.h.in does, we could apply the patch there if we needed to keep that part).

@kmod Thanks so much for the heads up here! Pyston is a really interesting project that a lot of us are keeping an eye on.

timholy commented 10 years ago

@tkelman, thanks! Have you submitted this as a pull request? Definitely eager to see this get merged.

tkelman commented 10 years ago

Not yet but I could. Or let @kmod do the honors if he wants, since he wrote the patch. It's really only a 1-character change, but I'd still want to be careful about licenses.

Pyston appears to be Apache License v2.0, which is GPLv3 compatible but not GPLv2? Are any of the Julia dependencies GPLv2 only?

blakejohnson commented 10 years ago

Is this relevant to our libosxunwind as well?

tkelman commented 10 years ago

Bump. IANAL, and I don't want to get Julia in trouble either. I believe the portable Git we distribute with Julia Windows binaries is strictly GPL v2. @kmod can you clarify whether patches in the Pyston repository are under Apache License v2, or if you would be willing to submit that libunwind patch to Julia under the MIT license? Libunwind is also MIT licensed, so presumably they would like to know the same - unless there's some trivial patch rule where this stuff doesn't matter.

timholy commented 10 years ago

You surely saw this, but just in case: https://github.com/dropbox/pyston/blob/master/LICENSE

This patch lives in that repository, and there is no mention of separate licenses applying to separate parts.

timholy commented 10 years ago

Oh, scrolled up, and you had seen that already.

tkelman commented 10 years ago

Right, and from: https://www.gnu.org/licenses/license-list.html#apache2

Apache License, Version 2.0 (#apache2) This is a free software license, compatible with version 3 of the GNU GPL.

Please note that this license is not compatible with GPL version 2, because it has some requirements that are not in that GPL version. These include certain patent termination and indemnification provisions.

simonster commented 10 years ago

The patch is a single character deletion. Is that really copyrightable?

ihnorton commented 10 years ago

Also, we don't link against git, so the point (there at least) is moot.

tkelman commented 10 years ago

Recent legal precedent has been redefining what is copyrightable, throwing common sense out the window. I don't see why Dropbox would have any reason to take Julia to court, but there are other entities that could in the future have both the motivation and the resources to do so. How many Matlab APIs might Julia, SciPy, etc now be in hypothetical legal trouble for imitating? That question is unrelated to this one-character patch, but hopefully you can see where my unnecessary paranoia is coming from. Rather ask permission here...

JeffBezanson commented 10 years ago

It would be great to get this patch integrated. Would be best to get its author's blessing that it can be licensed as MIT.

kmod commented 10 years ago

Hi, I hereby place the patch in the public domain for you (or anyone else) to use as you please. The same disclaimer as in the Pyston (Apache 2.0) license applies, that it's released without warranties either expressed or implied.

I'll update the LICENSE to say so; also I submitted the patch for inclusion in libunwind so ideally it would actually end up getting licensed under their license, but I'm not sure the project is still actively being developed so who knows.

tkelman commented 10 years ago

Fantastic, thank you very much Kevin. Will submit a Julia PR momentarily.

Regarding the maintenance of libunwind, Arun Sharma did respond recently on their mailing list: http://lists.nongnu.org/archive/html/libunwind-devel/2014-05/msg00001.html

timholy commented 10 years ago

Thanks a bunch, Kevin!

timholy commented 10 years ago

Fixed by b5512915d73328788020be659b49f93fa3afd212

JeffBezanson commented 10 years ago

Amazing to have this fixed!