Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

LLVM 6.0 compile time regression with large amounts of debuginfo #35390

Closed Quuxplusone closed 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR36417
Status RESOLVED FIXED
Importance P enhancement
Reported by Alex Crichton (alex@crichton.co)
Reported on 2018-02-16 23:01:09 -0800
Last modified on 2018-03-06 08:46:06 -0800
Version trunk
Hardware PC All
CC aprantl@apple.com, bjorn.a.pettersson@ericsson.com, ditaliano@apple.com, hans@chromium.org, karl-johan.karlsson@ericsson.com, llvm-bugs@lists.llvm.org, paul_robinson@playstation.sony.com, rnk@google.com
Fixed by commit(s)
Attachments test.bc.gz (749945 bytes, application/x-gzip)
another.bc (452092 bytes, application/octet-stream)
bugpoint-reduced-simplified.ll (95772 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 19883
slow bytecode

We've recently upgraded to LLVM 6.0 (or at least what's close to the release_60
branch of LLVM) in Rust, but unfortunately we're seeing a regression in compile
time now!

First reported to us [1] it looks like the compile time to emit debug
information has increased dramatically between LLVM 4.0 and LLVM 6.0. Doing
some testing locally I've managed to extract a test case for LLVM I think. The
attached IR completes codegen in ~2s on LLVM 4.0 but takes minutes on LLVM 6.0

I ran a bisection based on the results of the command:

    llc test.bc -filetype=obj -O2 -relocation-model=pic

and this bisection pointed at https://reviews.llvm.org/D38229. I've cc'd those
involved with that patch here, would y'all mind helping take a look into this
regression? Also this may also just be rustc not producing debug information
that LLVM is expecting, so if something's off about that please just let us
know!

[1]: https://github.com/rust-lang/rust/issues/48226
Quuxplusone commented 6 years ago

Attached test.bc.gz (749945 bytes, application/x-gzip): slow bytecode

Quuxplusone commented 6 years ago

I'm not sure we'll actually end up blocking 6.0 on this, but let's keep an eye on it.

Quuxplusone commented 6 years ago

Do you have the original Rust code? I have tried to understands from the debug information what the Rust code looks like. A large amount of the debug values in the attached reproducer seems to be related to the handling of the array named 'x' in the original code. It seems to be an array of a struct (&str, usize) with 77 entries originating from line 1 of the file x11-dl-2.17.2/src/lib.rs

Quuxplusone commented 6 years ago

Attached another.bc (452092 bytes, application/octet-stream): another slow file

Quuxplusone commented 6 years ago
Sure yeah! The original source for this is this crate

* https://crates.io/crates/x11-dl/2.17.2

with the source code hosted roughly at

* https://github.com/Daggerbot/x11-
rs/tree/d66b29f2c898aa9d26c6bd13fa067b8656a8c6e5/x11-dl/src

The Rust compiler, however, automatically splits up crates into object files
(and also deals with monomorphization/inlining generics/etc), and I think the
slowdown is actually a pattern not directly present in this crate but rather
the standard library.

I've uploaded another bytecode file that the Rust compiler is producing, and
this is the unit of compilation, at least in rustc, which seems to be causing
the major slowdown. (I basically told rustc to make as granular codegen units
as possible and this is the one that took the longest).

Most of the symbols here are related to std::sync::Once, defined here --

* https://github.com/rust-
lang/rust/blob/27a046e9338fb0455c33b13e8fe28da78212dedc/src/libstd/sync/once.rs#L220

Unfortunately I wasn't able to create a standalone test case using just this
which exhibited the same slowdown :(

Does that help though? Or would it be more useful to have a minimal example
from Rust source code as well?
Quuxplusone commented 6 years ago
The Rust source you pointed me at don't contain any local variable named 'x' in
line 1 of the file x11-dl-2.17.2/src/lib.rs
I guess that the debug info in test.bc point out the wrong location in the Rust
source code.

To me it feels like something is wrong with the debug values generated for the
code when using -relocation-model=pic. At least I don't understands what it is
doing. I need more time to figure out what's going on here.

I can't use the second attachment ("another slow file") when trying to
disassemble the bytecode (with llvm-dis) I get a lot of errors and it end with
a "LLVM ERROR: Broken module found, compilation aborted!"
Quuxplusone commented 6 years ago

This is my view of the problem:

The patch D38229 (a change in the pass Live Debug Variables) will on the input given in test.bc duplicate the debug values over and over again. The pass itself will not increase in compile time but a later pass Dwarf Debug will take a lot extra time when it try to handle all those extra debug values.

However, the reason for Live Debug Variables to duplicate the debug values over and over again is because the debug information given as input to Live Debug Variables are broken. This have been broken even before the introduction of the patch D38229 (probably even in LLVM 4.0) but it have not shown in increased compile time (but only in broken debug information that lead to wrong value printed in a debugger (gdb) when debugging optimized code).

The origin of the fault seems to be that the "X86 DAG->DAG Instruction Selection" pass don't always know where to insert debug values and then seems to insert them in the beginning of the basic block. Then we have a use of a virtual register that is not defed yet. That is bogus and have no meaning. I guess that the verifiers don't check liveness for debug values. Maybe it is possible to extend verifier (for debug info) to check for this.

When the Live Debug Variable pass remove the debug values before register allocation it interpret this as the debug value is live from the basic block start (which is not true). I suggest that the Live Debug Variable should detect debug values with uses of virtual registers that are not defed and simply throw away those debug values (at least until a real fix in the DAG->DAG Instruction Selection pass in available).

Then someone need to look at the DAG->DAG Instruction Selection pass and figure out why it insert DBG_VALUES at the start of the basic block.

Quuxplusone commented 6 years ago

Attached bugpoint-reduced-simplified.ll (95772 bytes, text/plain): llc bugpoint-reduced-simplified.ll -O2 -relocation-model=pic -o bugpoint-reduced-simplified.s -print-after-all -debug

Quuxplusone commented 6 years ago

Thanks for your analysis, Karl-Johan! We spent a lot of time optimizing LiveDebugValues, so a performance degradation in this pass usually points to some earlier pass inserting too many debug intrinsics, as you also found out.

Quuxplusone commented 6 years ago

Karl-Johan er sorry if the last bytecode I posted didn't work! Not entirely sure what was up with that... It sounds, though, like you've narrowed it down elsewhere?

Also, when you say that the debuginfo is erroneous, is that the output of rustc which is erroneous or LLVM's internal representation?

Quuxplusone commented 6 years ago
(In reply to Alex Crichton from comment #9)
> Karl-Johan er sorry if the last bytecode I posted didn't work! Not entirely
> sure what was up with that... It sounds, though, like you've narrowed it
> down elsewhere?
>
> Also, when you say that the debuginfo is erroneous, is that the output of
> rustc which is erroneous or LLVM's internal representation?

I think this comment from Karl-Johan answers that question: "The origin of the
fault seems to be that the "X86 DAG->DAG Instruction Selection" pass don't
always know where to insert debug values and then seems to insert them in the
beginning of the basic block."

Without much knowledge about rustc our conclusion was that it is the
instruction selection that transforms IR like this:
  %a = ...
  dbg.value( "x[0] is in %a" )
  %b = ...
  dbg.value( "x[1] is in %b" )
  %c = ...
  dbg.value( "x[2] is in %c" )
into
  DBG_VALUE( "x[0] is in %vrega" )
  DBG_VALUE( "x[1] is in %vregb" )
  DBG_VALUE( "x[2] is in %vregc" )
  %vrega = instr1
  %vregb = instr2
  %vregc = instr3

When the DBG_VALUE instructions are referring to a virtual register that isn't
defined yet the backend it will be mapped to incorrect slot index ranges during
register allocation (and live debug variable analysis).

If for example the register allocator is reusing the same physical register
(ecx) for all those virtual register we end up with something like this after
register allocation etc:
  DBG_VALUE( "x[0] is in %rcx" )
  DBG_VALUE( "x[1] is in %rcx" )
  DBG_VALUE( "x[2] is in %rcx" )
  %rcx = instr1
  DBG_VALUE( "x[0] is in %rcx" )
  DBG_VALUE( "x[1] is in %rcx" )
  DBG_VALUE( "x[2] is in %rcx" )
  %rcx = instr2
  DBG_VALUE( "x[0] is in %rcx" )
  DBG_VALUE( "x[1] is in %rcx" )
  DBG_VALUE( "x[2] is in %rcx" )
  %rcx = instr3
  DBG_VALUE( "x[0] is in %rcx" )
  DBG_VALUE( "x[1] is in %rcx" )
  DBG_VALUE( "x[2] is in %rcx" )

If stepping in the debugger it will look like x[0]==x[1]==x[2]==%rcx, but that
is not really true.

The correct code after ISel for my example above should be
  %vrega = instr1
  DBG_VALUE( "x[0] is in %vrega" )
  %vregb = instr2
  DBG_VALUE( "x[1] is in %vregb" )
  %vregc = instr3
  DBG_VALUE( "x[2] is in %vregc" )
and that would result in code like this after register allocation etc:
  %rcx = instr1
  DBG_VALUE( "x[0] is in %rcx" )
  %rcx = instr2
  DBG_VALUE( "x[1] is in %rcx" )
  %rcx = instr3
  DBG_VALUE( "x[2] is in %rcx" )

As seen above the amount of DBG_VALUE instructions increase after register
allocation when the DBG_VALUE instructions are scheduled before the live range
of the virtual register that it is associated with. In your example with "slow
bytecode" we ended up with lots of DBG_VALUE instructions which slowed down the
last passes in the compiler. Most of those DBG_VALUE instructions were faulty,
in the sense that they would give incorrect information to the debugger.

One solution is to fix the fault in ISel. Another solution could be too try to
discard this kind of faulty DBG_VALUE instructions (with use before def of a
virtual register) before register allocation. The latter solution could avoid
faulty values when debugging (instead you'll get <optimized out>), and it could
help out to avoid the slow down when compiling. However, fixing the ISel would
be needed to also get richer (and still correct) debug information.
I actually think that we should try to implement both solutions. Both
correcting ISel and also making the later passes more robust to bad input.

PS. I hope the kind of pseudo-code I used above is possible to understand.

Regards,
Björn (co-worker to Karl-Johan who will be on vacation for a few days)
Quuxplusone commented 6 years ago

Ok sounds good to me, thanks for the info! If y'all need anything else from our end please just let me know!

Quuxplusone commented 6 years ago

I don't think we can block the 6.0 release on this.

Quuxplusone commented 6 years ago
Here is a proposal patch to let LiveDebugVariables discard DBG_VALUE
instructions that otherwise could result in faulty debug information being
generated:
  https://reviews.llvm.org/D43956
Quuxplusone commented 6 years ago

Ok I've now also confirmed that the https://reviews.llvm.org/D43956 fixes the issue on our end as well, thanks so much Bjorn!