[DebugInfo] Malformed DWARF produced during LTO build

Quuxplusone commented 3 years ago


Bugzilla Link	PR48790
Status	RESOLVED FIXED
Importance	P release blocker
Reported by	Jeremy Morse (jeremy.morse.llvm@gmail.com)
Reported on	2021-01-18 09:39:17 -0800
Last modified on	2021-08-23 14:00:21 -0700
Version	trunk
Hardware	PC Linux
CC	aprantl@apple.com, dblaikie@gmail.com, greg.bedwell@sony.com, jdevlieghere@apple.com, kyulee.llvm@gmail.com, llvm-bugs@lists.llvm.org, modocache@gmail.com, paul.robinson@am.sony.com, paul_robinson@playstation.sony.com, tstellar@redhat.com, vsk@apple.com, warren_ristow@playstation.sony.com
Fixed by commit(s)	rGef0dcb506300dc9644e8000c6028d14214be9d97
Attachments	`reproducer-48790.ll` (5323 bytes, text/plain)
Blocks	PR49317
Blocked by
See also	PR49159

The reproducer below causes LLVM master (93592b726c7) to produce malformed DWARF when built with "-O3 -g -flto". Running llvm-dwarfdump -verify complains "invalid DIE reference", and our debugger becomes displeased. Reverting e08f205f / D70350, "Allow cross-CU references of subprogram definitions", fixes it for us.

Sources:

$ cat -n 1.cpp 1 struct HHH; 2 3 struct xxx { 4 HHH yyy(); 5 } zzz; 6

$ cat -n 2.cpp 1 int ggg; 2 3 struct HHH { 4 template int ccc(bbb) { 5 while (--ggg) 6 ; 7 } 8 }; 9 10 void eee() { 11 HHH fff; 12 fff.ccc([] {}); 13 } 14

Built thus (the target is important it seems): $ clang++ -O3 -g -flto -w -c 1.cpp -o 1.o -target x86_64-scei-ps4 $ clang++ -O3 -g -flto -w -c 2.cpp -o 2.o -target x86_64-scei-ps4 $ ld.lld 1.o 2.o -o 3.out -r $ llvm-dwarfdump -verify 3.out

Produces: --------8<-------- error: invalid DIE reference 0x00000046. Offset is in between DIEs:

0x000000a1: DW_TAG_inlined_subroutine DW_AT_abstract_origin (0x00000046) DW_AT_low_pc (0x0000000000000000) DW_AT_high_pc (0x0000000000000002) DW_AT_call_file ("1.cpp") DW_AT_call_line (12) -------->8--------

I'm attaching an IR reproducer, derived from "ld.lld -save-temps" and the combined-and-optimized IR it produces. Feeding that IR into llc also produces an invalid DIE reference.

I've done some prodding, and while to me the DWARF emission code is a twisty turny maze of passages all alike, I believe the problem is that the DIE for the "eee" function is placed in the wrong compile unit. Here's a chain of events:

The DIE for "HHH" is placed in the CU for 1.cpp, because the "zzz" global var is emitted there.
The DIE for "ccc" is created in the context of "HHH", CU is 1.cpp
The template argument for "ccc", a lambda type, is created
The scope for that template argument is needed: so a DIE "eee" is created.
That DIE for "eee" is in the "current" unit, 1.cpp

Normally this wouldn't be a problem. I believe that at the end of DwarfDebug::endFunctionImpl, normally a duplicate DIE would be created for "eee", because a cross-CU subprogram wouldn't be permitted. However, with e08f205f a cross-CU subprogram is permitted.

All the child scopes of "eee" are created before "eee" is created, they expect to be in the CU that the DISubprogram for "eee" specifies, 2.cpp, and they pick references of form "DW_FORM_ref4". They're then attached to a subprogram DIE in a different CU, breaking those forms of references. If you use llvm-dwarfdump -v, you can see that the offset of the DW_AT_abstract_origin above would be correct, if it was in the correct CU (hat-tip James Henderson).

I don't really have a proposal for how to fix this right now, DWARF emission is way out of my comfort zone.

Quuxplusone commented 3 years ago

Attached reproducer-48790.ll (5323 bytes, text/plain): LLVM-IR to reproduce when fed to llc

Quuxplusone commented 3 years ago

Yep, looks like a bug to me.

Slightly smaller repro (& also reproduces on x86, and with full (not only -r)
linking):

1.cpp:
  struct HHH;
  HHH *zzz;

2.cpp:
  void __attribute__((optnone)) __attribute__((nodebug)) f1()  { }

  struct HHH {
    template <typename bbb>
    static int __attribute__((always_inline)) ccc() {
      f1();
    }
  };

  int main() {
    struct local { };
    HHH::ccc<local>();
  }

As for fixing it, I think the most likely culprit/place to start would be the
logic that places the definition of "eee" (or "main" in my example) in the
wrong CU. I don't think there's any good reason for that to happen.

(I was hoping to tickle a possibly more severe version of this bug by making
eee file-scoped static (because then moving it between CUs could be really
problematic in other ways - if there was another static function with the same
name in that other CU, etc) but seems something different happens in that case
that doesn't trigger the bug - not sure why, don't think we do anything
significantly different in the DWARF for file-scoped static entities)

Quuxplusone commented 3 years ago

The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me.
Compiled separately, 1.cpp would obviously have only a forward declaration.
I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in
1.cpp?  Possibly a side-effect of ctor homing?  (I haven't looked at the
insides of this at all, I'm just doing my usual black-box guessing based on
the symptoms.)

Quuxplusone commented 3 years ago

> The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me.
> Compiled separately, 1.cpp would obviously have only a forward declaration.

It's the presence of the global that's causing it to be emitted in 1.cpp. I
imagine something similar to the subprogram happens to the type: it just gets
created in the "current" CU.

> I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in
> 1.cpp?

Yup,

> Possibly a side-effect of ctor homing?

I don't believe ctor homing is turned on by default yet, I see -debug-info-
kind=limited and -fuse-ctor-homing is absent in my clang -### commands.

Quuxplusone commented 3 years ago

(CC'ing modocache as he reported reverting D70350 downstream after experiencing difficulties with it).

Quuxplusone commented 3 years ago

Possible patch: https://reviews.llvm.org/D94976

Quuxplusone commented 3 years ago

(In reply to Paul Robinson from comment #3)
> The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me.
> Compiled separately, 1.cpp would obviously have only a forward declaration.
> I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in
> 1.cpp?  Possibly a side-effect of ctor homing?  (I haven't looked at the
> insides of this at all, I'm just doing my usual black-box guessing based on
> the symptoms.)

Type definitions (& declarations) are deduplicated at the LLVM IR level based
on their mangled name to reduce IR (& resulting DWARF) size. This is
expected/desired - perhaps slightly not-what-the-source-says (but that's always
going to be the case - if the type was defined in both files, we still wouldn't
want to produce definitions in both due to size concerns).

Quuxplusone commented 3 years ago

& yeah, nothing to do with ctor or other (vtable, etc) homing strategies - IR debug info metadata type deduplication would be done even with -fstandalone-debug (ie: without any type homing enabled).

For what it's worth: Type homing is a purely frontend/IR generation feature and LLVM has no knowledge of or explicit support for type homing.

IR Type deduplication does rely on the frontend providing a mangled name in the type description metadata - this is how the frontend signals to LLVM that the type is subject to the ODR and so LLVM can drop duplicates based on name alone.

Quuxplusone commented 3 years ago

Hi Tom, could you pull-up / cherry-pick the fix for this ticket, ef0dcb50630 [0], into the LLVM 12 release branch? In rare circumstances LTO will produce a file with illegal DWARF syntax.

I don't think it's needed for rc1.

[0] https://reviews.llvm.org/rGef0dcb506300dc9644e8000c6028d14214be9d97

Quuxplusone commented 3 years ago

Hi Tom, could you pull-up / cherry-pick the fix for this ticket

Actually, hold on this, there's a report in D95622 that it's causing a crash. I'd still like this to be in llvm-12, but best to wait a little bit.

Quuxplusone commented 3 years ago

Sitrep: D94976 continues to run into trouble. There's an increased risk that this won't make llvm-12. Technically this isn't a regression because the buggy behaviour is in llvm-11, but emitting broken DWARF is far from ideal, of course.

Quuxplusone commented 3 years ago

I'm getting the impression that cross-CU DIE emission is being done in an
ad-hoc on-demand way, and keeping the contexts straight runs into all sorts
of corner cases.
Would we be better off reorganizing it so we emit one CU at a time, and
keep some sort of side/work list of 'stuff some other CU needs'?
This would require deferring filling in the actual value of the reference
because we don't know the section offset yet; either as something to be
backpatched when we get around to emitting the other DIE, or by using
relocations and letting the linker deal with it.
The cross-CU references would all be DW_FORM_ref_addr so the size is known.

Quuxplusone commented 3 years ago

(In reply to Paul Robinson from comment #12)
> I'm getting the impression that cross-CU DIE emission is being done in an
> ad-hoc on-demand way, and keeping the contexts straight runs into all sorts
> of corner cases.
> Would we be better off reorganizing it so we emit one CU at a time,

Practically, we can't - CUs are built as functions are code generated - and not
all the functions for a given CU are grouped together. I guess we could change
that, but it'd be a significant redesign of DWARF building code & I'm not sure
it'd significantly improve this issue.

> and
> keep some sort of side/work list of 'stuff some other CU needs'?
> This would require deferring filling in the actual value of the reference
> because we don't know the section offset yet;

Perhaps an aside, but FWIW, that part of the architecture already exists - we
don't compute the DIE offsets immediately (because we go back and add more
attributes to DIEs later on, for instance - so we can't know offsets) - so when
we created a cross-DIE reference, one DIE's attribute refers/points to the
other DIE then later on when they're emitted it's two-pass, one commits offsets
(says "we're not changing anything, so figure out how big everything is/what
the offsets are) and then emitting the bytes where the attribute emission can
query the final offset of the DIE being referred to and use that.

>  either as something to be
> backpatched when we get around to emitting the other DIE, or by using
> relocations and letting the linker deal with it.

There are some other benefits if we were to move to using relocations for a lot
more DWARF output, including for DIE offsets. (would make for more
editable/legible DWARF assembly) but probably not much difference to this
situation.

(In reply to Jeremy Morse from comment #11)
> Sitrep: D94976 continues to run into trouble. There's an increased risk that
> this won't make llvm-12. Technically this isn't a regression because the
> buggy behaviour is in llvm-11, but emitting broken DWARF is far from ideal,
> of course.

Yep, I think it'd be best to not try to push this into LLVM 12 - as we've seen,
there's a lot of nuance in this sort of code.

Quuxplusone commented 3 years ago

Oh, I shuold also say it'd be complicated to emit a whole CU in a standalone
way - in general CU IR Metadata doesn't contain a comprehensive list of its
contents. In the interests of reducing/eliding dead DWARF, most of the entities
refer to the CU as their context, but without the CU listing/knowing about that
item.

eg: a simple inlining.
 CU1 and CU2 metadata exist but refer to no entities.
 SP2 refers to CU2 as its unit.
 SP1 refers to CP1 as its unit.
 There's an llvm::Function func2.
 func2's subprogram field refers to SP2.
 some instructions in func2 have a debug location that refers to SP1

So the only way SP1 is reachable is through func2

Quuxplusone commented 3 years ago

What's the status of this bug? Have the problems with the fix been resolved?

Quuxplusone commented 3 years ago

Hi Tom,

What's the status of this bug? Have the problems with the fix been resolved?

They have not, and it seems unlikely that it'll be sufficiently addressed and tested in time for 12.0.

Technically this isn't a regression -- the behaviour behind this bug was present in llvm-11, and it appears to trigger rarely. We only ran into it on a very large C++ code base when using LTO. It can likely be left until 12.0.1, David agrees in comment 13

An alternative is to revert e08f205f, which unmasked this problem, and is what we're doing downstream. I believe that would decrease the quality of DWARF5 call site information, @aprantl @vsk would you be able to give an opinion on how you'd weight reverting versus sticking with the current behaviour?

Quuxplusone commented 3 years ago

This bug was not resolved in time for the 12.0.0 release, so it will have to wait for 12.0.1.

If you feel this is a high-priority bug that should be fixed for 12.0.0, please re-add release-12.0.0 to the Blocks field and add a comment explaining why this is high-priority.

Quuxplusone commented 3 years ago

This bug still repros in the latest upstream compiler
We've seen this case (the same verification error) frequently in LTO build for
large applications with XCode.
I don't have a cutdown but backing out e08f205f / D70350 seems to fix the issue
for us. I wonder whether there is a follow-up on this bug?

Quuxplusone commented 3 years ago

To respond to Jeremy's question (from comment #16): reverting D70350 might regress the debugging experience for some of our low-level projects that build with LTO, but it's not something to rule out, esp. if D94976 won't land any time soon (and, of course, given that malformed DWARF is no good).

Quuxplusone commented 3 years ago

I suspect D94976 isn't going anywhere soon -- there seems to be a significant amount of context encoded in "the current unit" that I don't fully understand yet.

FTR, we've reverted D70350 downstream as it affects a codebase that's important to us. I'd suggest that we revert in llvm main, seeing how it's affecting more than one person/org, and broken DWARF is more of a blocker than reduced debug experience (alas).

Quuxplusone commented 3 years ago

Uploaded a revert-and-regression-test patch at https://reviews.llvm.org/D107076

Quuxplusone commented 3 years ago

Ultimately this was reverted in d4ce9e463d51 and cherry-picked into llvm-13 via bug 51426.

Quuxplusone / LLVMBugzillaTest

[DebugInfo] Malformed DWARF produced during LTO build #47759