Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[LTO] -fwhole-program-vtables causes bad generated code in hybrid lto/non-lto build #44682

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR45712
Status NEW
Importance P normal
Reported by Sylvain Audi (sylvain.audi@ubisoft.com)
Reported on 2020-04-28 07:18:06 -0700
Last modified on 2020-05-21 12:03:35 -0700
Version trunk
Hardware PC Windows NT
CC alex_toresh@yahoo.fr, llvm-bugs@lists.llvm.org, peter@pcc.me.uk, rnk@google.com, tejohnson@google.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
When trying to enable Thin-LTO on one of our games, we found invalid generated
code which caused an infinite loop when we enabled -the `-fwhole-program-
vtables`.
The bug also appeared when we tried LTO.

I mentionned this bug in comments of D55153

Note that we build the game against some libraries that were not built with
LTO/ThinLTO.

Here it is, reduced as a relatively small repro:

$cat main.cpp
#include "inc.h"
extern void UseExternal();
int main() {
  UseExternal();
  return 0;
}

$cat inc.h
class A {
public:
  virtual void f() {
    pNext->f();
  }
  A* pNext = nullptr;
};

class B : public A {
public:
  virtual void f() {}
};

$cat external.cpp
#include "inc.h"
static A s_A;
static B s_B;
void UseExternal() {
  s_A.pNext = &s_B;
  (&s_A)->f();
}

$cat def.cpp
#include "inc.h"
A g_A;

$clang-cl main.cpp def.cpp -c /Z7 /O2 /GR- -flto -fwhole-program-vtables
$clang-cl extern.cpp -c /Z7 /O2 /GR-
$lld-link /NODEFAULTLIB /ENTRY:main /OUT:main.exe def.obj main.obj external.obj

In the generated code, we can see that A::f() is generated as an endless loop.

14000103f: cc       int3
140001040: eb fe    jmp    0x140001040     ; endless loop
140001042: cc       int3

It looks like it was wrongly devirtualized:
  virtual void f() {
    pNext->f();
  }
  where pNext->f() becomes an endless recursion.

IMPORTANT: The bug appears only if def.obj is the first object in the lld-link
command.

This bug is pretty fragile to reproduce.
The repro above is with clang-cl / lld-link; I didn't manage to repro it under
clang/ld.lld.
Quuxplusone commented 4 years ago

I can see why the incorrect devirtualization is done. When def.obj is linked first, it includes a definition of the vtable for A (but not for B), and it is marked with LinkageUnit level vcall_visibility metadata. So the LTO link only sees a single vtable def and thinks that it is safe to devirtualize.

With clang++ it doesn't occur, because the vtable def for A in def.o has Public vcall visibility, so LTO knows it isn't safe to devirtualize.

When def.obj isn't first on the link line, the prevailing defs come from external.obj, which is native code so LTO doesn't see any vtable defs and can't devirtualize.

I don't know much about the COFF compiler, hopefully someone who is (pcc@?) can comment as to why the default visibility is different there.

Quuxplusone commented 4 years ago

According to: https://clang.llvm.org/docs/LTOVisibility.html

"When targeting Windows, classes with dllimport or dllexport attributes receive public LTO visibility. All other classes receive hidden LTO visibility."

So is this a source code issue where it should be marked with a dll attribute?

Quuxplusone commented 4 years ago
Or just mark it with an lto visibility attribute. From further down that page:

"A class defined in a translation unit built without LTO receives public LTO
visibility regardless of its object file visibility, linkage or other
attributes."
...
"1. As a corollary of the above rules, if a linkage unit is produced from a
combination of LTO object files and non-LTO object files, any hidden visibility
class defined in both a translation unit built with LTO and a translation unit
built without LTO must be defined with public LTO visibility in order to avoid
an ODR violation."
...
Classes that fall into either of these categories can be marked up with the
[[clang::lto_visibility_public]] attribute.
Quuxplusone commented 4 years ago

All the external libraries are statically linked, we're not using dll-importing because of the runtime cost of the DLL thunks.

We need a way to detect these errors early on, not wait at runtime. In this specific case, the error occurred shortly after booting the program, but we're worried there might be other dormant issues like this.

Maybe using ASAN will be one way to detect such errors? (https://github.com/google/sanitizers/wiki/AddressSanitizerOneDefinitionRuleViolation) But again it's tedious because the detection occurs at runtime.

Do you think it would be possible to warn in the linker?

Even at that, we're using quite a few commercial libraries which we don't have the code for. Would that mean that we'd somehow need to tag all those public APIs with [[clang::lto_visibility_public]] ? (or maybe a compiler option?)

Quuxplusone commented 4 years ago

I think the short answer is that -fwhole-program-vtables only works when the entire DSO participates in LTO, i.e. all the vtable uses are in LLVM IR. The feature sort of assumes that most statically linked code will be IR compiled from source. The feature overloads the platform's standard DSO visibility markers (default/hidden visibility or dllimport/export) to detect classes that are used across DSO (and therefore LTO) boundaries. The LTO visibility markers exist as an escape hatch for cases when those DSO markers don't exactly match the native/bitcode boundaries.

I suspect it's infeasible for the linker to warn about this. You don't need to link against the vtable from native code to rely on its layout. A plain virtual call does not reference the class vtable or any other symbols.

We could come up with a safer mode where the defaults are inverted: only classes carrying an explicit attribute (and perhaps their derived classes) participate in vtable optimization in LTO. Maybe this mode already exists. I think there is prior art in the -flto-visibility-public-std flag.

Apologies for any innaccuracies above, I am not deeply familiar with this feature.

Quuxplusone commented 4 years ago
I think in our case, we are in a similar context from the runtime libraries
when building with /MT: we statically link with objects that were not built
with LTO.

ODR with LTO visibility:
------------------------

If I understand correctly, our problem is an ODR violation in this scenario:
 - a lib (e.g. runtime lib) was built using its own headers without LTO (all classes have public LTO visibility)
 - Program built with LTO uses the same headers, which classes get hidden LTO visibility.

Our crash with -fwhole-program-vtables:
--------------------------------------

Invalid devirtualization may happen when a class vtable is defined in both an
LTO and non-LTO translation units.

A class vtable can be defined:
 1) in a single translation unit: the one that defines its key function (basically, the first non-inline non-pure virtual method). No ODR here.

 2) otherwise in any translation unit that uses that class. *This is the problematic case*. Pure abstract interfaces, templated / fully inlined polymorphic classes, all have that problem.

Mitigation:
-----------

-flto-visibility-public-std  addresses this based on namespace (all classes in
namespace "std" or "stdext" have public LTO visibility).

For other libs, a similar namespace-based solution might work; or maybe
specifying include folders for which lto visibility is forced to public, in a
similar manner as system includes.

Our main issue is the validity of our builds:
Building the game with LTO, linked statically with non-LTO external libraries
is theoratically a potential source of ODR violations for LTO visibility. It
may even silently generate crashing code, as we experienced with -fwhole-
program-vtables.

We would need a way to verify that we don't have such problems.