Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Improve code size with a more space-efficient vtable ABI #26722

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR26723
Status NEW
Importance P normal
Reported by Peter Collingbourne (peter@pcc.me.uk)
Reported on 2016-02-24 02:00:19 -0800
Last modified on 2021-05-13 13:55:58 -0700
Version trunk
Hardware PC All
CC agrieve@google.com, compnerd@compnerd.org, daniel.kiss@arm.com, dexonsmith@apple.com, froydnj@gmail.com, hans@chromium.org, hfinkel@anl.gov, kcc@google.com, krasin@google.com, llvm-bugs@lists.llvm.org, llvm@sunfishcode.online, pageexec@gmail.com, paul_robinson@playstation.sony.com, rafael@espindo.la, rjmccall@apple.com, rnk@google.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
A more space-efficient representation for virtual tables would store 32-bit
address-point-relative offsets to virtual functions instead of virtual
function pointers. By using this representation, we avoid the need to emit
dynamic relocations for the virtual function pointers. On ELF, taking into
account the size of the dynamic relocation entry, this would save 32-4=28
bytes per virtual table slot on 64-bit architectures, or 16-4=12 bytes per
slot on 32-bit architectures.

This is an ABI break, but we can practically deploy it with whole program
visibility, which can be assured at LTO time using something along the lines
of the bitset metadata currently produced by -fwhole-program-vtables.

By avoiding these dynamic relocations, this would also save the program some
startup time, and when RTTI is disabled there would be no need to store any
dynamically relocated pointers in the vtable region, so the vtable pages could
be shared among processes.

See also PR24782. I suspect that vtables are writable on Mac because of
these dynamic relocations.

Code generation example:

#include <stdint.h>

struct A;

struct A_vtable {
  int32_t f_offset;
  int32_t g_offset;
  int32_t h_offset;
  void (*i)(A *);
};

struct A {
  A_vtable *vtable;
};

void fcall(A *a) {
  auto vt = a->vtable;
  auto fp = reinterpret_cast<void (*)(A *)>(reinterpret_cast<char *>(vt) + vt->f_offset);
  fp(a);
}

void gcall(A *a) {
  auto vt = a->vtable;
  auto fp = reinterpret_cast<void (*)(A *)>(reinterpret_cast<char *>(vt) + vt->g_offset);
  fp(a);
}

void hcall(A *a) {
  auto vt = a->vtable;
  auto fp = reinterpret_cast<void (*)(A *)>(reinterpret_cast<char *>(vt) + vt->h_offset);
  fp(a);
}

void icall(A *a) {
  a->vtable->i(a);
}

Generated code (x86-64):

0000000000000000 <_Z5fcallP1A>:
   0:   48 8b 07                mov    (%rdi),%rax
   3:   48 63 08                movslq (%rax),%rcx
   6:   48 01 c1                add    %rax,%rcx
   9:   ff e1                   jmpq   *%rcx
   b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000000010 <_Z5gcallP1A>:
  10:   48 8b 07                mov    (%rdi),%rax
  13:   48 63 48 04             movslq 0x4(%rax),%rcx
  17:   48 01 c1                add    %rax,%rcx
  1a:   ff e1                   jmpq   *%rcx
  1c:   0f 1f 40 00             nopl   0x0(%rax)

0000000000000020 <_Z5hcallP1A>:
  20:   48 8b 07                mov    (%rdi),%rax
  23:   48 63 48 08             movslq 0x8(%rax),%rcx
  27:   48 01 c1                add    %rax,%rcx
  2a:   ff e1                   jmpq   *%rcx
  2c:   0f 1f 40 00             nopl   0x0(%rax)

0000000000000030 <_Z5icallP1A>:
  30:   48 8b 07                mov    (%rdi),%rax
  33:   ff 60 10                jmpq   *0x10(%rax)

Generated code (Thumb-2):

00000000 <_Z5fcallP1A>:
   0:   6801        ldr r1, [r0, #0]
   2:   680a        ldr r2, [r1, #0]
   4:   4411        add r1, r2
   6:   4708        bx  r1

00000008 <_Z5gcallP1A>:
   8:   6801        ldr r1, [r0, #0]
   a:   684a        ldr r2, [r1, #4]
   c:   4411        add r1, r2
   e:   4708        bx  r1

00000010 <_Z5hcallP1A>:
  10:   6801        ldr r1, [r0, #0]
  12:   688a        ldr r2, [r1, #8]
  14:   4411        add r1, r2
  16:   4708        bx  r1

00000018 <_Z5icallP1A>:
  18:   6801        ldr r1, [r0, #0]
  1a:   68c9        ldr r1, [r1, #12]
  1c:   4708        bx  r1

Generated code (ARM64):

0000000000000000 <_Z5fcallP1A>:
   0:   f9400008    ldr x8, [x0]
   4:   b9800109    ldrsw   x9, [x8]
   8:   8b090101    add x1, x8, x9
   c:   d61f0020    br  x1

0000000000000010 <_Z5gcallP1A>:
  10:   f9400008    ldr x8, [x0]
  14:   b9800509    ldrsw   x9, [x8,#4]
  18:   8b090101    add x1, x8, x9
  1c:   d61f0020    br  x1

0000000000000020 <_Z5hcallP1A>:
  20:   f9400008    ldr x8, [x0]
  24:   b9800909    ldrsw   x9, [x8,#8]
  28:   8b090101    add x1, x8, x9
  2c:   d61f0020    br  x1

0000000000000030 <_Z5icallP1A>:
  30:   f9400008    ldr x8, [x0]
  34:   f9400901    ldr x1, [x8,#16]
  38:   d61f0020    br  x1

So overall code size penalty for call sites is 1 register (on each
architecture) plus 6 bytes (x86-64), 2 bytes (Thumb-2) or 4 bytes (ARM64) of
instructions per call site. Because all byte displacements would be halved
on 64-bit architectures, costs would be smaller for slots 16-31 on x86-64,
as these can now be called using an 8-bit ModR/M displacement, saving 3 bytes.

I think this will most likely be worth it. There would need to be approximately
5 (x86-64), 6 (Thumb-2) or 7 (ARM64) call sites per vtable slot for the
costs in code size to outweigh the benefits. Because each derived class adds
a new set of slots, the benefit would scale linearly with the size of the
class hierarchy.
Quuxplusone commented 8 years ago

I can report that a prototype of this was able to shrink Chromium's code size by 9%. For more details, see the Chromium bug report: https://bugs.chromium.org/p/chromium/issues/detail?id=589384

I intend to send a proposal to llvm-dev and cfe-dev imminently.

Quuxplusone commented 8 years ago

You're relying on an assumption — that the virtual function (or a thunk thereof) lies at a 32-bit linker-constant offset from the v-table — that should be equally sufficient for the linker to guarantee that the virtual function pointer can merely be slid, rather than dynamically looked up. If ELF really imposes 3 full pointers of overhead for every pointer it merely needs to slide at load time, allow me to recommend tackling that glaring code-size inefficiency before playing around with variant v-table ABIs.

Note also that your calculation is not accounting for the potential code-size cost of a thunk when the virtual function resolves outside of the linkage unit. This should be a relatively rare occurrence, of course.

Anyway, this is an interesting idea. I am extremely skeptical that a backend analysis would actually be able to achieve this safely, but certainly some combination of frontend heuristics and source annotations ought to be sufficient to allow a class-specific variant v-table ABI. It probably needs to be opt-in, but that's presumably fine for for Chromium's purposes.

Quuxplusone commented 8 years ago
(In reply to comment #2)
> You're relying on an assumption — that the virtual function (or a thunk
> thereof) lies at a 32-bit linker-constant offset from the v-table — that
> should be equally sufficient for the linker to guarantee that the virtual
> function pointer can merely be slid, rather than dynamically looked up.  If
> ELF really imposes 3 full pointers of overhead for every pointer it merely
> needs to slide at load time, allow me to recommend tackling that glaring
> code-size inefficiency before playing around with variant v-table ABIs.

For what it is worth, mozilla wrote a hack to do that:

https://github.com/mozilla/gecko-dev/blob/master/build/unix/elfhack/elfhack.cpp

The non hack version would probably be to define a dedicated Elf_RelRelative
section type in addition to the existing Elf_Rel and Elf_Rela.
Quuxplusone commented 8 years ago

Oh, the ELF slide relocation really is that big? Sorry for being snide, then, and my condolences. That really is something worth bringing up with the ELF committee, but I imagine that'll be a prolonged process. Mozilla's hack might help in the meantime.

Still, I agree that there's independent value in pursuing a variant ABI. Like I said, I think the most promising approach will be to find some way to change v-table layout assumptions in the frontend. The most reliable way to do that is to have an attribute on polymorphic classes which requests the variant ABI for its v-table, and which would propagate to subclasses automatically. You could also have a compiler flag which enables some sort of heuristic approach, like enabling the variant ABI on every class with hidden visibility. Neither of those approaches will covers standard library v-tables like, say, the one in std::function, but that's very hard to do without unsafe assumptions like "we do not need to interoperate with any std::function code compiled separately".

Quuxplusone commented 8 years ago
(In reply to comment #4)
> Oh, the ELF slide relocation really is that big?  Sorry for being snide,
> then, and my condolences.  That really is something worth bringing up with
> the ELF committee, but I imagine that'll be a prolonged process.  Mozilla's
> hack might help in the meantime.

The problem is that it is not just the "slide relocation". The problem is that
every relocation is fully general, with offset, target, addend and type. What
is needed is a specialized representation for the relative/slide relocations,
which in ELF would probably be done with a new relocation section type (the
Elf_RelRelative I mentioned).
Quuxplusone commented 8 years ago
> Still, I agree that there's independent value in pursuing a variant ABI.
Like I said, I think the most promising approach will be to find some way to
change v-table layout assumptions in the frontend.  The most reliable way to do
that is to have an attribute on polymorphic classes which requests the variant
ABI for its v-table, and which would propagate to subclasses automatically.
You could also have a compiler flag which enables some sort of heuristic
approach, like enabling the variant ABI on every class with hidden visibility.
Neither of those approaches will covers standard library v-tables like, say,
the one in std::function, but that's very hard to do without unsafe assumptions
like "we do not need to interoperate with any std::function code compiled
separately".

We have had success with something like your heuristic approach to mark up
externally-visible classes for Chromium's implementation of control-flow
integrity (which only works with non-externally-visible vtables). The approach
is based on class names and attributes, and is controlled by a blacklist. By
default, this blacklist includes the standard library namespaces (std, and
stdext on Windows) as well as classes with the uuid attribute (which indicate
COM classes). This is also the approach that the new -fwhole-program-vtables
feature uses to control its optimization, and the approach that I'll want to
use here as well. This is the default blacklist:

https://llvm.org/svn/llvm-project/cfe/trunk/runtime/vtables_blacklist.txt

and Chromium's blacklist:

https://chromium.googlesource.com/chromium/src/+/master/tools/cfi/blacklist.txt

(In reply to comment #5)
> (In reply to comment #4)
> > Oh, the ELF slide relocation really is that big?  Sorry for being snide,
> > then, and my condolences.  That really is something worth bringing up with
> > the ELF committee, but I imagine that'll be a prolonged process.  Mozilla's
> > hack might help in the meantime.

Chromium for Android uses a similar hack:

https://chromium.googlesource.com/android_tools/+/master/ndk/sources/android/crazy_linker/src/crazy_linker_elf_relocations.cpp

This doesn't really solve the problem of increased memory consumption per-
process though. This is particularly important for Chromium because of its
multi-process architecture.

> The problem is that it is not just the "slide relocation". The problem is
> that every relocation is fully general, with offset, target, addend and
> type. What is needed is a specialized representation for the relative/slide
> relocations, which in ELF would probably be done with a new relocation
> section type (the Elf_RelRelative I mentioned).

While I think we'll want to use the new ABI for vtables because of the memory
consumption issue, we'd probably still be interested in a non-hackish way of
representing the remaining relative relocations for things like constant string
pointers, which account for about 25% of relocations in Chromium.

I wonder whether we can introduce this new relocation section type in a
backwards compatible way, perhaps by having the linker inject code into the
executable/DSO that would apply the relocations if they haven't already been
applied by the dynamic loader.
Quuxplusone commented 8 years ago

I'm not going to support a blacklist approach, sorry.

Quuxplusone commented 8 years ago
(In reply to comment #5)
> (In reply to comment #4)
> > Oh, the ELF slide relocation really is that big?  Sorry for being snide,
> > then, and my condolences.  That really is something worth bringing up with
> > the ELF committee, but I imagine that'll be a prolonged process.  Mozilla's
> > hack might help in the meantime.
>
> The problem is that it is not just the "slide relocation". The problem is
> that every relocation is fully general, with offset, target, addend and
> type. What is needed is a specialized representation for the relative/slide
> relocations, which in ELF would probably be done with a new relocation
> section type (the Elf_RelRelative I mentioned).

I see.  Yes, in that case, I completely agree.
Quuxplusone commented 8 years ago
(In reply to comment #7)
> I'm not going to support a blacklist approach, sorry.

Care to explain why? It's certainly not the cleanest possible design, but it
does seem to be the most pragmatic, and it has been proven by the work on
control flow integrity.
Quuxplusone commented 8 years ago

I generally do not favor approaches that default to miscompiling programs.

I assume you've actually considered the alternatives to a black list. Could you provide the data on them? For example, suppose you had to opt in with an attribute, but that attribute automatically covered subclasses; how many classes in Chromium would you actually have to mark?

Quuxplusone commented 8 years ago
> I generally do not favor approaches that default to miscompiling programs.

The -fwhole-program-vtables flag already has clear guidelines around its use
(admittedly, they may need to be expanded after this change). A program that
does not dynamically link anything other than the standard library (and
remember, the C++ standard says nothing about dynamic linking) would work just
fine with -fwhole-program-vtables.

> I assume you've actually considered the alternatives to a black list.  Could
you provide the data on them?  For example, suppose you had to opt in with an
attribute, but that attribute automatically covered subclasses; how many
classes in Chromium would you actually have to mark?

Note that it's not just a matter of the number of classes. Chromium
incorporates a large number of third-party projects [1] that wouldn't
necessarily want these annotations littering their code base (and for good
reason -- the annotation probably wouldn't be appropriate for those projects).
Even if we maintain the annotations in a chromium-specific fork, that would be
a significant maintenance burden for when we need to roll a project (for this
sort of reason we generally try to stay as close to upstream as possible), and
it would be easy to accidentally miss a class.

Even ignoring third-party libraries, the Chromium codebase is constantly
growing (for better or for worse), and Chromium developers would need to
remember to add an annotation for each new class. Let's say we keep the rule
simple, and declare that all classes without bases must receive the annotation.
I measured the number of base-free classes by grepping for 'class [A-Za-z0-9]*
{' in each of the Chromium 44, 45 and 46 code bases, excluding the third_party
directory. The numbers were 9590, 9917, 10183. Since Chromium releases happen
about every 6 weeks, that's about 50 new classes per week.

By contrast, we've found that the blacklisting approach is much less intrusive,
right >99% of the time, only a minority of developers need to concern
themselves with it, and we can easily catch errors with a good testing
infrastructure.

[1] https://chromium.googlesource.com/chromium/src/+/master/third_party/
Quuxplusone commented 8 years ago

The important number is how many of those classes have virtual functions.

You don't really even need class-specific annotations, necessarily; you could whitelist by namespace.

Quuxplusone commented 8 years ago

The important number is how many of those classes have virtual functions.

Yes, but if we're asking >100 developers to care about this, the rule would need to be as simple as possible to reduce the cognitive burden. I suppose an alternative rule would be "if I use the 'virtual' keyword in the class definition, I need to annotate", but that would be harder to verify visually.

You don't really even need class-specific annotations, necessarily; you could whitelist by namespace.

That would require annotating every namespace declaration in every header file, so it doesn't really buy us much in terms of intrusiveness.

Quuxplusone commented 8 years ago

Your blacklist approach requires you to pass the blacklist to every translation unit; providing a whitelist the same way cannot actually be more difficult.

Quuxplusone commented 8 years ago

So I suppose we're agreed that the annotations would need to be defined outside the source. Let me run some experiments and see how practical it would be to use a namespace-based whitelist.

Quuxplusone commented 8 years ago

Yes, I would be comfortable with an out-of-source whitelist. I would like the implementation to allow explicit opt-in and opt-out in source code, of course, but that doesn't need to be the primary source of information if it's impractical due to scale.

Quuxplusone commented 8 years ago

Another thing: would you be comfortable with a warning (off by default) for classes declared in non-system header files that are unannotated at the source level and end up receiving the system ABI? This would be ideal for Chromium and other projects where the scope of the ABI break really ought to be roughly "everything except std". I reckon we'd probably also want to use the whitelist to control where control-flow integrity is used, so there'd be security implications to letting something slip through the cracks.

Quuxplusone commented 8 years ago

Sure, that seems reasonable to me.

Quuxplusone commented 8 years ago

Based on my experiments I believe that the whitelist idea can be made to work for Chromium (to validate this idea, I had to actually implement the semantic analysis side of this and construct a whitelist for Chromium; it contains 226 entries, which isn't great, but is somewhat reasonably maintainable, especially in combination with the warning I mentioned). I've sent out a more detailed proposal to the mailing list:

http://lists.llvm.org/pipermail/llvm-dev/2016-March/096363.html

Quuxplusone commented 3 years ago

Leonard has been working on this for Fuchsia.