dedup_instructions only dedups leaf functions

eseidel commented 1 month ago

I'm testing on 3.4.3, so it's possible this has been changed on main, but I doubt it.

Dart SDK version: 3.4.3 (stable) (Tue Jun 4 19:51:39 2024 +0000) on "macos_arm64"

--dedup_instructions, which is on by default in Release/Product builds attempts to reduce Instructions duplication in the compiled snapshot to save space. It seems to work for leaf functions, but it fails with any non-leaf functions.

This comes up for Shorebird, because duplicate instruction blobs end up confusing our "linker" since it's hard to know just from the instructions (and what it calls) why it's different from another copy within the binary, or why any given caller should call one vs the other. We can probably fix our logic her to be more "caller aware" or find other ways to disambiguate these otherwise identical call trees, but it feels like the better solution is to help upstream's dedup_instructions to be smarter.

The root of the bug is the implementation of dedup. The DedupInstructionsVisitor attempts to dedup "Instructions", but it does so only by comparing the bytes contents of the Instructions objects: https://github.com/dart-lang/sdk/blob/82ffb7d4c8d001b88925778003d76d895be6bb3d/runtime/vm/program_visitor.cc#L1254 https://github.com/dart-lang/sdk/blob/82ffb7d4c8d001b88925778003d76d895be6bb3d/runtime/vm/program_visitor.cc#L331 https://github.com/dart-lang/sdk/blob/82ffb7d4c8d001b88925778003d76d895be6bb3d/runtime/vm/object.h#L5877

Which means that if you ever have a function which all it does is call another function, that "middle man" function will never be caught by this case, since the relative offset for the branch will differ.

Shorebird wrote some smarter "subgraph" walking logic to compute "hashes" for subtrees of instructions (where we look at the target of the branch, ignoring the actual bytes of the offset) which correctly sees these subtrees as duplicates. We'd be happy to work with you to upstream said logic. Our "SubgraphHasher" logic as we call it, is not currently integrated into this deduper, but we might try such.

This "middleman" function is actually quite common it seems. As an example from looking at a recent customer's debug logs we see:

  41  • [Optimized] dispose (edd87fc3)
  25  • [Optimized] _updateTicker (478430d7)
  23  • [Optimized] isSupported (b0671ddd)
  19  • [Optimized] visit (7cde6391)
  17  • [Optimized] deactivate (c4223caf)
  14  • [Optimized] initState (5b6a5964)
  14  • [Optimized] build (fb9befaf)
  13  • [Optimized] toJson (7b3c1960)

Does it matter that the snapshot carries 41 identical copies of "dispose" (all of which call the same underlying function, just using a different relative branch), probably not. But there probably is some non-trivial snapshot size savings available here.

Here is an example dump of the analyze_snapshot output (we've modified analyze_snapshot slightly in our fork, but it's the same general idea) of an example duplicate (ignore the noops in the disassembly dump, that's a hack we have for the moment in our code). You can see all of these just call the same sub function (even just one copy of such!) just at different relative offsets:

[  
  {
    "name": "[Optimized] computeDryLayout",
    "index_in_entries": 10221,
    "offset": 2440348,
    "size": 56,
    "disassembly": [
      "stp fp, lr, [sp, #-16]!",
      "mov fp, sp",
      "ldr tmp, [thr, #56]",
      "cmp sp, tmp",
      "bls +32",
      "add r3, pp, #0x3e000",
      "ldr r3, [r3, #3144]",
      "nop",
      "bl +24",
      "mov sp, fp",
      "ldp fp, lr, [sp], #16 !",
      "ret",
      "bl +10243264",
      "b -32"
    ]
  },
  {
    "name": "[Optimized] computeDryLayout",
    "index_in_entries": 10223,
    "offset": 2440728,
    "size": 56,
    "disassembly": [
      "stp fp, lr, [sp, #-16]!",
      "mov fp, sp",
      "ldr tmp, [thr, #56]",
      "cmp sp, tmp",
      "bls +32",
      "add r3, pp, #0x3e000",
      "ldr r3, [r3, #3144]",
      "nop",
      "bl -356",
      "mov sp, fp",
      "ldp fp, lr, [sp], #16 !",
      "ret",
      "bl +10242884",
      "b -32"
    ]
  },
  {
    "name": "[Optimized] computeDryLayout",
    "index_in_entries": 10224,
    "offset": 2440784,
    "size": 56,
    "disassembly": [
      "stp fp, lr, [sp, #-16]!",
      "mov fp, sp",
      "ldr tmp, [thr, #56]",
      "cmp sp, tmp",
      "bls +32",
      "add r3, pp, #0x3e000",
      "ldr r3, [r3, #3144]",
      "nop",
      "bl -412",
      "mov sp, fp",
      "ldp fp, lr, [sp], #16 !",
      "ret",
      "bl +10242828",
      "b -32"
    ]
  }
]

I still don't understand enough about the Dart compiler flow to understand why so many copies of these are being created? For example there is only ever one source copy of _updateTicker in all of Flutter (it's a 3 line function), but 25 compiled copies get created in this. I'm not sure why the compiler doesn't know how to canonicalize as it's going and prevent re-compiling a new copy every time. 🤷

dart-github-bot commented 1 month ago

Summary: The --dedup_instructions flag, which is enabled by default in release builds, is only deduplicating leaf functions, not non-leaf functions. This results in duplicate instruction blobs for functions that simply call other functions, which can confuse the Shorebird linker. The issue is caused by the deduplication logic only comparing the byte contents of instructions, not considering the function call targets.

eseidel commented 1 month ago

You can repro this for yourself with just flutter create and analyze_snapshot (which in upstream dart I believe is only built for linux/android, although happy to work with someone to upstream our changes to have it built everywhere and support macho files too).

You'll notice when you run analyze_snapshot on binary built from the unmodified flutter create you will see 2 identical copies of computeDryLayout and 2 identical copies of updateTicker, differing only in the relative offset needed in the branch instruction.

eseidel commented 1 month ago

We have a whole "SubgraphHasher" infrastructure (which I'm happy to share/upstream), which knows how to walk starting at a given Instructions payload_start and follow all branches (looking up the equivalent Code objects). However that requires an InstructionsTable (as it's currently written) to be able to look up from a given PC to the start/stop of the Instructions surrounding it. I don't know how to get the equivalent information from within DedupInstructions (since I don't think "InstructionsTable" is runtime-only concept and does not exist during compile?). But if someone wanted to point me as to what I'd use, SubgraphHasher already has all those calls abstracted and it would be very easy for us to plug it in here and just keep "subgraph hashes" for all Instructions objects and use those as both the Equals and Hash values while doing dedup.

eseidel commented 1 month ago

It's possible that the correct fix isn't on the dedup side, but rather on whatever side of the compiler which is making so many copies of these little functions? Presumably lots of copies are created during compilation and then assumed to be deduped later? But for example for _updateTicker (a common occurrence in reports I've seen), there is exactly one copy of that function in the source:

  void _updateTicker() {
    if (_ticker != null) {
      _ticker!.muted = !_tickerModeNotifier!.value;
    }
  }

that may end up compiled many many times into separate instruction blobs. In this case it doesn't matter a ton since the function is so short, but some of these "middleman" functions might not end up being so short?

rmacnak-google commented 1 month ago

When instruction deduplication was added, all calls between functions were indirect, so a simple bytewise comparison handled everything. Later, "bare instructions mode" made static calls into direct pc-relative calls. As you describe, this causes only leaf functions to be deduplicated. My expectation has been that making the dedupper smarter to recognize different pc-releative-offsets to the same target as deduppable would not result in much additional dedupping, but I never implemented it to measure.

why so many copies of these are being created?

This function is defined in a mixin. Dart's implementation of mixins is based on copying members into mixin applications. (Unlike Strongtalk's implementation of mixins, where mixin applications point to mixins. In a copy-down implementation, super calls are static calls. In a shared implementation, super calls are dynamic calls.)

a-siva commented 1 month ago

The general sentiment seems to be that we may not see much more reduction in code size by expanding the de-duplication to non leaf functions.

felangel commented 1 month ago

The general sentiment seems to be that we may not see much more reduction in code size by expanding the de-duplication to non leaf functions.

Thanks for the update! For what it's worth, we've observed ~500 collisions in larger Flutter apps where a collision is defined as 2 or more duplicate code objects.

Some examples include:

42 copies of dispose
25 copies of _updateTicker
15 copies of initState
13 copies of toJson
8 copies of computeMinIntrinsicHeight
8 copies of computeMaxIntrinsicWidth

felangel commented 1 month ago

@rmacnak-google @a-siva I was trying to make a pure Dart reproduction of this and had a hard time. If you have any tips/suggestions for what a test case would look like, I'd really appreciate it since I've only be able to reproduce this in the context of a Flutter application, thanks! 🙏

rmacnak-google commented 1 month ago

int foo = 0;

mixin M {
  @pragma("vm:never-inline") stem1() => stem2();
  @pragma("vm:never-inline") stem2() => stem3();
  @pragma("vm:never-inline") stem3() => stem4();
  @pragma("vm:never-inline") stem4() => stem5();
  @pragma("vm:never-inline") stem5() => leaf();
  @pragma("vm:never-inline") leaf() => foo++;
}

class S1 {}
class S2 {}
class S3 {}
class S4 {}
class S5 {}

class MA1 extends S1 with M {}
class MA2 extends S2 with M {}
class MA3 extends S3 with M {}
class MA4 extends S4 with M {}
class MA5 extends S5 with M {}

@pragma("vm:never-inline")
bar(M m) => m.stem1();

main() {
  bar(new MA1());
  bar(new MA2());
  bar(new MA3());
  bar(new MA4());
  bar(new MA5());
}

results in 5 copies of the stem functions that are not dedup'd, but the 5 copied of the leaf function are dedup'd.

felangel commented 1 month ago

int foo = 0;

mixin M {
  @pragma("vm:never-inline") stem1() => stem2();
  @pragma("vm:never-inline") stem2() => stem3();
  @pragma("vm:never-inline") stem3() => stem4();
  @pragma("vm:never-inline") stem4() => stem5();
  @pragma("vm:never-inline") stem5() => leaf();
  @pragma("vm:never-inline") leaf() => foo++;
}

class S1 {}
class S2 {}
class S3 {}
class S4 {}
class S5 {}

class MA1 extends S1 with M {}
class MA2 extends S2 with M {}
class MA3 extends S3 with M {}
class MA4 extends S4 with M {}
class MA5 extends S5 with M {}

@pragma("vm:never-inline")
bar(M m) => m.stem1();

main() {
  bar(new MA1());
  bar(new MA2());
  bar(new MA3());
  bar(new MA4());
  bar(new MA5());
}

results in 5 copies of the stem functions that are not dedup'd, but the 5 copied of the leaf function are dedup'd.

Thanks!

eseidel commented 1 month ago

We've found a common cause of this. package:intl generates lots of methods "m1", "m2" etc. Which end up in this case due to use of formatting strings. I don't have concrete numbers as to how much snapshot size this causes. 🤷🏼‍♂️

mraleph commented 1 month ago

@rmacnak-google

When instruction deduplication was added, all calls between functions were indirect, so a simple bytewise comparison handled everything. Later, "bare instructions mode" made static calls into direct pc-relative calls. As you describe, this causes only leaf functions to be deduplicated. My expectation has been that making the dedupper smarter to recognize different pc-releative-offsets to the same target as deduppable would not result in much additional dedupping, but I never implemented it to measure.

I am fairly certain that this is not related to bare instructions - unrelocated relative calls are all using the same (0) distance and thus are all bytewise equal. See assembler. Actual relocation will happen after deduplication.

The core of the issue is not how we compare instructions but how we compare / deduplicate static calls: we should process functions using DFS over call graph (so that callees are processed before callers) and relink static_calls_target_table as we go and dedup it once all callees are processed. This would not handle recursive functions (they require more sophisticated comparison algorithm), but would handle other cases.

eseidel commented 4 weeks ago

FWIW, we've added some logging to our tooling to report how much of a snapshot size increase this causes. So far it's been minimal in examples we've seen (e.g. ~0.2%), but we aren't collecting data from the wild, just from the few snapshots which get sent into us for other debugging purposes. If we see a big jump we'll post here.

Mostly this has been top of mind because we're having to write some code to work around the duplicates. (We used to always ignore all duplicates in our system, but that's not good enough, so now we're writing some special handling to figure out which ones are used by what.). It's all fine, we've had to write lots of other code in the Dart VM to work around other quirks of Dart which are only a problem for Shorebird. :)

dart-lang / sdk

dedup_instructions only dedups leaf functions #56262