davidlattimore / wild


Wild performance with debug info support #117

Open marxin opened 2 weeks ago

marxin commented 2 weeks ago

Now that the recent support for DWARF has been added, we're falling behind the mold linker when it comes to linking something huge (like clang):

❯ bloaty ../../../../bin/clang-20
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  74.5%  3.31Gi   0.0%       0    .debug_info
   9.4%   427Mi   0.0%       0    .debug_loclists
   5.1%   231Mi   0.0%       0    .debug_str
   4.9%   220Mi   0.0%       0    .debug_line
   2.0%  91.3Mi  61.6%  91.3Mi    .text
   1.7%  76.4Mi   0.0%       0    .debug_rnglists
   0.9%  41.5Mi  28.0%  41.5Mi    .rodata
   0.5%  23.6Mi   0.0%       0    .debug_abbrev
   0.5%  21.7Mi   0.0%       0    .strtab
   0.2%  9.01Mi   6.1%  9.01Mi    .eh_frame
   0.1%  4.89Mi   0.0%       0    .symtab
   0.1%  4.48Mi   3.0%  4.48Mi    .data
   0.1%  3.64Mi   0.0%       0    .debug_aranges
   0.0%  1.09Mi   0.7%  1.09Mi    .eh_frame_hdr
   0.0%       0   0.5%   704Ki    .bss
   0.0%   365Ki   0.0%       0    .debug_line_str
   0.0%  43.4Ki   0.0%  43.4Ki    .rela.dyn
   0.0%  14.8Ki   0.0%  14.4Ki    [17 Others]
   0.0%  8.21Ki   0.0%  8.21Ki    .dynstr
   0.0%  7.62Ki   0.0%  7.62Ki    .dynsym
   0.0%  5.77Ki   0.0%  5.77Ki    .got
 100.0%  4.44Gi 100.0%   148Mi    TOTAL

> wild ...
┌───    2.43 Open input files
├───    1.62 Split archives
├───    3.53 Parse input files
│ ┌───    2.76 Read symbols
│ ├───    7.66 Populate symbol map
├─┴─   11.47 Build symbol DB
│ ┌─── 1719.15 Resolve symbols
│ ├───    1.25 Assign section IDs
│ ├─── 2206.91 Merge strings
│ ├───    0.04 Canonicalise undefined symbols
│ ├───    6.99 Resolve alternative symbol definitions
├─┴─ 3934.42 Symbol resolution
│ ┌───   28.55 Find required sections
│ ├───    0.00 Merge dynamic symbol definitions
│ ├───    2.53 Finalise per-object sizes
│ ├───    0.02 Apply non-addressable indexes
│ ├───    0.15 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.30 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───    5.93 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─   37.67 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    0.66 Split output buffers by group
│ ├─┴─ 2962.53 Write data to file
│ ├───    3.33 Sort .eh_frame_hdr
├─┴─ 2965.95 Write output file
│ ┌───  180.89 Unmap output file
│ ├───   22.78 Drop layout
│ ├───    0.43 Drop symbol DB
│ ├───  243.71 Drop input data
├─┴─  447.86 Shutdown
└─ 7513.08 Link

real    0m7.684s

Mold runs for:

real    0m3.053s
user    0m0.024s
sys 0m0.008s

As seen above, a huge part of the link time is spent in the single-threaded Merge strings phase, where we have to deal with the following huge section:

2024-09-10T08:42:31.369525Z DEBUG Link:Symbol resolution:Merge strings: wild_lib::resolution: merge_strings section=.debug_str size=242491445 sec.totally_added=6249233166 strings=2600883

That is 231 MiB in the output, while the total input size before deduplication is about 6 GiB.
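For reference, the merge step is conceptually just deduplication of NUL-terminated strings into one output section. A minimal single-threaded sketch (illustrative names, and owned keys for brevity; not wild's actual code) looks something like this:

```rust
use std::collections::HashMap;

/// Deduplicate NUL-terminated strings from all input `.debug_str` sections
/// into one output section, recording the output offset of each unique
/// string. The real linker avoids copying the keys; this sketch does not.
fn merge_strings(input_sections: &[&[u8]]) -> (Vec<u8>, HashMap<Vec<u8>, u64>) {
    let mut output = Vec::new();
    let mut offsets: HashMap<Vec<u8>, u64> = HashMap::new();
    for section in input_sections {
        // `split_inclusive` keeps the trailing NUL with each string.
        for string in section.split_inclusive(|b| *b == 0) {
            if !offsets.contains_key(string) {
                offsets.insert(string.to_vec(), output.len() as u64);
                output.extend_from_slice(string);
            }
        }
    }
    (output, offsets)
}
```

With millions of strings and ~6 GiB of input to scan, a single hash map and a single writer thread become the bottleneck.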

Related to: #37.

marxin commented 2 weeks ago

If I invoke Mold with -Wl,--compress-debug-sections=zstd, the linker runs for about 4.1s and the output binary shrinks to:

❯ bloaty ../../../../bin/clang-20
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  78.7%  1.61Gi   0.0%       0    .debug_info
   5.4%   112Mi   0.0%       0    .debug_loclists
   4.6%  95.8Mi  59.8%  95.8Mi    .text
   2.6%  54.5Mi   0.0%       0    .debug_str
   2.5%  53.1Mi   0.0%       0    .debug_line
   1.8%  37.9Mi  23.7%  37.9Mi    .rodata
   1.2%  25.3Mi   0.0%       0    .strtab
   1.2%  24.7Mi   0.0%       0    .debug_rnglists
   0.5%  9.72Mi   6.1%  9.72Mi    .eh_frame
   0.4%  8.50Mi   0.0%       0    .symtab
   0.3%  6.00Mi   0.0%       0    .debug_abbrev
   0.2%  4.49Mi   2.8%  4.49Mi    .dynstr
   0.2%  4.22Mi   2.6%  4.22Mi    .data.rel.ro
   0.1%  1.81Mi   1.1%  1.81Mi    .rodata.str1.1
   0.1%  1.54Mi   1.0%  1.54Mi    .rodata.str1.8
   0.1%  1.30Mi   0.8%  1.30Mi    .dynsym
   0.1%  1.24Mi   0.8%  1.24Mi    .eh_frame_hdr
   0.1%  1.06Mi   0.6%  1015Ki    [33 Others]
   0.0%   839Ki   0.0%       0    .debug_aranges
   0.0%       0   0.4%   715Ki    .bss
   0.0%   444Ki   0.3%   444Ki    .hash
 100.0%  2.04Gi 100.0%   160Mi    TOTAL
davidlattimore commented 2 weeks ago

We discussed (via a video call) the possibility of splitting string merge sections into buckets based on part of the hash of the string. We could then build a hashmap per bucket and also do writing per bucket. One issue is that in order to ensure deterministic behaviour, the strings in the bucket would need to be sorted prior to assigning indexes to the strings. Another potential issue is that it might not play nicely with caches, since adjacent strings from the input files will likely end up in different buckets. Even if they don't, they're not likely to get written in order.
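A rough sketch of that bucketing idea (the names, bucket count and hash choice here are illustrative, not an actual implementation):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_BUCKETS: usize = 64;

/// Pick a bucket from the string's hash. Identical strings always land in the
/// same bucket, so each bucket can be deduplicated and written out
/// independently, e.g. one bucket per worker thread.
fn bucket_of(string: &[u8]) -> usize {
    let mut hasher = DefaultHasher::new();
    string.hash(&mut hasher);
    (hasher.finish() as usize) % NUM_BUCKETS
}

/// Partition all input strings into buckets, then deduplicate each bucket.
/// Sorting within a bucket before assigning output offsets keeps the layout
/// deterministic regardless of the order the input files were processed in.
fn merge_by_bucket<'data>(
    strings: impl Iterator<Item = &'data [u8]>,
) -> Vec<Vec<&'data [u8]>> {
    let mut buckets: Vec<Vec<&'data [u8]>> = vec![Vec::new(); NUM_BUCKETS];
    for s in strings {
        buckets[bucket_of(s)].push(s);
    }
    for bucket in &mut buckets {
        bucket.sort_unstable();
        bucket.dedup();
    }
    buckets
}
```

The per-bucket sort is what restores determinism; the hashing is what scatters adjacent input strings across buckets, which is the cache-behaviour concern mentioned above.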

Another possibility is to use a concurrent hashmap. We can't allocate offsets to merged strings like we do now when we insert into the merged-strings hashmap, because that would result in non-determinism. We could instead store, as the value, which input file was selected to "own" that string. Later, when another file is found to have the same string, it might or might not replace the previous "owner" of the string, depending on some deterministic selection rule. This has the advantage that a file can then write all the strings that it owns in the order in which they appear in the input file, which is likely to give better cache performance. If a file owns multiple adjacent strings, then it could potentially write them all with a single memcpy.
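A minimal sketch of that ownership scheme, assuming a dashmap-style concurrent map (the types and the tie-break rule are illustrative; all that matters is that the rule is deterministic):

```rust
use dashmap::DashMap;

/// One occurrence of a string: the input file it came from and its offset
/// within that file's merge section. (Illustrative types only.)
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Occurrence {
    file_index: u32,
    offset_in_section: u64,
}

/// All files insert their strings concurrently. When several files contain
/// the same string, the occurrence with the lowest (file_index, offset) wins,
/// so the final owner of every string is deterministic no matter how the
/// threads interleave.
fn claim_string<'data>(
    map: &DashMap<&'data [u8], Occurrence>,
    string: &'data [u8],
    occurrence: Occurrence,
) {
    map.entry(string)
        .and_modify(|existing| {
            if occurrence < *existing {
                *existing = occurrence;
            }
        })
        .or_insert(occurrence);
}
```

After every file has inserted its strings, each file scans its own strings in input order, writes the ones it owns and assigns their offsets; runs of adjacent owned strings could then be copied with a single memcpy, as described above.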

marxin commented 1 week ago

After the recent changes, we get to:

┌───    1.82 Open input files
├───    1.13 Split archives
├───    3.78 Parse input files
│ ┌───    2.90 Read symbols
│ ├───    4.71 Populate symbol map
├─┴─    8.80 Build symbol DB
│ ┌─── 1238.77 Resolve symbols
│ ├───    1.81 Assign section IDs
│ ├───  659.97 Merge strings
│ ├───    0.03 Canonicalise undefined symbols
│ ├───    6.73 Resolve alternative symbol definitions
├─┴─ 1907.36 Symbol resolution
│ ┌───   46.31 Find required sections
│ ├───    0.00 Merge dynamic symbol definitions
│ ├───    2.15 Finalise per-object sizes
│ ├───    0.03 Apply non-addressable indexes
│ ├───    0.17 Compute total section sizes
│ ├───    0.00 Compute segment layouts
│ ├───    0.00 Compute per-alignment offsets
│ ├───    0.35 Compute per-group start offsets
│ ├───    0.00 Compute merged string section start addresses
│ ├───    5.78 Assign symbol addresses
│ ├───    0.00 Update dynamic symbol resolutions
├─┴─   55.12 Layout
│ ┌───    0.00 Wait for output file creation
│ │ ┌───    1.21 Split output buffers by group
│ ├─┴─ 2265.57 Write data to file
│ ├───    3.67 Sort .eh_frame_hdr
├─┴─ 2269.36 Write output file
│ ┌───  158.30 Unmap output file
│ ├───   19.11 Drop layout
│ ├───    0.27 Drop symbol DB
│ ├───  134.21 Drop input data
├─┴─  311.95 Shutdown
└─ 4581.64 Link

real    0m4.746s
user    1m12.046s
sys 0m13.231s
davidlattimore commented 1 week ago

I assume there's still plenty of opportunity for further optimisation with writing of merged strings? I'd also be interested to try the concurrent hashmap approach that I mentioned above at some point and see how it compares.

I should probably do a debug build of clang myself so that I can run this benchmark too.

marxin commented 1 week ago

I assume there's still plenty of opportunity for further optimisation with writing of merged strings?

I hope there's still room for improvement :)

I'd also be interested to try the concurrent hashmap approach that I mentioned above at some point and see how it compares.

I am really curious whether it can help us. We might want to start filling up the concurrent hash map as early as possible, ideally in the find_required_sections function. I was also thinking about storing a HashMap<u64, u64> in MergeStringsFileSection that would keep a mapping from an offset within a MergeStringsFileSection to the offset in the final merged string section. Having that map, we wouldn't have to build the StringToMerge instances in get_merged_string_output_address.
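Something along these lines (the field and method names here are only a sketch of the idea, not the actual wild structures):

```rust
use std::collections::HashMap;

/// Illustrative stand-in for MergeStringsFileSection, extended with a per-file
/// map from input offset to output offset, so relocations against the merged
/// section can be resolved without re-hashing the string bytes.
struct MergeStringsFileSection {
    /// Offset of a string within this input section -> offset of the same
    /// string within the final merged output section.
    offset_map: HashMap<u64, u64>,
}

impl MergeStringsFileSection {
    /// What get_merged_string_output_address could reduce to: a plain lookup
    /// instead of rebuilding a StringToMerge and hashing the string again.
    fn merged_output_offset(&self, input_offset: u64) -> Option<u64> {
        self.offset_map.get(&input_offset).copied()
    }
}
```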

I should probably do a debug build of clang myself so that I can run this benchmark too.

It's pretty simple, just clone the repository and run:

mkdir build
cd build
cmake ../llvm -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_PROJECTS=clang
make clang -j123
davidlattimore commented 1 week ago

Wow, debug info really makes a pretty massive difference, doesn't it? With the same input files, linking clang on my laptop with --strip-debug takes 300 ms. Without --strip-debug it takes 16 to 19 seconds. I even saw some runs take as long as 27 seconds. I'm not sure why the variation is so large. Profiles seem to show that during writing there are periods of up to a second when none of the threads appear to be doing anything. Peak RSS seems to be around 24G, which is a lot, but still well within the 42G of RAM in my laptop.

I was also thinking about storing a HashMap<u64, u64> in MergeStringsFileSection that would keep a mapping from an offset within a MergeStringsFileSection to the offset in the final merged string section. Having that map, we wouldn't have to build the StringToMerge instances in get_merged_string_output_address.

That sounds like it could help quite a bit. Certainly get_merged_string_output_address is showing up as a pretty substantial proportion of the runtime.

I'll hold off making any optimisations in this area while you are, so as to avoid merge conflicts.

marxin commented 1 week ago

Wow, debug info really makes a pretty massive difference, doesn't it? With the same input files, linking clang on my laptop with --strip-debug takes 300 ms. Without --strip-debug it takes 16 to 19 seconds.

Oh yeah, debug info tends to be pretty large and brings a very large number of relocations plus a huge pool of strings in the string-merge section. The described delay might be related to disk congestion, where you can't effectively utilize your CPU, right?

That sounds like it could help quite a bit. Certainly get_merged_string_output_address is showing up as a pretty substantial proportion of the runtime.

I'll hold off making any optimisations in this area while you are, so as to avoid merge conflicts.

Feel free to do so. I've tried a couple of experiments and none of them seemed promising :) Right now, I don't have any temporary patchset that I would need to rebase.

davidlattimore commented 1 week ago

I added a cache for merged string lookups - somewhat similar to what you described. It did indeed improve performance. Link time on my laptop for a debug build of clang reduced by about 39%. I'm going to experiment with a concurrent hashmap to see if we can reduce the time further - that's a more drastic change though.

marxin commented 1 week ago

Oh, great, the cache basically aligns with my idea. I can confirm the improvement on my machine:

❯ hyperfine 'g++ @cmd -B ~/Programming/wild' 'g++ @cmd -B /tmp/'
Benchmark 1: g++ @cmd -B ~/Programming/wild
  Time (mean ± σ):      4.626 s ±  0.242 s    [User: 72.387 s, System: 12.711 s]
  Range (min … max):    4.288 s …  4.988 s    10 runs

Benchmark 2: g++ @cmd -B /tmp/
  Time (mean ± σ):      3.650 s ±  0.118 s    [User: 36.700 s, System: 26.740 s]
  Range (min … max):    3.445 s …  3.811 s    10 runs

Summary
  g++ @cmd -B /tmp/ ran
    1.27 ± 0.08 times faster than g++ @cmd -B ~/Programming/wild

The latter one is the current main with the caching. I'm really curious about the concurrent hashmap.

davidlattimore commented 6 days ago

I got things working with the concurrent hashmap. I used dashmap. It was a pretty substantial change that added a moderate amount of complexity. I would have been OK with that if it had sped things up, but it ended up slowing string merging down moderately, so I'm going to give up on that approach. I've got more ideas that I'd like to try out; however, I should probably prepare my talk for gosim 2024 first, so it might be a while before I get back to this.

marxin commented 5 days ago

Quite interesting results. Would you mind sharing the git branch with the concurrent hashmap?

davidlattimore commented 5 days ago

Sure, I've uploaded it here: https://github.com/davidlattimore/wild/compare/string-merge-concurrent-hashmap?expand=1

marxin commented 1 day ago

I've just measured the performance for a Clang compiler build (with debug info), and Wild after 54adc37cea73291dc817efa0b3ef96b9e3a9942c is about 15% slower than it was before the revision that added the caching for the string-merge section (0282c13a8afb14cf86ebd2d0632483e7670b87b8).

What shall we do, @davidlattimore?

davidlattimore commented 1 day ago

Thanks for noticing that. You're right. I've removed the string-merge caching and saw a speedup. I think the speedup I got on my laptop was closer to 5%. I guess we were getting good performance before because we had such a good cache hit rate, and we had a good cache hit rate because there were lots of cache hits that shouldn't have been cache hits.

We're possibly still a little bit slower than we were before the caching was introduced. At first I thought this might be due to my change to resolve sections separately from resolving symbols, but now I don't think that's actually the case.

Looking at the output of samply, I'm seeing a period of time when all the threads are idle except the one thread that's deleting the old output file. It wasn't like this before. I think what's happening, is symbol resolution tries to spawn N tasks, where N is the size of our thread pool. One of those tasks can't start because there's a thread being used for file deletion. I'm unsure how best to fix that. A few ideas are: