Consider using mergeable section for filenames in coverage mapping

Quuxplusone commented 3 years ago


Bugzilla Link	PR49155
Status	NEW
Importance	P enhancement
Reported by	Petr Hosek (phosek@chromium.org)
Reported on	2021-02-12 00:38:41 -0800
Last modified on	2021-02-22 22:53:39 -0800
Version	trunk
Hardware	PC All
CC	davidxl@google.com, gulfem@google.com, htmldeveloper@gmail.com, i@maskray.me, llvm-bugs@lists.llvm.org, rnk@google.com, roland@hack.frob.com, vk@vedantk.com, vsk@apple.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

In the current coverage mapping format v5, we store mapping to together with the list of filename in the __llvm_covmap section. Since v4, the list of filenames can be also optionally zlib compressed.

The current format has one downside: there can be duplicate filenames in the list of filenames across translation units, in C/C++ this is especially likely for headers, but linker cannot deduplicate these.

This could be improved by adopting a solution similar to DWARF: we could store filenames in a separate mergeable strings section (SHF_MERGE|SHF_STRINGS in ELF), for example __llvm_covstr. This would allow linker to deduplicate strings across translation units. We would also need to modify the coverage mapping to use offsets into the __llvm_covstr section instead of indices.

The only downside of this approach is that we would be no longer able to compress the filenames. It's possible that the size reduction due to deduplication is going to be greater than the one from compression. This is something we should investigate and experiment with.

We could also consider teaching linkers how to automatically compress output sections as a generalization of the approach currently used for DWARF .debug_ sections (for example, when all input sections have the SHF_COMPRESSED flag, the linker could automatically compress the output section).

Quuxplusone commented 3 years ago

We use a similar strategy for coverage function records (see https://reviews.llvm.org/D69471).

Filenames aren't stored in a space-efficient way in the coverage mapping today. However they're relatively small compared to the name, counter, and source region data.

Introducing a more general linker feature for compressing sections sounds interesting: this might lead to greater space savings (and not just for _covmap sections as you point out).

Quuxplusone commented 3 years ago

I've implemented a quick-and-dirty prototype to understand the impact of this change. For llc, the size of the filenames list went from 44M (with compression) to 250K (without compression). It looks like using this approach is a clear win even if we loose compression.

Since we're going to be changing the coverage mapping format version as part of https://reviews.llvm.org/D95753, do you think it'd be worth productionizing this as part of the same change to we avoid changing the format twice (similarly to what was done https://reviews.llvm.org/D69471)?

Quuxplusone commented 3 years ago

Thanks for running the experiment, the savings are more substantial than I expected.

It looks like it's worth it to pursue dedup for filenames.

I'd prefer that the format changes be split up: I feel it'd be easier to review and the cost is just one more bincompat test.

Quuxplusone commented 3 years ago

It turned out that I've made a mistake in my original measurement, the original size is 2.2M (not 44M), and the with the change the size is still 250K, so the ratio is roughly 10x. I think it's still worth pursuing although the improvement it's not as substantial as originally reported.

Quuxplusone commented 3 years ago

Driver -gz passes --compress-debug-sections=zlib to cc1 and linker.
The linker does not automatically turn on --compress-debug-sections=zlib when
all input sections with the name have the SHF_COMPRESSED flag, so I think the
linker should not do automatic compression.

An option like --compress-sections=__llvm_covstr=zlib or --compress-nonalloc-
sections=__llvm_covstr=zlib (with glob patterns to allow '.debug_*=zlib')
probably makes sense. I've created
https://sourceware.org/bugzilla/show_bug.cgi?id=27452 as a placeholder feature
request:) You may start a thread on binutils@sourceware.org linking to that
issue as more folks read the mailing list.

Quuxplusone commented 3 years ago

Thanks for the suggestion, I like the idea of --compress-sections option because it also allows introducing support for different compression algorithms which is something I'd also like to experiment with in the future.

Quuxplusone commented 3 years ago

Thanks, this is a great direction. I think the lesson here is that, for C++, it's always important to optimize metadata formats to enable deduplication before thinking about making them compact. This is something that has always felt like a flaw in the design of DWARF to me after working on CodeView (I apologize if people are tired of hearing me say this).

Quuxplusone commented 3 years ago

@davidxl pointed out that the size of object files is also important for some users which is a great point.

I think we could handle that by using compressed ELF sections (i.e. SHF_COMPRESSED). lld would decompress the input files, merge strings and compress the output. We would need to extend MC to support compression of arbitrary output sections, not just debug sections, but it doesn't seem like it'd be too much effort. What's great about this approach is that the compression becomes completely transparent and could be also applied to other type of instrumentation data.

The only downside I can think is that it doesn't generalize easily to COFF or Mach-O. I'm not very familiar with those formats so we would need to consult with someone to see if similar approach could be used for those.

Quuxplusone commented 3 years ago

It would be really desirable to have a consistent compression support across platforms. We have Chrome on windows building with PGO, so not regressing them would be important.

Quuxplusone / LLVMBugzillaTest

Consider using mergeable section for filenames in coverage mapping #48124