facebookarchive / BOLT

Binary Optimization and Layout Tool - A linux command-line utility used for optimizing performance of binaries

[Question] How can I remove useless segments? #242

Closed getianao closed 2 years ago

getianao commented 2 years ago

Hi. According to #93, .bolt.org.text can be useless if the binary is linked with --emit-relocs and -lite is not specified.

But when I tried to remove the .bolt.org.text segment from my program (which contains only a simple loop) using objcopy, the stripped program stopped working. It prints: Inconsistency detected by ld.so: rtld.c: 1494: dl_main: Assertion 'GL(dl_rtld_map).l_libname' failed! How can I fix it?

Also, can you give me more advice on reducing the generated binary's code size? Thanks :)

More details about the program:

%.o: %.C
    $(CXX) $(OPT)  -c $< -o $@ $(CFLAGS)

$(BIN):$(LINKEDBIN)
    $(CXX) $(OPT)     $< -o $@ $(CFLAGS) -Wl,--emit-relocs

bolt: $(BIN)
    ${BOLT} $(BIN) -instrument -o $(BIN).ins
    ./$(BIN).ins

test:
    ${BOLT} $(BIN) -o $(BIN).bolt -data=$(FDATA_PATH)/prof.fdata -reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats --no-huge-pages  -v=1

strip:
    objcopy --remove-section .bolt.org.text  $(BIN).bolt $(BIN).bolt.strip

maksfb commented 2 years ago

My guess is that objcopy cannot properly remove allocatable sections.

Typically, a larger binary should not be a problem from the performance perspective. By default, we align code at a huge page boundary (2MB), which increases the size but results in fewer TLB misses if you combine it with huge page usage (enabled by -hugify in BOLT). If you want BOLT to generate a smaller output binary, you can try adding the -use-old-text and -no-huge-pages options (I see you are already using the latter). Watch the output: it should contain the message "BOLT-INFO: using original .text for new code with 0x1000 alignment". If it does not, then due to alignment issues BOLT couldn't place the new code in the original .text.
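
To make the "watch the output" step easier to automate, here is a minimal Python sketch that scans captured BOLT output for the message quoted above. The function name is hypothetical, and the message prefix is copied from this thread; its exact wording may differ between BOLT versions, so treat the match string as an assumption.

```python
# Message prefix as quoted in this thread (assumption: wording may vary by BOLT version).
OLD_TEXT_INFO = "BOLT-INFO: using original .text for new code"


def reused_original_text(bolt_output):
    """Return True if any line of the captured BOLT output reports that
    the new code was placed in the original .text section."""
    return any(OLD_TEXT_INFO in line for line in bolt_output.splitlines())


if __name__ == "__main__":
    sample = "BOLT-INFO: using original .text for new code with 0x1000 alignment"
    print(reused_original_text(sample))
```

The input would be whatever BOLT printed during the run, for example captured with a shell redirect or with `subprocess.run(..., capture_output=True, text=True)`.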

getianao commented 2 years ago

Thanks for your prompt response.

I tried the -use-old-text option, but got 'BOLT-WARNING: original .text too small to fit the new code using 0x1000 alignment. 8064 bytes needed, have 565 bytes available.'

Does this mean the program splits the .text segment in two to make use of the created gap, or that the old .text segment is useless and the new optimized .text segment is added at the end as a new segment?

maksfb commented 2 years ago

The latter. Since the program (.text) is smaller than the smallest alignment we use for code (4K) we couldn't fit new code. Plus we likely try to add code from other code sections like .init and .fini.
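
For anyone scripting around this warning, the numbers can be pulled out of the message with a small Python sketch. The regular expression is modeled only on the warning text quoted in this thread, and the function name is hypothetical; other BOLT versions may word the message differently.

```python
import re

# Pattern modeled on the warning quoted in this thread (assumption: format may vary).
WARNING_RE = re.compile(
    r"using (0x[0-9a-fA-F]+) alignment\. (\d+) bytes needed, have (\d+) bytes available"
)


def parse_too_small_warning(line):
    """Extract alignment, needed and available byte counts from the warning.

    Returns None if the line does not match the expected format."""
    m = WARNING_RE.search(line)
    if m is None:
        return None
    alignment = int(m.group(1), 16)
    needed = int(m.group(2))
    available = int(m.group(3))
    return {
        "alignment": alignment,
        "needed": needed,
        "available": available,
        "shortfall": needed - available,  # how many bytes are missing
    }


if __name__ == "__main__":
    msg = ("BOLT-WARNING: original .text too small to fit the new code using "
           "0x1000 alignment. 8064 bytes needed, have 565 bytes available.")
    print(parse_too_small_warning(msg))
```

For the warning above this reports an alignment of 4096 bytes and a shortfall of 7499 bytes, which matches the explanation that the original .text cannot hold the new code.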

vleonen commented 2 years ago

@maksfb @rafaelauler Currently, I have an implementation that removes .bolt.org.text and splits the old LOAD segment containing this section into two segments, then updates the offsets of later sections to decrease the output binary size. For binaries with a really big .text section, it significantly reduces the output binary size. However, this implementation looks a bit weird because it requires modifying section offsets in a number of places and requires space for an additional segment entry in the PHDR table; it also requires -lite=0 mode to be set. I'm wondering if the BOLT team has any plans or ideas on how to improve the layout of the output ELF binary, e.g. combining all RW / RO / RX sections into the corresponding LOAD segments instead of creating additional segments for the sections modified by BOLT. Maybe you can share some known issues that block such support? We would like to improve it, so we would like to align with the BOLT team on possible ideas for such an implementation.

getianao commented 2 years ago

@vleonen That's a great idea to improve the binary layout. In fact, I'm working on binary size reduction with BOLT. Could you share more details about it? I will be glad to help if you need.

maksfb commented 2 years ago

@vleonen With relocations in the binary, it should be possible to relocate most sections (as units) and regroup into segments. I'm sure it sounds easier than it will likely look in practice, but it's worth a shot. You will have to teach BOLT to process more relocations efficiently. We are not currently working on this, as the binary size is not the highest priority. Still, it will be undoubtedly a welcome addition to BOLT, especially in scenarios like @getianao mentions.

For the lite mode, we will have to make functions relocatable without fully disassembling them and storing CFG. There are a couple of caveats there as we cannot fully rely on the linker relocations, but again - it should be doable, except maybe for cases where we cannot reliably disassemble a function.

yota9 commented 2 years ago

@maksfb BTW, is there any reason to support binaries without relocations? It requires extra code, extra tests, etc., but I'm not sure there is any reason to support it, since the optimization effects are limited. What do you think?

maksfb commented 2 years ago

> @maksfb BTW, is there any reason to support binaries without relocations? It requires extra code, extra tests, etc., but I'm not sure there is any reason to support it, since the optimization effects are limited. What do you think?

For optimizing in-house applications, it does make sense to support the relocation mode only. However, there are other applications beyond optimization that make the non-relocation mode useful, e.g. being able to look at CFGs of functions in arbitrary binaries. I imagine there will be more applications for this mode as the BOLT framework for binary analysis grows.

getianao commented 2 years ago

@maksfb Hi, about the objcopy issue mentioned at the beginning: I tried enabling the -use-gnu-stack option, and then the stripped program works well.

So creating the PHDR table in a new segment may be the reason why objcopy can't work with the generated binary, is that right? And what are the downsides of enabling the -use-gnu-stack option?

getianao commented 2 years ago

@maksfb Another question is about identical code folding. If I enable the -use-old-text option and BOLT successfully places the new code in the original .text, then even though some identical functions have been folded, the size of the original .text segment doesn't change. So folding doesn't affect the binary size much; there may only be a small reduction from having fewer symbols and strings. If my understanding is correct, is there any way to change the section's size so that the useless empty space in the section can be avoided? I'm curious about this and really appreciate your patient answers :)

maksfb commented 2 years ago

> @maksfb Hi, about the objcopy issue mentioned at the beginning: I tried enabling the -use-gnu-stack option, and then the stripped program works well.
>
> So creating the PHDR table in a new segment may be the reason why objcopy can't work with the generated binary, is that right? And what are the downsides of enabling the -use-gnu-stack option?

Yes, binutils objcopy cannot deal with a program header table in the new segment. The -use-gnu-stack option reuses another entry (GNU_STACK) in the table for the new segment.
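
To see which program header entries a binary actually has (for example, whether a GNU_STACK entry exists that could be reused), the table can be listed with `readelf -l`, or with a small Python sketch like the one below. It is a rough illustration that assumes a 64-bit little-endian ELF file and names only three entry types; the function name and the file name in the usage comment are hypothetical, and this is not how BOLT itself inspects the table.

```python
import struct

# A few program header types from the ELF specification; others are shown in hex.
PT_NAMES = {1: "PT_LOAD", 6: "PT_PHDR", 0x6474E551: "PT_GNU_STACK"}


def program_header_types(elf):
    """List the p_type of every program header entry in a 64-bit
    little-endian ELF image given as bytes."""
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF image")
    # ELF64 header fields: e_phoff at byte 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", elf, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 54)
    names = []
    for i in range(e_phnum):
        # p_type is the first 4-byte field of each program header entry.
        (p_type,) = struct.unpack_from("<I", elf, e_phoff + i * e_phentsize)
        names.append(PT_NAMES.get(p_type, hex(p_type)))
    return names


# Usage with a real file (hypothetical name):
#   with open("a.out.bolt", "rb") as f:
#       print(program_header_types(f.read()))
```

Comparing the listing for a binary produced with and without -use-gnu-stack should show whether the new segment took over the GNU_STACK slot or was added as an extra entry.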