
Stabilizer: Rigorous Performance Evaluation

Would Stabilizer actually help on modern hardware? #5

Open pca006132 opened 2 years ago

pca006132 commented 2 years ago

Hi, I saw this fork of the original Stabilizer repo and am very interested in it. I wanted to know how much of a difference this makes on modern hardware with much larger caches, higher associativity, and better hardware prefetchers. Putting some figures in the readme would help others see whether this is still important nowadays and attract contributors.

Dead2 commented 2 years ago

I do not have good numbers for this, since Stabilizer currently does not entirely work.

But I think a good way to see the problem is to run the same benchmark a bunch of times and look at the average. Then recompile with a minor code change in an unrelated part of the code (like adding some code to error handling that never runs during benchmarks). If you do this with a few such changes and rounds of benchmarking, you'll likely notice that your average is sometimes very different even though there should be absolutely no change to the computational load.
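A minimal sketch of the kind of harness I mean; benchmark_once() is a hypothetical stand-in for the real workload:

#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the real workload being benchmarked.
void benchmark_once() {
    volatile long sink = 0;
    for (long i = 0; i < 10000000; ++i) sink += i;
}

int main() {
    const int runs = 100;
    double total_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        benchmark_once();
        auto end = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(end - start).count();
    }
    std::printf("avg: %.3f ms over %d runs\n", total_ms / runs, runs);
}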

For example in zlib-ng we see the average decompression speed change with code changes only in the compression code. There is no interaction between these code paths, and the code for each is far enough apart in memory that they never end up in the same cache line. On old machines this would have meant there was no way they could affect each other at all, but on modern machines this is a big benchmarking issue.

What I observe:

  1. The average changes due to changes to code size in code that is never even run.
  2. The min/max change a lot, even under very controlled circumstances, running on a CPU core that is isolated from the OS scheduler and from all IRQ handling that can be moved away from it.

These problems are affected by memory placement and cache alignment, which in turn are affected by the OS, linker decisions, and code size changes in unrelated parts of the program.

In my experience these problems only get worse the more advanced the CPUs get and the deeper the caches get. Older in-order CPUs, with fewer clever tricks like automatic prefetching and caching, were a lot easier to get repeatable benchmarks from.

Many benchmarks are very hard to get repeatable results from across multiple commits of code changes, with the result that a proposed speedup sometimes shows up as a big slowdown during PR review/testing.

Doing 100 runs, discarding the 60 slowest, and thus averaging only the 40 fastest seems to mitigate some of the OS-caused variation like memory placement and interrupts. In my own setup with a dedicated benchmarking machine, where I use every trick I know of to ensure repeatable results, I often see ±0.5% variation in the averages. Others can easily see ±1% or more variation in their 100-run average.
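The filtering step itself is trivial; a sketch, assuming the per-run times have already been collected:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Average only the 'keep' fastest runs, discarding the slower
// runs that tend to be dominated by OS noise.
double trimmed_avg(std::vector<double> times_ms, std::size_t keep) {
    if (times_ms.empty()) return 0.0;
    std::sort(times_ms.begin(), times_ms.end());       // fastest first
    times_ms.resize(std::min(keep, times_ms.size()));  // drop the slow tail
    return std::accumulate(times_ms.begin(), times_ms.end(), 0.0)
           / static_cast<double>(times_ms.size());
}
// e.g. trimmed_avg(times, 40) on 100 runs discards the 60 slowest.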

Stabilizer, when working properly, should be able to nearly eliminate the variation caused by the OS, linker decisions, and code size changes in unrelated code. What we want, of course, is that when you compare two benchmarks you know you are actually comparing the effect of the code change in the hot path instead of a dozen other effects.

I am not sure what I could add to the readme to illustrate this better; feel free to make suggestions. But I hope this helps you and others gain a slightly deeper understanding of how and why Stabilizer is important. I tried to keep this summary easy to read and understand, as things get very technical once you start looking at individual CPU core designs, for example.

PS: This summary is based on my own research and experience with benchmarking. I have dug into this a lot, both in a professional capacity and out of general interest, but this is not a peer-reviewed research article. 😉

pca006132 commented 2 years ago

Thanks for your summary, this is really helpful. Just a random thought: would it be possible to modify the linker script to align some specific hot functions to a page boundary, if alignment might be causing such performance problems?

Dead2 commented 2 years ago

@pca006132 Yes, that is possible, but quite advanced manual hacking and I'd only really consider that for really small projects. Imagine doing that with Chrome or Firefox for example 😄
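For a single hot function the manual version could look roughly like this on GCC/Clang (a sketch; 4096 assumes a 4 KiB page size, and hot_loop is just a made-up example):

// Ask the compiler to place this hot function on its own page
// boundary (GCC/Clang extension; 4096 = typical page size).
__attribute__((aligned(4096)))
int hot_loop(const int* data, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += data[i];
    return s;
}

Now imagine maintaining that by hand across a large codebase.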

Something I did not explain in the post above is the CPU's TLB (translation lookaside buffer).

The effect of a TLB miss is most easily compared to a cache line miss. Both are affected by code/data placement in memory, and both carry a performance penalty that is often not the same across multiple compilations (mainly due to code size changes).

Aligning functions to a page boundary would mean that each function occupies a separate TLB entry. While this would avoid the guesswork about which functions share a TLB entry, it would also risk running out of TLB entries entirely, potentially leading to even more unpredictable benchmarks. A typical TLB hit costs ~1 clock cycle; a typical TLB miss easily costs 10-100 cycles.

So here too Stabilizer would be beneficial, since it would automatically randomize code and data placement, ensuring that on average all functions and data stores are roughly equally "badly" placed in memory. This is still not perfect, but it is a very good approximation.

pca006132 commented 2 years ago

Yes, this approach would probably not scale if we aligned all functions to page boundaries, but it might (just might) make our results a bit more deterministic if we only align the really hot functions, e.g. the tight compression loop you mentioned. Anyway, I guess I need to try Stabilizer later when I have time, and learn from the actual benchmarks.

magras commented 1 year ago

I have no numbers either, but Stabilizer should greatly hinder compiler optimizations. For example, code randomization de facto disables inlining, which AFAIK is one of the most important optimization techniques because it implicitly enables interprocedural optimization.

I can even imagine constructing a patch that makes the code under Stabilizer run slower, while without Stabilizer it shows a speedup.

Would Stabilizer with this limitation still be useful for you, @Dead2?

Dead2 commented 1 year ago

@magras I am unclear on what you are suggesting here. Are you suggesting making the code running under Stabilizer artificially even slower than it currently is? That would probably defeat the purpose of using Stabilizer for benchmarking, so I think I am missing something here.

Or are you suggesting making it runtime selectable whether to run with/without Stabilizer enabled perhaps?

I think what @pca006132 was suggesting with linker scripts for function alignment is more of an alternative to Stabilizer than a modification of it.

magras commented 1 year ago

I'm sorry, I'll try again with more context.

Let me explain how Stabilizer achieves code location randomization. Stabilizer's pass runs before clang's optimization passes and modifies almost every function call to load the target address from a table located right after the end of the calling function. It's similar, but not exactly equivalent, to the PLT.
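In rough C++ terms the transformation looks something like this (a hand-written illustration, not the actual generated IR; in real Stabilizer the table is emitted right after the caller's code so the runtime can patch it during relocation):

int callee(int x) { return x + 1; }

// Stand-in for the per-function address table. In real Stabilizer
// it lives right after the calling function's code and is rewritten
// when functions are relocated.
int (*call_table[])(int) = { &callee };

int caller(int x) {
    // The direct call `callee(x)` becomes an indirect call
    // through the table entry.
    return call_table[0](x);
}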

Let's assume there is no actual code relocation and no additional runtime costs associated with it, just the code transformation I described above.

Now suppose there is a benchmark measuring the performance of this function:

int sum(std::vector<int> const& v) {
  int s = 0;
  for (int i : v) {
    s += i;
  }
  return s;
}

and there is an optimized version of sum:

void add(int& s, int i) {
  // do nothing
}

int sum(std::vector<int> const& v) {
  int s = 0;
  for (int i : v) {
    add(s, i);
  }
  return s;
}

I believe all of the big three compilers will optimize the patched version down to return 0, which is obviously faster than the original code, while Stabilizer will add an indirect call to add and will probably make the patched version slower than the baseline.
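You can emulate the effect of that indirection without Stabilizer at all; a sketch, with a volatile function pointer standing in for the address table:

#include <vector>

void add(int& s, int i) {
  // do nothing
}

// The volatile pointer stands in for Stabilizer's address table:
// the compiler must reload it on every call, cannot prove the
// target, and therefore cannot inline add() or delete the loop.
void (*volatile add_ptr)(int&, int) = add;

int sum(std::vector<int> const& v) {
  int s = 0;
  for (int i : v) {
    add_ptr(s, i);
  }
  return s;
}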

Yes, this is an artificially constructed situation. But I have doubts about the Stabilizer design, because with Stabilizer we are measuring the performance of code that is very different from the actual release version.

These doubts make it hard to stay motivated. It's probably still worth fixing the zlib-ng benchmark crashes and getting actual numbers and first-hand experience with Stabilizer, but...

Dead2 commented 1 year ago

@magras Ah, now I understand what you mean.

So the ideal method would probably be for Stabilizer to hook in at some point after the optimizer and inliner have already run (completely or partially), and only then rewrite function calls and returns. The actual compiler-side implementation of this is beyond me, as you know, for now at least.

I think this would clearly be a great benefit. What we really want to benchmark is the code changes (or possibly the optimization flags), and the best way to do that is of course to benchmark an application that is as close to the "release" build as possible.

Small question: could we run certain optimization passes directly before/during Stabilizer, for example just the inliner? Then we would make our changes on the already-merged functions afterwards. I don't know whether that could run before Stabilizer, or whether it would break and require a rewrite (with regard to different kinds of IR etc.).

magras commented 1 year ago

> So the ideal method would probably be for Stabilizer to hook in at some point after the optimizer and inliner have already run (completely or partially), and only then rewrite function calls and returns. The actual compiler-side implementation of this is beyond me, as you know, for now at least.

@Dead2, it's possible to tap into different stages of the compiler, but it might reduce the effect of Stabilizer. I'll explain what I mean in the next post (the point about micro benchmarks).

Btw, I'm not an expert either. I learned LLVM while studying Stabilizer's code.

> I think this would clearly be a great benefit. What we really want to benchmark is the code changes (or possibly the optimization flags), and the best way to do that is of course to benchmark an application that is as close to the "release" build as possible.

Inlining isn't the only problem. Right now I know of only one technique to fix the issues caused by code relocation: deoptimization. TLS, global variables, bulk copies of constants; they all get deoptimized. I think their impact is much smaller than inlining's, but I believe I could construct an analogous example for TLS that would be much closer to a real optimization in real code.

There are costs associated with Stabilizer's runtime too. Every function call starts with a trampoline (push the actual function address onto the stack and ret to it), and every 500 ms relocation kicks in (there are allocations, deallocations, and rewriting of relocated functions). Also, IIRC, Stabilizer's runtime doesn't support multithreading, and fixing the multithreading issues while minimizing overhead is hard.

> Small question: could we run certain optimization passes directly before/during Stabilizer, for example just the inliner? Then we would make our changes on the already-merged functions afterwards. I don't know whether that could run before Stabilizer, or whether it would break and require a rewrite (with regard to different kinds of IR etc.).

I'm not sure how to reorder or duplicate a builtin pass in llvm, but it's probably achievable. There shouldn't be problems with the IR. It might change the calculated inlining costs, but that's probably fine.

magras commented 1 year ago

Let me paint a broader picture of how I see Stabilizer.

  1. I don't think Stabilizer is viable for micro benchmarks, because they often consist of just one function, and I doubt that moving it around will affect the result. It should be possible to randomize the layout inside a function by inserting nops, but that requires tapping into the compiler backend (right now Stabilizer works at the middle end); see the sketch after this list.
  2. Stabilizer might help to test changes to small hot code in bigger benchmarks, because there are more functions that can interfere, but without intra-function layout randomization the results can still be biased.
  3. Runtime rerandomization is, in my opinion, useful primarily in big slow benchmarks, because it triggers every 500 ms and has a significant cost (allocations, deallocations, rewriting functions).
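As a crude source-level approximation of the nop idea (GCC/Clang inline asm; a real implementation would randomize the padding per build in the backend, and sum_padded is just a made-up example):

// Pad the loop body with nops to shift the intra-function layout.
// A backend pass would pick the padding randomly at compile time;
// here it is fixed just to illustrate the idea.
int sum_padded(const int* v, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) {
        asm volatile("nop; nop; nop; nop");  // layout padding
        s += v[i];
    }
    return s;
}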

Hence, in my opinion, Stabilizer was designed for big projects with long-running performance tests. I don't have such projects.

My main grudge is with rerandomization. It breaks the compiler's assumptions about the stability of function and global variable addresses, and that causes all the trouble.

I have been thinking about abandoning code relocation at runtime and randomizing layout only at compile time. Probably with linker scripts, but I have never used them and don't know how hard it would be to generate them with a random layout.
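A rough, untested sketch of what a generator could emit, assuming the program is compiled with -ffunction-sections so that every function gets its own .text.<name> section (a real script would also need the rest of the default layout, or use INSERT; the function names here are placeholders):

#include <algorithm>
#include <fstream>
#include <random>
#include <string>
#include <vector>

int main() {
    // Placeholder function names; a real generator would read them
    // from the object files (e.g. via nm).
    std::vector<std::string> funcs = {"sum", "add", "compress_loop"};
    std::shuffle(funcs.begin(), funcs.end(),
                 std::mt19937{std::random_device{}()});

    // Emit a linker script fragment that orders the per-function
    // sections randomly.
    std::ofstream out("layout.ld");
    out << "SECTIONS {\n  .text : {\n";
    for (const auto& f : funcs)
        out << "    *(.text." << f << ")\n";
    out << "    *(.text .text.*)\n  }\n}\n";  // everything not listed
}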

The obvious advantage of this approach is that we would be measuring the production code without any additional overhead, but of course there are downsides too.

@Dead2, would this approach work for your projects?