Test results more consistently

gurgunday commented 6 months ago

https://github.com/nodejs/performance/issues/166#issuecomment-2094019957 suggests that some results are unreliable, if dead code elimination is indeed taking place, we're not getting correct results for the operations

We should investigate and find a way to get results that reflect reality

RafaelGSS commented 6 months ago

dead code elimination is taking place. The idea of this repository is to compare two versions of Node.js performing pretty much the same operation. If version X contains a dead-code elimination (noop) and X+1 doesn't, it should inform. The percentage is irrelevant in microbenchmarking. If we include operations to remove dead code elimination (for instance, %DoNotOptimize) it won't give us the reality.

The idea is pretty simple, these operations have changed in Node.js 22 for that specific and unrealistic workload.

gurgunday commented 6 months ago

I see, yeah I'm not disagreeing, it actually fulfilled its purpose by showing that change

But I wasn't alluding to %DoNotOptimize, maybe just small modifications to certain tests so that we can get the actual operations' performance, and v8 can still optimize as much as it can, but doesn't eliminate the whole operation

Small modifications can get us there:

https://github.com/nodejs/performance/issues/166#issuecomment-2094019957

RafaelGSS commented 6 months ago

I agree! I also suggest moving to another benchmark tool. tinybench seems more reliable in my last research.

eugene1g commented 6 months ago

The idea of this repository is to compare two versions of Node.js performing pretty much the same operation.

IMO the problem is that when/if dead-code-elimination is at play, the current architecture for benchmarks does not measure the operation. If my application does millions of [].includes(), the benchmark results say that this logic will run ~10x slower in v22 than in v20, but in reality there will be no difference at all as my app code doesn't get eliminated. If the elimination theory is true, then current benchmark results are wrong because the benchmarking code calling [].includes() did not get executed all under v20 but did in v22. Such test behavior tells me nothing about how fast that operation runs in v20 vs v22.

RafaelGSS commented 6 months ago

Is there a way to see if the code is reaching dead-code elimination? I think it's fair to consider all code to be optimized by V8 (Maglev/Turbofan). We should only make sure that we are not comparing a noop with a real operation optimized.

cc: @joyeecheung

joyeecheung commented 6 months ago

Is there a way to see if the code is reaching dead-code elimination?

With TurboFan could figure it out by doing --trace-turbo and use tools/turbolizer in V8 and look at the graph changes - the elimination is done on basic blocks so it can be difficult to map it back to the original code, depending on how complex your code is. With Maglev I guess some of the --trace-maglev-* flags allows you to figure it out too but I've never tried them myself.

To avoid dead code elimination IMO the best way is just to make your benchmarks more sophisticated. If it's too simple and unrealistic you are basically trying to outsmart the optimizing compilers so that they fail to notice that the code isn't doing anything even though a human can figure out it's not doing anything. But optimizing compilers are designed to be smart enough for this (at least Turbofan is).

Maglev is dumber because it tries to cut corners in optimization under the assumption that most code doesn't need aggressive optimizations, which can be expensive on their own, while compilation speed of said code may already play a more important role in the overall performance. In the real world, a lot of code doesn't actually get hot enough to be worthy of expensive optimizations. Trying to measure how they perform when they are hot may have already been missing the point if the user code doing the pattern isn't hot enough and will only ever be handled by the interpreter or the baseline compiler/mid-tier optimizing compiler, and the compilation of the code takes more time in the application than actually executing said code (which can happen more frequently in CLIs).

RafaelGSS commented 1 month ago

Could be fixed by: #256.

gurgunday commented 1 month ago

Thank you so much!

What do you think of assigning a result of an operation to a global var btw? It seems to fix a lot of the code elimination issues for me when benching

Something like:

let result = "";

const bench = new Bench({ time: 500 });

bench.add("simple HTML formatting", () => {
  result = html`<div>Hello, world!</div>`;
});

bench.add("null and undefined expressions", () => {
  result = html`<p>${null} and ${undefined}</p>`;
});

RafaelGSS commented 1 month ago

It's not guaranteed that it will fix code elimination, other optimizations might take place. I was discussing at v8-mail and a feasible (but not realistic way) is to use %NeverOptimizeFunction for that. bench-node by default, compiles your benchmark using %NeverOptimizeFunction. That's why I said: "_could be fixed by ...". Although, in my tests, it doesn't guarantee a non-optimization as well.

RafaelGSS commented 1 month ago

I'll analyse the results of https://github.com/RafaelGSS/nodejs-bench-operations/commits/main/?before=b70c29c9c19936e25c7a521682ca52af49390f0f+35&after=b70c29c9c19936e25c7a521682ca52af49390f0f and compare with other tools.

RafaelGSS / nodejs-bench-operations

Test results more consistently #74