Alex313031 / Thorium-Win-AVX2

Repo to serve AVX2 Windows builds of Thorium. https://github.com/Alex313031/Thorium/
https://thorium.rocks/
BSD 3-Clause "New" or "Revised" License

Performance comparison #11

Closed: dabugen closed this issue 2 years ago

dabugen commented 2 years ago

I did a 30-minute benchmark (all 3 instances started in sync) with the app EA Studio, which I run 24/7 in Chromium, so I am always testing for the fastest builds. Interesting results can be seen in the screenshot. Just thought I'd share. Thanks for the good work and steady performance improvements! [Screenshot: Untitled]

dabugen commented 2 years ago

After some more hours of benchmarking, your 103.0.5014.0 build is still ahead of all the other builds. Interestingly, that includes justclueless' build, which is also AVX2, has the same major version (103), and should in principle be using the same optimizations as yours, yet it is a good bit behind in performance. Your older 102.0.4973 is also slower, on the same level as justclueless' 103 build. It's very interesting, especially since I can rule out any benchmark issues: this has now been running for 3 hours with the exact same settings, and 103.0.5014.0 is still extending its lead over all the other builds. I guess you must have used some new compiler optimizations in 103.0.5014.0? Otherwise justclueless' 103 build should have the same performance, but it doesn't.

Thanks for all the hard work, you help a lot to speed up my work! I appreciate that.

Alex313031 commented 2 years ago

@dabugen It's because he only uses AVX2. My BUILD.gn includes AVX, AVX2, as well as AES. It also has -O3 set for cflags and ldflags, and import_instr_limit set to 100 vs. 30 for vanilla Chromium. And it has other optimizations that I either made myself or got from RobRich (RIP Rob's builds; he and I had a long discussion on his GitHub, and I miss talking to him about Chromium). These include setting Rust components to also build with -O3 and SSE4/AVX/AVX2/AES. And finally there are the LOOP optimizations Rob shared with me, as well as some new ones from upstream LLVM. These include (copy/pasted from the main Thorium BUILD.gn):

"-mllvm", "-extra-vectorizer-passes",
"-mllvm", "-enable-cond-stores-vec",
"-mllvm", "-slp-vectorize-hor-store",
"-mllvm", "-enable-loopinterchange",
"-mllvm", "-enable-loop-distribute",
"-mllvm", "-enable-unroll-and-jam",
"-mllvm", "-enable-loop-flatten",
"-mllvm", "-interleave-small-loop-scalar-reduction",
"-mllvm", "-unroll-runtime-multi-exit",
"-mllvm", "-aggressive-ext-opt",

I have combed through the BUILD.gn files for all platforms as well as the main one, and can confidently say that I have the most optimized Chromium build out there. There's nothing else that you could optimize more, except for Polly, which Rob used for a couple of releases before ditching it because it was too hacky and too much of a hassle to get working. I contacted him to ask for a walkthrough on how to do it, because I wouldn't mind putting in the work to use it in Thorium, but then he stopped making releases and stopped responding on GitHub. He had some health issues (that's why), so I hope he's okay.
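The flag list quoted earlier in this comment alternates because every LLVM-internal option has to be handed to Clang behind its own "-mllvm" prefix. As a minimal sketch of that pairing (the helper name mllvm_pairs is hypothetical and does not come from Thorium's build files), the list could be generated from plain option names like this:

```python
def mllvm_pairs(llvm_options):
    """Return a flat flag list where each LLVM option follows its own "-mllvm"."""
    flags = []
    for option in llvm_options:
        flags.extend(["-mllvm", option])
    return flags


# Three of the options quoted in the comment above; the rest work the same way.
loop_options = [
    "-extra-vectorizer-passes",
    "-enable-loopinterchange",
    "-enable-unroll-and-jam",
]

cflags = mllvm_pairs(loop_options)
print(" ".join(cflags))
```

The sketch only shows the shape of the list; which options are actually worth enabling is the judgment call described in the comment.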

EDIT: Ok, so I kinda assumed you knew what cflags and shit are, so here's an explanation of the above stuff if you don't. LLVM/Clang/GCC have the main compiler optimization levels as such: -O0 (none), -Og (debug), -O1 (low), -O2 (default), and -O3 (high). Ldflags are the same, except they are for the linker rather than the compiler, and use -Wl,-O3: the -Wl, prefix has to come before -O3 to tell the compiler driver to pass it on to the linker.

Chromium and ChromiumOS are increasingly using Rust for components (although the majority is still C++), and Rust has its own optimization flag called -Copt-level, which is set to level 3 the same way, and its own instruction set extension flag, which is set to -Ctarget-feature=+sse4,+aes,+avx,+avx2. The import_instr_limit = 100 setting is a ThinLTO knob: it tells LLVM to consider functions of up to roughly 100 instructions (vs. 30 in vanilla Chromium) when importing them across modules so they can be inlined.
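To make the three flag families in this explanation concrete, here is a small Python sketch that assembles them side by side. The flag spellings follow the comment, with -Copt-level=3 assumed as the rustc spelling of the level it describes; the helper names linker_flag and rust_flags are hypothetical and are not taken from Thorium's BUILD.gn.

```python
def linker_flag(option):
    """Prefix an option with "-Wl," so the compiler driver forwards it to the linker."""
    return "-Wl," + option


def rust_flags(opt_level, features):
    """Build rustc-style flags: one -Copt-level and one -Ctarget-feature list."""
    feature_list = ",".join("+" + name for name in features)
    return ["-Copt-level=" + str(opt_level), "-Ctarget-feature=" + feature_list]


cflags = ["-O3"]                   # consumed by the C/C++ compiler itself
ldflags = [linker_flag("-O3")]     # forwarded to the linker
rustflags = rust_flags(3, ["sse4", "aes", "avx", "avx2"])  # feature names as quoted

print(cflags)
print(ldflags)
print(rustflags)
```

Keeping the three lists separate mirrors the point of the explanation: the same optimization level has to be requested once per tool, each with its own spelling.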

Thanks for using Thorium. It makes me overjoyed when I get reports that my projects are helping people in the real world with real-world work, and thanks for giving me feedback. I want all the feedback I can get, good and bad, so never hesitate to open a bug issue or to suggest changes.

dabugen commented 2 years ago

Thanks so much for the explanation. I can totally confirm that the performance of your builds is second to none. My benchmarks show it, and I am a freak when it comes to that, having tested (and still always testing) every possible custom build out there. Yours are about 10% faster almost all the time, even compared to the ones at Woolyss. My use case is not the typical Chromium web-surfing one; I do number crunching to develop trading systems (as already mentioned). So the CPU usage of each Chrome instance is maxed out 24/7, and having such a greatly optimized build simply yields 10% more systems all the time, which has an amazing effect on the total number of systems when running it for days or weeks. So thank you again for all the effort you put into this, and maybe you can even get Polly working one day. I am certainly ready to test what difference it would make for my use case.

As for the compiler options, I am somewhat familiar with them as I do some coding as well, but definitely not as deeply as you, and I haven't worked with Rust yet either, so it's great to see you are optimizing the compilation of this too, super amazing. As for "import_instr_limit" at 100, can this be set even higher, with possibly even more speed benefits? Just wondering, as it currently seems to be the only variable where one could possibly go even deeper, since you've already maxed out all the rest, haha ;-)

Thank you once again and have a great weekend ahead :-)

Alex313031 commented 2 years ago

@dabugen Yup, I usually get ~8 to 10% performance improvement over vanilla Chromium, give or take some margin of error. I'll try hittin' up Rob again, also just to try to make sure he's still f***kin alive! If not, I'll look through our long GitHub discussion, his source code, and the Polly documents to see if I can piece together how to do it.

And import_instr_limit is hard capped by LLVM at 100. If you were to, say, modify and compile LLVM and Clang yourself, then point the Chromium toolchain to use it rather than the prebuilt internal one, you could raise it. But they set it at 100 for a reason: beyond that you start getting diminishing returns and eventually ridiculous binary bloat (Thorium is already ~40% larger than vanilla Chromium). It might be interesting to try, and if I ever do I'll let you know. But yeah, other than Polly, there's really nothing else. I check the LLVM mailing lists occasionally to see if they've added any new compiler flags/optimizations that I could use.

I use Crunchbangplusplus Linux as my daily driver, and dual boot with Windows XP, 7 (which I still use more than 10), 10, and 11. I'm a speed freak as well, so certain programs that I use daily, like Atom, htop, gnome-system-monitor, Firefox, and Thorium, I compile from source with high optimization. The best would be Gentoo, as you can compile everything from source, custom tailored to your system, but it's finicky, complicated, and takes a long time. I've installed Gentoo in a VM and decided nope, sticking with Ubuntu/Debian-based distros.

So how did you find out about Thorium? And do you use it on any other platforms, like Linux or macOS? Or ThoriumOS lol > https://github.com/Alex313031/ChromiumOS

dabugen commented 2 years ago

Absolutely amazing that you are putting so much detail into the optimization, that's just like me indeed :-) I don't know how long I've been tuning my GraalVM with command-line options; it never ends LOL. Sometimes you get sucked into all the tuning and can spend weeks and months on it just to have everything performing as perfectly as possible, haha. So thanks a lot for all the explanations, and I hope you'll never stop exploring more options!

I am solely on Windows 11 these days (completely crazy tuned as usual, down to disabling services, drivers, and even kernel drivers, stripping it down to the bones for top performance; it looks like Windows 95 now haha). I found Thorium on Woolyss.

Thanks again!

dabugen commented 2 years ago

Sorry to hear about your hardware crash, that sucks so much!

I was wondering, is the latest build M103.0.5045.0 just as optimized as your previous builds? Given that another person built it.

Thank you.

Alex313031 commented 2 years ago

@dabugen Yes. I wrote a guide for building Thorium yourself natively on Windows, and for cross-building for Windows on Linux. He followed it, and I gave him access to my private API keys. It is exactly the same as any release I would make; it is just made by another person since I don't have the hardware to do it.

Alex313031 commented 2 years ago

@dabugen Also, to return to the original point of this issue, here are some screenshots of vanilla Chromium vs. Thorium.

Note that the vanilla Chromium release is not built with ThinLTO, which hurts its performance. The difference between something like Hibbiki's release and mine would be smaller, but still there. Also keep in mind that the scores for all of them are generally lower than what you and a lot of people might get, simply because I'm on an old CPU, an FX-8370 OC'd to 4.7 GHz on all cores.

Chromium: [Screenshots: Chromium_Vanilla_Octane_V2, Chromium_Vanilla_Speedometer]

Thorium: [Screenshots: Thorium_Octane_V2, Thorium_Speedometer]

Artoriuz commented 2 years ago

Hi @Alex313031

I'd first like to thank you for your excellent work!

I've been comparing some Chromium builds on my machine and I don't seem to be getting any meaningful difference from Thorium-AVX2. I'm using Speedometer 2.0 to compare them, since it runs pretty quickly, which makes comparing multiple builds less time-consuming.

As far as I understand, the performance difference between the builds should mostly come from either different compiler flags or perhaps from different default chrome flags.

I'm running all of these tests on a 5600X, and since Zen 3 has great AVX2 performance, I expected some performance improvement.

In any case, this is what I got from lowest to highest:

Speedometer 2.0:

Chromium: 161.1
Chromium (Hibbiki): 231
Thorium: 233.1
Thorium AVX2: 234.1
Chrome Canary: 243
Chromium (justclueless): 271
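To put the scores listed above on a common scale, a short Python sketch can express each one as a percentage gain over the vanilla Chromium result. The numbers are the ones reported in this comment; the helper name percent_over is hypothetical, and a single Speedometer run per build says nothing about run-to-run variance.

```python
# Speedometer 2.0 scores as reported in this comment (higher is better).
scores = {
    "Chromium": 161.1,
    "Chromium (Hibbiki)": 231.0,
    "Thorium": 233.1,
    "Thorium AVX2": 234.1,
    "Chrome Canary": 243.0,
    "Chromium (justclueless)": 271.0,
}


def percent_over(baseline, score):
    """Relative gain of score over baseline, in percent."""
    return (score / baseline - 1.0) * 100.0


baseline = scores["Chromium"]
for name, score in scores.items():
    print(f"{name:<26}{score:>7.1f}  {percent_over(baseline, score):+6.1f}%")

# Gap between the two AVX2 builds being compared in this comment.
gap = percent_over(scores["Thorium AVX2"], scores["Chromium (justclueless)"])
print(f"justclueless over Thorium AVX2: {gap:+.1f}%")
```

Viewed this way, the three middle builds sit within a few percent of each other, while the justclueless build is the clear outlier in this particular run.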

As far as I know justclueless also builds with similar settings, but I'm seeing much higher performance on his build.

I honestly have no idea why this would be the case, so I thought I'd share these results with you to see what you think about them.

Thanks in advance.

Alex313031 commented 2 years ago

@Artoriuz @dabugen See the new website for Thorium, along with a new deb repo for auto-updating on Linux. > https://thorium.rocks/

dabugen commented 2 years ago

Amazing! Thanks for all the hard work. I'll also test the newest AVX2 build today - really appreciate all the time you are putting into this enthusiast project!

Alex313031 commented 2 years ago

@Artoriuz Also, I have noticed that the M104 builds are having lower performance than usual. I have not changed the compiler options. Since Thorium is based on tip-of-tree Chromium, it gets fixes and features faster, but it also picks up any performance regressions or bugs. I believe this is an upstream thing. Thorium M104 is still outperforming vanilla Chromium, however, so I'm not worried. If it persists for many releases or gets worse, then I will look into it on my end.