RobRich999 / Chromium_Clang

Chromium browser compiled with the Clang/LLVM compiler.

Chromium Build Discussion #26

Open RobRich999 opened 2 years ago

RobRich999 commented 2 years ago

Discussion regarding Chromium builds and related topics.

RobRich999 commented 2 years ago

@char101 Do you intend to run Chromium, or better yet, a modern OS like Win10/11 on a system with 1-2GB of memory? o.0

If memory bloat is really a concern, you might as well build Chromium with Clang at -Oz and ThinLTO at -O0 with an instruction import limit of 5. Not my idea of a performant build, but to each his/her own I suppose.
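
Purely as an illustrative sketch, the args.gn side of such a build might look like this (import_instr_limit and thin_lto_enable_optimizations are existing GN args; the -Oz part would still take a hand edit to the optimize configs in //chromium/src/build/config/compiler/BUILD.gn):

# Hypothetical size-over-speed args.gn fragment; not a performant build.
is_official_build = true                # enables ThinLTO
thin_lto_enable_optimizations = false   # keep ThinLTO codegen at minimal opt
import_instr_limit = 5                  # Chromium's stock import limit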

Alex313031 commented 2 years ago

Also @Paukan777 I'm gonna be a grammar nazi and point out that on Windows it's chrome.dll, not chrome.so

Alex313031 commented 2 years ago

@RobRich999 Yeah, Win 10 now at 21H1 needs at least 2 GB to even run, much less run an application like Chromium with more than a tab open lol.

Alex313031 commented 2 years ago

Also, to everyone: I'm making my first Thorium release for Windows right now, and it includes chromedriver and content_shell in the mini_installer, for shits and giggles. (I like having a FULL release, with everything you would need for fiddling and testing.)

https://github.com/Alex313031/Thorium-Win

Paukan777 commented 2 years ago

Also @Paukan777 I'm gonna be a grammar nazi and point out that on Windows it's chrome.dll, not chrome.so

I haven't used Windows for almost two years, so I automatically wrote .so when I meant .dll ))

char101 commented 2 years ago

@char101 Do you intend to run Chromium, or better yet, a modern OS like Win10/11 on a system with 1-2GB of memory? o.0

I run Windows 10 with 64 GB of memory backed by a 32 GB pagefile on Intel Optane. For my use case, think 500 open tabs, probably after a month of not restarting Windows (only suspending it). A 64-bit build probably uses 20-30% more memory than a 32-bit build; multiplied across that number of tabs, the memory saving is quite significant.

So in this case it is about memory efficiency rather than a memory constraint. Memory efficiency is also part of optimization. Since Chromium is largely multiprocess, the memory limit of a 32-bit process does not really matter. IMO the extra performance of a 64-bit build is offset by its larger memory usage, so the 64-bit build is not the clear winner; it depends on the use case and whether the user prefers performance or lower memory usage.

RobRich999 commented 2 years ago

I used to load hundreds of tabs years ago when auditing certain datasets, but the actual pages were rather lightweight and readily handled by Opera, back in the days before it migrated to Blink.

Anyway, that would be a corner-case scenario for the majority of users. ;)

char101 commented 2 years ago

I used to load hundreds of tabs years ago when auditing certain datasets, but the actual pages were rather lightweight and readily handled by Opera, back in the days before it migrated to Blink.

That brings back memories. I was using Opera too, and indeed it had low memory usage even with a lot of tabs loaded. I feel I enjoyed using Opera more than current modern browsers.

Alex313031 commented 2 years ago

@RobRich999 Have you noticed the latest Win builds (mine too) have red UI buttons? Have you also noticed that if it is not set to compatibility mode for Win 8 or 7 when running on Win 10 or 11, all the tabs die with 'Error code: RESULT_CODE_MISSING_DATA'? Both are absolutely bizarre.

RobRich999 commented 2 years ago

@Alex313031 You will need to update LLVM: https://bugs.chromium.org/p/chromium/issues/detail?id=1265339

Alex313031 commented 2 years ago

@RobRich999 How? Just do a manual rm -r of depot_tools, .vpython_cipd_cache, and .vpython-root, then recreate with a rebase-update and gclient sync?

RobRich999 commented 2 years ago

No need to manually remove anything. Updating your Chromium git checkout and running gclient sync will pull the latest LLVM build done by the Chromium project itself. :)

I know we each tend to have our own ways of updating Chromium, but anyway here is mine for ToT builds on Linux:

export PATH="$PATH:${HOME}/depot_tools"
cd depot_tools
git checkout -f main
git pull --rebase
cd chromium/src
git checkout -f main
git pull --rebase
gclient sync --with_branch_heads -f -R -D


BTW, you can use Chromium's own LLVM script to build the latest checkout. For example on Linux:

python3 /home/robrich/depot_tools/chromium/src/tools/clang/scripts/build.py --without-android --without-fuchsia --llvm-force-head-revision --disable-asserts --gcc-toolchain=/usr --bootstrap

That does a bootstrap build. It pulls the current LLVM ToT checkout, generates a base LLVM build using GCC, then uses that to do the final LLVM build. There are other options noted in the build script. I would not bother with LTO and PGO for building LLVM unless on a fast modern system.

Also, one can edit the same script to modify various LLVM build options. I add Polly to the projects, which in turn requires enabling LLVM plugins and PIC in the build options. I also build only for x86 to save a little build time, since I do not target other architectures.
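
For reference, those script edits correspond roughly to the following LLVM CMake options (an illustrative standalone configure, not the exact build.py diff; the script sets many more flags than shown):

# Minimal sketch of the LLVM CMake options the edits correspond to:
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld;polly" \
  -DLLVM_ENABLE_PLUGINS=ON \
  -DLLVM_ENABLE_PIC=ON \
  -DLLVM_TARGETS_TO_BUILD=X86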

RobRich999 commented 2 years ago

To test a new build environment on my AMD 8c/16t notebook, yesterday I rolled a Win64 AVX2 build with modifications to Chromium and LLVM to specifically target the system's Zen 2 architecture. End result? No appreciable performance difference compared to my usual Win64 AVX build. As expected from previous experience, and once again confirmed, we have already pulled most of the "low-hanging fruit" with my existing optimizations.

Alex313031 commented 2 years ago

@RobRich999 Cool! Also, is it okay to set cflags to -O3? I currently have lto_opt_level set to 3, import_instr_limit to 30, and cflags to -O2. Can I change it to -O3 safely? Also, can all of this be done with -march=icelake-client and "-ffp-contract=fast" put right under it?

Also, why do you use git pull --rebase instead of git rebase-update, and why don't you add --with_tags to gclient sync?

RobRich999 commented 2 years ago

Yeah, you might as well set clang to -O3 since you already have LTO at -O3. :)

On Linux, search for "-march=$x64_arch" in //chromium/src/build/config/compiler/BUILD.gn, change it to your desired arch, remove the "-msse3" flag, and set FMA generation.

On a Windows build, you will need to do the same by searching for "-msse3" in //chromium/src/build/config/win/BUILD.gn and replacing it with your desired arch, FMA generation, etc.
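
As a sketch only (the exact spot in those GN files varies by milestone), the swap for the Ice Lake target asked about above would look something like:

# Hypothetical replacement for the stock "-msse3" entry:
cflags += [
  "-march=icelake-client",   # implies AVX2, FMA, etc.
  "-ffp-contract=fast",      # allow FMA contraction
]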

Make sure you are definitely disabling AVX, AVX2, FMA, etc. for tflite. Otherwise it will fail to build.

//chromium/src/third_party/tflite/BUILD.gn

Add to the existing cflags:

"-march=x86-64",
"-msse3",
"-mno-avx",
"-mno-avx2",
"-mno-fma",

Being somewhat pedantic, it could be slightly more optimized like:

"-march=x86-64",
"/clang:-mtune=znver2",
"-msse4.2",
"-mno-avx",
"-mno-avx2",
"-mno-fma",

With znver2 being replaced with whatever arch you are targeting.


I handle revision updates the old way, similar to how Nik used to do it for his builds back in the day. Just habit, and it works okay.


I finally have gotten around to letting gclient handle PGO profile updates. Edit //chromium/.gclient to add a custom variable. For example:

solutions = [
  {
    "name": "src",
    "url": "https://chromium.googlesource.com/chromium/src.git",
    "managed": False,
    "custom_deps": {},
    "custom_vars": {
      "checkout_pgo_profiles": True,
    },
  },
]

Then running gclient sync will pull the PGO profiles for your platform. I have not tested it yet, but AFAIK it should work for cross builds, assuming the other OS platforms are in the .gclient config.
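
For a cross build, that would mean listing the extra platform at the bottom of the same //chromium/.gclient file, e.g. (standard gclient syntax; again, untested for this case):

target_os = ["win"]   # outside the solutions list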

Make sure "chrome_pgo_phase = 2" is in the build args as usual, and you can forget about having to manually set the "pgo_data_path =" arg.
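
So the PGO-related args.gn fragment reduces to just this (illustrative; other args omitted):

is_official_build = true
chrome_pgo_phase = 2   # pgo_data_path is no longer needed; gclient manages the profile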


Right now I am tracking down an LLVM issue. Yay? Once resolved, I might bring back public Win64 AVX2 builds. Not that there is enough performance difference to care IMO, but it seemed popular.

RobRich999 commented 2 years ago

LLVM issue found.

https://reviews.llvm.org/rG4d8fff477e024698facd89741cc6cf996708d598

Using git revert --no-commit 4d8fff477e024698facd89741cc6cf996708d598 to resolve it locally for now.

Bug report filed with Chromium project.

RobRich999 commented 2 years ago

The LLVM project has reverted it for now. That was a rather quick response time, especially for a weekend. Yay!

https://reviews.llvm.org/rG6438a52df1c7f36952b6126ff7b978861b76ad45

RobRich999 commented 2 years ago

BTW, to cover the bases here, take note you need to build on hardware with the same level of instruction support if doing a native build.

For example, I cannot do a Win64 AVX2 build natively on my Win64 Opteron system, as the AMD Piledriver arch does not support AVX2.

However, cross building is different since the target platform binaries are not actually executed. For example I can do a Win64 AVX2 cross build on my Linux Opteron system despite the procs not supporting AVX2.
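
For reference, the targeting side of such a cross build is just a couple of args.gn lines per the upstream win_cross docs; the VS artifacts setup is the involved part:

target_os = "win"
target_cpu = "x64"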

RobRich999 commented 2 years ago

And of course the AVX2 build fails. lol

Paukan777 commented 2 years ago

The LLVM project has reverted it for now. That was a rather quick response time, especially for a weekend. Yay!

https://reviews.llvm.org/rG6438a52df1c7f36952b6126ff7b978861b76ad45

Looks like there's another LLVM bug in the latest checkout:

[50394/50421] LINK ./v8_context_snapshot_generator
FAILED: v8_context_snapshot_generator
python3 "../../build/toolchain/gcc_link_wrapper.py" --output="./v8_context_snapshot_generator" -- ../../../../../llvm-14.0.0/bin/clang++ -fuse-ld=lld -Wl,--fatal-warnings -Wl,--build-id=sha1 -fPIC -Wl,-z,noexecstack -Wl,-z,relro -Wl,-z,now -Wl,--icf=all -Wl,--color-diagnostics -Wl,-mllvm,-instcombine-lower-dbg-declare=0 -flto=thin -Wl,--thinlto-jobs=all -Wl,--thinlto-cache-dir=thinlto-cache -Wl,--thinlto-cache-policy=cache_size=10\%:cache_size_bytes=40g:cache_size_files=100000 -Wl,-mllvm,-import-instr-limit=30 -fwhole-program-vtables -m64 -no-canonical-prefixes -Wl,-mllvm,-polly -Wl,-mllvm,-polly-detect-profitability-min-per-loop-insts=40 -Wl,-mllvm,-polly-invariant-load-hoisting -Wl,-mllvm,-polly-vectorizer=stripmine -Wl,-O3 -Wl,--gc-sections -rdynamic -nostdlib++ --sysroot=../../build/linux/debian_sid_amd64-sysroot -Wl,--lto-O2 -Wl,-z,defs -Wl,--as-needed -fsanitize=cfi-vcall -fsanitize=cfi-icall -pie -Wl,--disable-new-dtags -Wl,--icf=none -o "./v8_context_snapshot_generator" -Wl,--start-group @"./v8_context_snapshot_generator.rsp" -Wl,--end-group -ldl -lpthread -lrt -lgmodule-2.0 -lgobject-2.0 -lgthread-2.0 -lglib-2.0 -lnss3 -lnssutil3 -lsmime3 -lplds4 -lplc4 -lnspr4 -lresolv -lgio-2.0 -lexpat -luuid -lm -lz -lX11 -lXcomposite -lXdamage -lXext -lXfixes -lXrender -lXrandr -lXtst -lgbm -lEGL -ldrm -lxcb -lxkbcommon -lwayland-client -ldbus-1 -lpangocairo-1.0 -lpango-1.0 -lharfbuzz -lcairo -latk-1.0 -latk-bridge-2.0 -lXi -lpci -lasound -latspi
Instruction does not dominate all uses!
  %85 = and i64 %84, 1
  %87 = phi i64 [ %4, %17 ], [ %85, %83 ], [ %85, %293 ]
Instruction does not dominate all uses!
  %84 = load i64, i64* %2, align 8
  %88 = phi i64 [ %3, %17 ], [ %84, %83 ], [ %84, %293 ]
LLVM ERROR: Broken module found, compilation aborted!
PLEASE submit a bug report to https://crbug.com and run tools/clang/scripts/process_crashreports.py (only works inside Google) which will upload a report and include the crash backtrace.
LLVM ERROR: Failed to rename temporary file thinlto-cache/Thin-bbddd1.tmp.o to thinlto-cache/llvmcache-5108B680AB41B712A9E0A32DA8C9A559725A5592: No such file or directory

Alex313031 commented 2 years ago

Thanks for the cflag info, and adding the PGO handling to gclient is nice. And ahh, so that's why my AVX2 builds failed on my Piledriver FX-8370. TFlite compiles fine with AVX and FMA set? It's only with AVX2 that it has issues?

Paukan777 commented 2 years ago

Has anyone successfully built Chromium 98? I got "Instruction does not dominate all uses!" during the LTO phase with both LLVM 14 and 13.

Upd: I figured out that this error is caused by Polly. So these flags no longer work (for me):

"-Wl,-mllvm,-polly",
"-Wl,-mllvm,-polly-detect-profitability-min-per-loop-insts=40",
"-Wl,-mllvm,-polly-invariant-load-hoisting",
"-Wl,-mllvm,-polly-vectorizer=stripmine",

RobRich999 commented 2 years ago

@Alex313031 TFlite might be fixed by now. Either way, no point enabling FMA unless you have AVX2 support as well. The only x86 procs with FMA but without AVX2 are AMD Bulldozer, Piledriver, and similar; and even for those it was FMA4 instead of the common FMA3 used by Intel and later AMD procs.

https://en.wikipedia.org/wiki/FMA_instruction_set

@Paukan777 Yeah, Polly can cause build issues at times. You can also try just using Polly itself without any extra config options, as build breakers in my experience tend to be with invariant load hoisting and stripmining.

RobRich999 commented 2 years ago

@Paukan777 BTW, it is possible to run Polly the old way during Clang codegen. Ideally it should run right before vectorization, which happens during LTO codegen for, well, LTO builds.

To run during Clang codegen, move Polly up into the common_optimize_on_cflags.

"-mllvm", "-polly",
"-mllvm", "-polly-detect-profitability-min-per-loop-insts=40",
"-mllvm", "-polly-invariant-load-hoisting",
"-mllvm", "-polly-position=early",
"-mllvm", "-polly-vectorizer=stripmine",
"-Xclang", "-Rpass-analysis=polly",

That is how we ran Polly back in the days before DeLICM support was added to it. Technically -polly-position=after-loopopt is available as well, but early is the preferred position when the default of running before vectorization cannot be used.

You can also add "-mllvm", "-polly-run-inliner" with the early position to run an additional inlining pass, which can help pull more code into functions for Polly to optimize. YMMV on build time and any actual performance differences, though.
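
i.e., appended to the same cflags block:

"-mllvm", "-polly-run-inliner",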

Paukan777 commented 2 years ago

@RobRich999

../../../../../llvm-14.0.0/bin/clang -MMD -MF obj/third_party/zlib/zlib_adler32_simd/adler32_simd.o.d -DUSE_UDEV -DUSE_AURA=1 -DUSE_GLIB=1 -DUSE_NSS_CERTS=1 -DUSE_OZONE=1 -DUSE_X11=1 -DOFFICIAL_BUILD -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -DNO_UNWIND_TABLES -D_GNU_SOURCE -DSTDC_CONSTANT_MACROS -DSTDC_FORMAT_MACROS -D_FORTIFY_SOURCE=2 -D_LIBCPP_ABI_UNSTABLE -D_LIBCPP_DISABLE_VISIBILITY_ANNOTATIONS -D_LIBCXXABI_DISABLE_VISIBILITY_ANNOTATIONS -D_LIBCPP_ENABLE_NODISCARD -DCR_LIBCXX_REVISION=79a2e924d96e2fc1e4b937c42efd08898fa472d7 -DCR_SYSROOT_HASH=95051d95804a77144986255f534acb920cee375b -DNDEBUG -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -DZLIB_IMPLEMENTATION -DADLER32_SIMD_SSSE3 -DX86_NOT_WINDOWS -I../.. -Igen -I../../buildtools/third_party/libc++ -mssse3 -fno-delete-null-pointer-checks -fno-ident -fno-strict-aliasing --param=ssp-buffer-size=4 -fstack-protector -fno-unwind-tables -fno-asynchronous-unwind-tables -fPIC -pthread -fcolor-diagnostics -fmerge-all-constants -fcrash-diagnostics-dir=../../tools/clang/crashreports -mllvm -instcombine-lower-dbg-declare=0 -flto=thin -fsplit-lto-unit -fwhole-program-vtables -fcomplete-member-pointers -m64 -march=skylake -mavx2 -ffp-contract=fast -ffile-compilation-dir=. -no-canonical-prefixes -Wall -Wextra -Wimplicit-fallthrough -Wunreachable-code-aggressive -Wthread-safety -Wextra-semi -Wno-missing-field-initializers -Wno-unused-parameter -Wloop-analysis -Wno-unneeded-internal-declaration -Wenum-compare-conditional -Wno-psabi -Wno-ignored-pragma-optimize -Wshadow -O3 -mllvm -polly -mllvm -polly-detect-profitability-min-per-loop-insts=40 -mllvm -polly-invariant-load-hoisting -mllvm -polly-position=early -mllvm -polly-vectorizer=stripmine -Xclang -Rpass-analysis=polly -fdata-sections -ffunction-sections -fno-unique-section-names -fno-omit-frame-pointer -g0 -ftrivial-auto-var-init=pattern -fprofile-instr-use=//mnt//CACHE//depot_tools//chromium//src//chrome//build//pgo_profiles//chrome-linux-main-1637214862-1c2661341c9ed6b3489d342fa76bdb1b2e16835d.profdata -Wno-profile-instr-unprofiled -Wno-profile-instr-out-of-date -Wno-backend-plugin -fsanitize=cfi-vcall -fsanitize-ignorelist=../../tools/cfi/ignores.txt -fsanitize=cfi-icall -fvisibility=hidden -Wheader-hygiene -Wstring-conversion -Wtautological-overlap-compare -O3 -mllvm -polly -mllvm -polly-detect-profitability-min-per-loop-insts=40 -mllvm -polly-invariant-load-hoisting -mllvm -polly-position=early -mllvm -polly-vectorizer=stripmine -Xclang -Rpass-analysis=polly -fdata-sections -ffunction-sections -fno-unique-section-names -std=c11 --sysroot=../../build/linux/debian_sid_amd64-sysroot -c ../../third_party/zlib/adler32_simd.c -o obj/third_party/zlib/zlib_adler32_simd/adler32_simd.o
clang (LLVM option parsing): for the --polly-detect-profitability-min-per-loop-insts option: may only occur zero or one times!
[1527/50467] ACTION //extensions/browser/api:api_registration_bundle_generator_registration(//build/toolchain/linux:clang_x64)
ninja: build stopped: subcommand failed.

Adding Polly to common_optimize_on_cflags doesn't work, as the arguments get doubled for some reason and the build fails... Can I add it to the cflags where I add AVX, etc.?

RobRich999 commented 2 years ago

You can try that, or instead comment out these two lines like I do:

//chromium/src/third_party/libgav1/BUILD.gn

configs += [ "//build/config/compiler:optimize_max" ]

//chromium/src/third_party/zlib/BUILD.gn

configs += [ "//build/config/compiler:optimize_speed" ]

They will still be optimized according to the default compiler optimization config, which, if you are doing like me, is modified to -O3 like the other two anyway, so no big deal.

The build configs as set in those third-party components create duplicate cflags. Clang will ignore many duplicates, but not duplicates for Polly. I have never bothered reporting the issue with those two components, as they work okay with an unmodified Chromium build.

Alex313031 commented 2 years ago

@RobRich999 Bulldozer only has FMA4, but Piledriver/Steamroller/Excavator have BOTH FMA3 and FMA4. And it is FMA3 that is the common one supported nowadays; FMA4 is the weird one only supported by "heavy equipment" processors. I build with AVX and FMA because my system benefits from it, and because I can't natively build for AVX2. But anyone running Haswell or later will be able to run my Thorium releases. Is there a way to "cross compile" for Linux on Linux that will allow me to target AVX2?

RobRich999 commented 2 years ago

@Alex313031 Intel kind of screwed up the situation with FMA4. AMD thought FMA4 was going to be the standard, then Intel instead settled on FMA3 for Haswell. On the AMD side, I tend to just avoid FMA anything before Zen. ;) Same thing for AVX2, despite Excavator being out there.... somewhere.... being used by.... someone? Hmmm.

Cross compiling x86 Linux on x86 Linux is not supported natively.

RobRich999 commented 2 years ago

I have Win64 AVX2 once again cross compiling on my Linux build box.

https://github.com/RobRich999/Chromium_Clang/releases/tag/v98.0.4715.0-r943497-win64-avx2

Performance is roughly the same as Win64 AVX on my AMD 5700u system. I have not tested on my Intel Kaby Lake system, so I will simply say "whatever" and YMMV. I will try to keep it on the same refresh cycle as my other Win64 builds, barring any significant issues.

Paukan777 commented 2 years ago

@RobRich999 I see you rolled out a new release. How did you do the Polly optimizations while avoiding the linker crash?

RobRich999 commented 2 years ago

@Paukan777 As noted a few posts back, I moved Polly back to early during Clang (cflags) codegen, so I am not running it during link-time LTO (ldflags) codegen for now. It is not as ideal for optimizations as running Polly before vectorization during LTO codegen, but it works.

RobRich999 commented 2 years ago

I have a Win64 AVX2 test build without Polly optimizations, though with various additional LLVM loop optimizations and the LTO instruction import limit upped to 100. The test build is faster than my current release build with Polly in the usual listings, at least in multiple benchmarks on my AMD 5700u system.

https://github.com/RobRich999/Chromium_Misc/releases/tag/v98.0.4722.0-r943907-win64-avx2

I suspect Polly at early during Clang codegen could be interfering with PGO, and Polly is borked during LTO codegen for building Chromium v98 right now, so I am poking at other potential optimizations. Actually, I am not even sure we are going to need the extra LLVM loop passes, but I will have to run a build or few to evaluate. I have been needing to revisit various LLVM optimizations anyway. I will let y'all know what extra LLVM optimizations, if any, I find beneficial.

Alex313031 commented 2 years ago

@RobRich999 What LLVM loop optimizations are you using? And I STILL cannot get cross building for Windows to work. Would you be able to send your zip of VS artifacts and list the exact steps you use to cross build, so that I can "recreate" your workflow?

What does upping the instruction import limit do? I have mine set to 30 as per your suggestion, but I'll try 100.

You could set FMA for the AVX2 builds, because all processors supporting AVX2 also support FMA. Setting ffp-contract=fast only builds FMA3, so it's not a big deal. What I do for my personal builds, though, is set the microarch to bdver2 because I have a Piledriver CPU, which optimizes for every instruction it supports, including FMA4. My Thorium builds are AVX with ffp-contract=fast, but I might turn that off so that people with 2nd-3rd gen Intel CPUs can use it. This is because on AMD's side every CPU that supports AVX also supports FMA, but on Intel's side only Haswell and above support FMA, while Intel 2nd and 3rd gen still have AVX.

I also added a patch for PulseAudio to Thorium and disabled FLoC using ungoogled-chromium's patch. You might be interested in the PulseAudio patch, as it was recently added to the Debian, Ubuntu, AND Arch chromium packages, and it seems like something that should just be added to upstream Chromium.

RobRich999 commented 2 years ago

I use FMA3 in AVX2 builds. ;)

I suspect you will find no performance difference with FMA4 disabled for your AMD Piledriver procs. I doubt LLVM devs ever spent much time and effort on FMA4 optimizations. BTW, my Opteron systems are Piledriver-based as well.

I had posted about the loop and autovec opts at woolyss.com yesterday, though I was half asleep and did not copy-and-paste here. Anyway, here ya' go:

common_optimize_on_ldflags += [
  "-mllvm:-extra-vectorizer-passes",
  "-mllvm:-enable-cond-stores-vec",
  "-mllvm:-slp-vectorize-hor-store",
  "-mllvm:-enable-loopinterchange",
  "-mllvm:-enable-loop-distribute",
  "-mllvm:-enable-unroll-and-jam",
  "-mllvm:-enable-loop-flatten",
  "-mllvm:-interleave-small-loop-scalar-reduction",
  "-mllvm:-unroll-runtime-multi-exit",
  "-mllvm:-aggressive-ext-opt",
]

common_optimize_on_cflags += [
  "-mllvm", "-extra-vectorizer-passes",
  "-mllvm", "-enable-cond-stores-vec",
  "-mllvm", "-slp-vectorize-hor-store",
  "-mllvm", "-enable-loopinterchange",
  "-mllvm", "-enable-loop-distribute",
  "-mllvm", "-enable-unroll-and-jam",
  "-mllvm", "-enable-loop-flatten",
  "-mllvm", "-interleave-small-loop-scalar-reduction",
  "-mllvm", "-unroll-runtime-multi-exit",
  "-mllvm", "-aggressive-ext-opt",
]

Some of those loop opt passes are similar to what Polly does anyway, but implemented differently, without using ISL like Polly. Otherwise the optimization config is -O3 across the board for Clang and ThinLTO, with the LTO instruction import limit bumped up to the LLVM default of 100 (for testing right now, anyway).

The higher the limit, the more instructions LTO can pull across a codebase when doing analysis and optimization. See a basic example of LTO here:

https://llvm.org/docs/LinkTimeOptimization.html#example-of-link-time-optimization

LLVM defaults the limit to 100. Chromium sets the limit to 5 to save build time and limit binary bloat, while retaining a significant percent of LTO performance benefits. Chromium tends to favor code size as long as performance is not negatively impacted, while I tend to not care much about binary sizes.

The ChromeOS project did an analysis of LTO limits a few years ago, and it determined 30 was a good compromise for its builds. I do not have the link right now, but IIRC, the performance difference between 30 and 100 for its codebase was like a percent or so. Accordingly, I opted for the same limit.
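
Either way, it is a one-line GN arg if you want to experiment with it yourself:

import_instr_limit = 30   # in args.gn; Chromium defaults to 5, LLVM to 100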

I will see about cooking up a condensed cross-build guide. In the meantime, are you able to generate the VS artifacts? Are you generating them from the Windows v19041 SDK?

Alex313031 commented 2 years ago

@RobRich999 Thanks for the explanation as always. Yes, I'm able to make a zip of the artifacts, and yes, I'm using the latest 2019 SDK. My issues begin when doing the steps after https://chromium.googlesource.com/chromium/src/+/refs/heads/main/docs/win_cross.md#if-you_re-not-at-google

Also, are those all the possible loop optimizations, or did you exclude some? And do you have a link to all the possible mllvm values? Similar to this page, but for mllvm: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

RobRich999 commented 2 years ago

Okay, it is probably an issue passing the artifact value. I will post more details when I have my Linux build box online.

LLVM has various loop optimization passes. I added most of the ones not already active, which tends to be because they are still in development, sometimes after years of work. Off the top of my head, I am not running loop versioning for LICM, due to it routinely breaking Chromium in my previous experience, nor loop fusion, because it is not even in the pass manager at this time.

You can obtain a list of LLVM optimizations and passes from the opt binary. It is a long list, so pipe it to a file IMO.

opt.exe --help-hidden > optimizations.txt

You might want to do similar for clang and clang-cl, though note clang-cl uses /help like MSVC instead of --help-hidden.
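
For instance, to skim just the loop-related options out of that long list:

opt --help-hidden | grep -i loop > loop-passes.txt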

RobRich999 commented 2 years ago

On a positive note, despite somewhat bloating the output binaries, it does not appear Polly at early in Clang does anything detrimental to PGO. So.... it likely will remain there in my release builds for now.

RobRich999 commented 2 years ago

I am not going to get too bogged down in benchmark noise and micro-optimizations, as I could end up spending days rolling numerous test builds to find a percent or less of performance difference. Been there, done that in the past. Not so much anymore.

I added basic Polly at early in Clang codegen back into the mix, and it seems to be playing okay with the other added loop and autovec passes in my internal builds. I did end up dropping Polly's early inlining pass, as it appeared to do little more than inflate binary size. DeLICM is disabled since it does little to nothing for Polly at early.... well, other than add build time.

common_optimize_on_cflags += [
  "-mllvm", "-extra-vectorizer-passes",
  "-mllvm", "-enable-cond-stores-vec",
  "-mllvm", "-slp-vectorize-hor-store",
  "-mllvm", "-enable-loopinterchange",
  "-mllvm", "-enable-loop-distribute",
  "-mllvm", "-enable-unroll-and-jam",
  "-mllvm", "-enable-loop-flatten",
  "-mllvm", "-interleave-small-loop-scalar-reduction",
  "-mllvm", "-unroll-runtime-multi-exit",
  "-mllvm", "-aggressive-ext-opt",
  "-mllvm", "-polly",
  "-mllvm", "-polly-enable-delicm=false",
  "-mllvm", "-polly-position=early",
  "-Xclang", "-Rpass-analysis=polly",
]

common_optimize_on_ldflags += [
  "-mllvm:-extra-vectorizer-passes",
  "-mllvm:-enable-cond-stores-vec",
  "-mllvm:-slp-vectorize-hor-store",
  "-mllvm:-enable-loopinterchange",
  "-mllvm:-enable-loop-distribute",
  "-mllvm:-enable-unroll-and-jam",
  "-mllvm:-enable-loop-flatten",
  "-mllvm:-interleave-small-loop-scalar-reduction",
  "-mllvm:-unroll-runtime-multi-exit",
  "-mllvm:-aggressive-ext-opt",
]

Also I did try various register allocation and optimization passes that I have used in the past, though the results were flat to even negative territory. Seems to be the case IME since the new pass manager landed. Whatever.

Moving on to the next step, now I am comparing the LTO instruction import limit at 30 vs 100.

Thinking down the road, at some point I probably should revisit opt levels instead of brute forcing -O3, though that is not really on my priority list right now.

Chromium aside, tomorrow is Turkey Day here. Happy holidays to everyone!

RobRich999 commented 2 years ago

Just benchmarked the latest Edge Canary build. Performance is about the same as my latest internal build. Ack! Looks like I will be going back to the drawing board. There is not much reason for me to build Chromium when I can simply click install to obtain the same performance. Hmmm.

RobRich999 commented 2 years ago

I will push an updated Win64 AVX2 build in a few minutes. -O3 across the board, LTO import limit at 30, and the following extra optimizations:

common_optimize_on_cflags += [
  "-mllvm", "-extra-vectorizer-passes",
  "-mllvm", "-enable-cond-stores-vec",
  "-mllvm", "-slp-vectorize-hor-store",
  "-mllvm", "-enable-loopinterchange",
  "-mllvm", "-enable-loop-distribute",
  "-mllvm", "-enable-unroll-and-jam",
  "-mllvm", "-enable-loop-flatten",
  "-mllvm", "-interleave-small-loop-scalar-reduction",
  "-mllvm", "-unroll-runtime-multi-exit",
  "-mllvm", "-aggressive-ext-opt",
  "-mllvm", "-polly",
  "-mllvm", "-polly-detect-profitability-min-per-loop-insts=40",
  "-mllvm", "-polly-position=early",
  "-Xclang", "-Rpass-analysis=polly",
]

common_optimize_on_ldflags += [
  "-mllvm:-extra-vectorizer-passes",
  "-mllvm:-enable-cond-stores-vec",
  "-mllvm:-slp-vectorize-hor-store",
  "-mllvm:-enable-loopinterchange",
  "-mllvm:-enable-loop-distribute",
  "-mllvm:-enable-unroll-and-jam",
  "-mllvm:-enable-loop-flatten",
  "-mllvm:-interleave-small-loop-scalar-reduction",
  "-mllvm:-unroll-runtime-multi-exit",
  "-mllvm:-aggressive-ext-opt",
]

DeLICM can be disabled in Polly if desired. It might save a few minutes of build time. Whatever IMO.
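
If you do want DeLICM off, it is the same toggle from the earlier snippet:

"-mllvm", "-polly-enable-delicm=false",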

I have dropped stripmining in Polly for now. Being realistic, it is not really a vectorizer but instead a pre-vectorization pass. It uses a fixed prevec width to generate chunks, which might not be optimal considering we are dealing with such a large codebase of varied code.

Note the loop flatten pass might break building Chromium depending upon the specific optimization config. YMMV. Just an FYI in case.


Chromium Win64 AVX2 r945375 (O3, LTO 30, Polly)
JetStream2: 159
MotionMark1.2: 941
Speedometer2.0: 140

Chromium Win64 r945494 (project dev build)
JetStream2: 155
MotionMark1.2: 750
Speedometer2.0: 110

Alex313031 commented 2 years ago

How would I add all these loop optimizations minus Polly, @RobRich999? Is it sufficient to copy the loop flags from your first post about them underneath if (is_official_build)? I can't tell if this affects everything as long as is_official_build = true, or if it only affects Windows builds, as it is underneath if (is_win). This is at line 1869ish.

RobRich999 commented 2 years ago

//chromium/src/build/config/compiler/BUILD.gn


For Windows:


if (is_win) {
  common_optimize_on_cflags = [
    "/Ob2",  # Both explicit and auto inlining.
    "/Oy-",  # Disable omitting frame pointers, must be after /O2.
    "/Zc:inline",  # Remove unreferenced COMDAT (faster links).
  ]
  if (!is_asan) {
    common_optimize_on_cflags += [
      # Put data in separate COMDATs. This allows the linker
      # to put bit-identical constants at the same address even if
      # they're unrelated constants, which saves binary size.
      # This optimization can't be used when ASan is enabled because
      # it is not compatible with the ASan ODR checker.
      "/Gw",
    ]
  }
  common_optimize_on_ldflags = []

  common_optimize_on_cflags += [
    "-mllvm", "-extra-vectorizer-passes",
    "-mllvm", "-enable-cond-stores-vec",
    "-mllvm", "-slp-vectorize-hor-store",
    "-mllvm", "-enable-loopinterchange",
    "-mllvm", "-enable-loop-distribute",
    "-mllvm", "-enable-unroll-and-jam",
    "-mllvm", "-enable-loop-flatten",
    "-mllvm", "-interleave-small-loop-scalar-reduction",
    "-mllvm", "-unroll-runtime-multi-exit",
    "-mllvm", "-aggressive-ext-opt",
    "-mllvm", "-polly",
    "-mllvm", "-polly-detect-profitability-min-per-loop-insts=40",
    "-mllvm", "-polly-position=early",
    "-Xclang", "-Rpass-analysis=polly",
  ]

  common_optimize_on_ldflags += [
    "-mllvm:-extra-vectorizer-passes",
    "-mllvm:-enable-cond-stores-vec",
    "-mllvm:-slp-vectorize-hor-store",
    "-mllvm:-enable-loopinterchange",
    "-mllvm:-enable-loop-distribute",
    "-mllvm:-enable-unroll-and-jam",
    "-mllvm:-enable-loop-flatten",
    "-mllvm:-interleave-small-loop-scalar-reduction",
    "-mllvm:-unroll-runtime-multi-exit",
    "-mllvm:-aggressive-ext-opt",
  ]

  # /OPT:ICF is not desirable in Debug builds, since code-folding can result in
  # misleading symbols in stack traces.
  if (!is_debug && !is_component_build) {
    common_optimize_on_ldflags += [ "/OPT:ICF" ]  # Redundant COMDAT folding.
  }

  if (is_official_build) {
    common_optimize_on_ldflags += [ "/OPT:REF" ]  # Remove unreferenced data.
    # TODO(thakis): Add LTO/PGO clang flags eventually, https://crbug.com/598772
  }
}

For Linux, drop down a few more lines to the else branch of the same conditional:


} else {
  common_optimize_on_cflags = []
  common_optimize_on_ldflags = []

  common_optimize_on_cflags += [
    "-mllvm", "-extra-vectorizer-passes",
    "-mllvm", "-enable-cond-stores-vec",
    "-mllvm", "-slp-vectorize-hor-store",
    "-mllvm", "-enable-loopinterchange",
    "-mllvm", "-enable-loop-distribute",
    "-mllvm", "-enable-unroll-and-jam",
    "-mllvm", "-enable-loop-flatten",
    "-mllvm", "-interleave-small-loop-scalar-reduction",
    "-mllvm", "-unroll-runtime-multi-exit",
    "-mllvm", "-aggressive-ext-opt",
    "-mllvm", "-polly",
    "-mllvm", "-polly-detect-profitability-min-per-loop-insts=40",
    "-mllvm", "-polly-position=early",
    "-Rpass-analysis=polly",
  ]

  common_optimize_on_ldflags += [
    "-Wl,-mllvm,-extra-vectorizer-passes",
    "-Wl,-mllvm,-enable-cond-stores-vec",
    "-Wl,-mllvm,-slp-vectorize-hor-store",
    "-Wl,-mllvm,-enable-loopinterchange",
    "-Wl,-mllvm,-enable-loop-distribute",
    "-Wl,-mllvm,-enable-unroll-and-jam",
    "-Wl,-mllvm,-enable-loop-flatten",
    "-Wl,-mllvm,-interleave-small-loop-scalar-reduction",
    "-Wl,-mllvm,-unroll-runtime-multi-exit",
    "-Wl,-mllvm,-aggressive-ext-opt",
  ]

  if (is_android) {
    # TODO(jdduke) Re-enable on mips after resolving linking
    # issues with libc++ (crbug.com/456380).
    if (current_cpu != "mipsel" && current_cpu != "mips64el") {
      common_optimize_on_ldflags += [
        # Warn in case of text relocations.
        "-Wl,--warn-shared-textrel",
      ]
    }
  }

I have not forgotten ya' about cross building. It might be a day or two before getting something posted there.

Alex313031 commented 2 years ago

Thanks!

RobRich999 commented 2 years ago

Welcome. :)

Cleaned up the Linux copy-and-paste. -Xclang is not needed for Linux building.

BTW, you can comment out the Rpass lines if desired. They display lots of SCoP analysis data in the terminal. I sometimes glance at the data to get an idea of Polly optimization rates, but unless there is an actual technical need, there really is not much point unless ya' just like watching what Polly is doing while compiling.

Alex313031 commented 2 years ago

@RobRich999 Why drop it down a few lines for Linux? Won't setting this anywhere(ish) make the loop optimizations apply to everything? If not, then would the if (is_official_build) section exclude the cflags, since that section only has common_optimize_on_ldflags += [ "/OPT:REF" ] and not common_optimize_on_cflags += [ $insertsomethinghere...

Also, I'm still not using Polly, as I don't wanna have to compile LLVM and point Chromium to use it, so I just removed the four lines in cflags referencing Polly. Also, Thorium has gotten some more patches, and I will be using these loop optimizations in the next build.

Alex313031 commented 2 years ago

@RobRich999 HELP! lol. I'm getting these errors when compiling:

clang (LLVM option parsing): for the --extra-vectorizer-passes option: may only occur zero or one times!
clang (LLVM option parsing): for the --enable-cond-stores-vec option: may only occur zero or one times!
clang (LLVM option parsing): for the --slp-vectorize-hor-store option: may only occur zero or one times!
clang (LLVM option parsing): for the --enable-loopinterchange option: may only occur zero or one times!
clang (LLVM option parsing): for the --enable-loop-distribute option: may only occur zero or one times!
clang (LLVM option parsing): for the --enable-unroll-and-jam option: may only occur zero or one times!
clang (LLVM option parsing): for the --enable-loop-flatten option: may only occur zero or one times!
clang (LLVM option parsing): for the --interleave-small-loop-scalar-reduction option: may only occur zero or one times!
clang (LLVM option parsing): for the --unroll-runtime-multi-exit option: may only occur zero or one times!
clang (LLVM option parsing): for the --aggressive-ext-opt option: may only occur zero or one times!

RobRich999 commented 2 years ago

Read back a few posts. ;) There is an issue with passing duplicate cflags. Comment out these lines in these files like so:

//chromium/src/third_party/libgav1/BUILD.gn

# configs += [ "//build/config/compiler:optimize_max" ]

//chromium/src/third_party/zlib/BUILD.gn

# configs += [ "//build/config/compiler:optimize_speed" ]

You move the optimizations down for Linux because there is an OS conditional for "is_win" and "else."

Alex313031 commented 2 years ago

@RobRich999 Ahh, I see. One thing though: you said "Clang will ignore many duplicates, but not duplicates for Polly", but this happens with the loop optimizations even without Polly. I will be refactoring my BUILD.gn and adding those zlib and libgav1 caveats to the Thorium repo. Thanks as always; I already have 8 users on Reddit, and I couldn't have done it without you.

RobRich999 commented 2 years ago

Congrats. :)

I was more or less saying Polly does not tolerate duplicate passes, while it is more like YMMV with the others. There are lots of duplicates clang/clang-cl ignores when building Chromium, though yeah, those extra optimization passes apparently are not among them. Makes it "fun" figuring out what does and does not work with duplicates in LLVM.

Alex313031 commented 2 years ago

@RobRich999 In a week I'm getting a Z97 board and a 4790K, with an over-specced cooler so that I can hopefully run it at 4.6 GHz on all cores or more, and speed up my Thorium & ChromiumOS development by about 40%, as well as build natively for AVX2. Also, a new drive came in this week that can hold all my personal stuff, as well as the ~220 GB that Chromium, ChromiumOS, and built ChromiumOS images will take up, so I don't have to have multiple 250 GB + 250 GB + 120 GB drives just to do my daily stuffz. Probably not important to post here, but I is an excited nerd.