bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.06k stars 4.04k forks source link

Defining C++ repos with module 'use_repo_rule' impacts windows (msvc/clang-cl) compile speed by up to 5x - tildes? #22865

Closed peakschris closed 2 months ago

peakschris commented 3 months ago

Description of the bug:

We have had a curious regression in compile speed for the last 3 weeks. Whilst previously our compiles loaded 40 local cpus to 100%, after the regression cpus were only working at ~25% and compile times increased by 5x (3h to 15h)

I have spent ages trying to figure this out. Tried multiple windows machines, msvc and clang-cl, profiled the build, tested the filesystem, tried a ramdisk, multiple bazel versions...

I've finally narrowed it down to a reproducible example. It happened when I converted all our C++ external repos from workspace to module via use_repo_rule. I think it may be due to the tilde's (~) in the include paths. Perhaps it's an MSVC bug, but it also impacts clang-cl builds which is curious. When I run the Maybe it is a windows bug?

I can't share the code, but can share what we changed and the performance characteristics. This is a tiny example of low level code so only shows a 40% degradation. With 50 repos in lower level code the impact is much larger.

With workspace repo (30s) image The flat peak is the time when the compiles are all happening and cpus are loaded 100%

With one repo defined with use_repo_rule in MODULE.bazel (44s): image The CPUs never load at 100%

Build file, same in before and after:

cc_import(
    name = "OurTools",
    interface_library = "ourtools.lib",
    shared_library = "ourtools.dll",
    deps = [
        ":headers",
    ],
)

cc_library(
    name = "headers",
    hdrs = glob([
            "include/**/*.h", 
            "include/**/*.hpp", 
            "include/**/*.c"]),
    includes = [
            "include",
            "include/sub1/common",
            "include/sub2/include"],
)

Before: OurTools declared with a rule that calls download_and_extract in WORKSPACE

load("@our_toolchain//:tools/artifact_rules.bzl", "our_artifact_with_plat")
def define_OurTools_repo():
    our_artifact_with_plat(
        name = "OurTools",
        version = "2.1.0",
        build_file = "//src/repo/bazel/toolbox/build_files/wntx64:OurTools_BUILD.bazel",
        sha256 = "021fb717579ec6c3cc2f9eb8bacd8285801ac116e827ca7f430c0960fd0a2378",
    )

After: this is in MODULE.bazel:

our_artifact_with_plat = use_repo_rule("@our_toolchain//:tools/artifact_rules.bzl", "our_artifact_with_plat")
our_artifact_with_plat(
    name = "OurTools",
    version = "2.1.0",
    build_file = "//src/repo/bazel/toolbox/build_files/wntx64:OurTools_BUILD.bazel",
    sha256 = "021fb717579ec6c3cc2f9eb8bacd8285801ac116e827ca7f430c0960fd0a2378",
)

The reason this may cause a compile time difference is the include path. We use params files, the relevant differences are:

Workspace

/Iexternal/OurTools
/Ibazel-out/x64_windows-opt/bin/external/OurTools
/Iexternal/OurTools/include
/Ibazel-out/x64_windows-opt/bin/external/OurTools/include
/Iexternal/OurTools/include/icu/common
/Ibazel-out/x64_windows-opt/bin/external/OurTools/include/sub1/common
/Iexternal/OurTools/include/xalanc/include
/Ibazel-out/x64_windows-opt/bin/external/OurTools/include/sub2/include

Module & use_repo_rule:

/Iexternal/_main~_repo_rules~OurTools
/Ibazel-out/x64_windows-opt/bin/external/_main~_repo_rules~OurTools
/Iexternal/_main~_repo_rules~OurTools/include
/Ibazel-out/x64_windows-opt/bin/external/_main~_repo_rules~OurTools/include
/Iexternal/_main~_repo_rules~OurTools/include/icu/common
/Ibazel-out/x64_windows-opt/bin/external/_main~_repo_rules~OurTools/include/sub1/common
/Iexternal/_main~_repo_rules~OurTools/include/xalanc/include
/Ibazel-out/x64_windows-opt/bin/external/_main~_repo_rules~OurTools/include/sub2/include

Which category does this issue belong to?

C++ Rules, External Dependency

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I can't share my reproducer as it is proprietary.

Which operating system are you running Bazel on?

Windows 10

What is the output of bazel info release?

7.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

This is not a regression

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

fmeum commented 3 months ago

Could you share a Starlark profile of the builds? If you are worried about leaking target names, sharing a redacted screenshot of each would already be very helpful.

peakschris commented 3 months ago

I was wondering if it is due to the tildes ~ which are a special character in windows 8.3. filenames. I've disabled 8.3 filenames using fsutil 8dot3name set 3 but we have no way of knowing if any special code still executes in the windows kernel when it comes across a request for a file containing ~

@fmeum happy to provide a redacted profile, but starlark_cpu_profile is disabled on windows (https://github.com/bazelbuild/bazel/issues/13748). Is there another way I can provide some data?

Another thought; would it be hard to make me a trial bazel build using some other character that is not ~ - e.g a x? I assume the character is not hardcoded in many places?

peakschris commented 3 months ago

We have some new_local_repositories defined inside module extensions that appear to end up with a simple name on the filesystem and don't trigger this issue. Are there other ways to get simple names besides new_local_repository?

peakschris commented 3 months ago

Here is some analysis from Windows Performance Analyzer, I used it to capture each build. I think this shows that the extra time is spent in the compiler or OS, and not bazel. I think it is bazel's directory names that are the issue though ;-)

Workspace: image

Module: image

You can see that "parsing" time with module include path increases from 7s to 20s

You can also see the difference reflected in C1DLL stage of the compiler: Workspace image

Module image

fmeum commented 3 months ago

@peakschris I created this branch based on 7.2.1rc2 that uses . as a separator instead of ~. Can you build Bazel from that branch and test it? I don't have access to a Windows machine to compile a binary for you right now.

cc @meteorcloudy

peakschris commented 3 months ago

Thank you! This isn't working:

D:\udu\units\cb_0523_bazel2>bazel-mod-dot build //src/server/soa_framework/common/cpp/...
INFO: repository @@our_toolchain.' used the following cache hits instead of downloading the corresponding file.
 * Hash 'b5d45176a23aaa3173d2b6560df57283490e6756253ed4fb11b3855c1178eae1' for <internal-url-redacted>
If the definition of 'repository @@our_toolchain.' was updated, verify that the hashes were also updated.
ERROR: <builtin>: fetching http_archive rule //:our_toolchain.: java.io.IOException: http_archive rule //:our_toolchain. must create a directory
ERROR: Error computing the main repository mapping: error during computation of main repo mapping: http_archive rule //:our_toolchain. must create a directory

I wonder if windows has some oddities around trailing dots in directory names:

D:\workdir>mkdir "our_toolchain."

D:\workdir>dir our_toolchain*
 Volume in drive D is Data
 Volume Serial Number is 982B-CBB1

 Directory of D:\workdir

06/24/2024  06:18 AM    <DIR>          our_toolchain

Dots in the middle seem fine, but dots at the end cause issues:

D:\workdir>mkdir "_main.our_toolchain"
D:\workdir>dir _main*
 Volume in drive D is Data
 Volume Serial Number is 982B-CBB1

 Directory of D:\workdir

06/24/2024  06:20 AM    <DIR>          _main.our_toolchain
               0 File(s)              0 bytes
               1 Dir(s)  161,841,446,912 bytes free

D:\workdir>mkdir "_main.our_toolchain."
A subdirectory or file _main.our_toolchain. already exists.
peakschris commented 3 months ago

"+" seems to work on windows anywhere in filename without quoting in shell:

D:\workdir>mkdir _main+our_toolchain+

D:\workdir>dir _main+our_toolchain+
 Volume in drive D is Data
 Volume Serial Number is 982B-CBB1

 Directory of D:\workdir\_main+our_toolchain+

06/24/2024  06:31 AM    <DIR>          .
06/24/2024  06:31 AM    <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)  161,841,442,816 bytes free

D:\workdir>dir _main*
 Volume in drive D is Data
 Volume Serial Number is 982B-CBB1

 Directory of D:\workdir

06/24/2024  06:31 AM    <DIR>          _main+our_toolchain+
peakschris commented 3 months ago

I will make a local change to + and retry, now I know where the code changes should go

fmeum commented 3 months ago

+ looks promising. @Wyverald Do you happen to remember whether we explicitly decided against + at some point?

peakschris commented 3 months ago

I tested:

It fixes the issue.

fmeum commented 3 months ago

@bazel-io flag

fmeum commented 3 months ago

@peakschris Thanks for testing this. It looks like we need to switch to a different scheme.

peakschris commented 3 months ago

@fmeum you're welcome, glad to get to the bottom of this. Would this be a breaking change, behind a flag?

As an aside, when I'm building bazel should I expect Executing genrule //src:package-zip_jdk_minimal to take 1800s?

fmeum commented 3 months ago

Technically it wouldn't be as the docs clearly state that the particular naming scheme is an implementation detail, but due to limitations of the API users started to rely on it. Since the regression is so severe and silent, I do think that we need to fix it even in 7.2.

I've noticed this target being very slow to build on Windows, but not to that extent (a few minutes on my regular laptop).

iancha1992 commented 3 months ago

@bazel-io fork 7.3.0

peakschris commented 3 months ago

There are two further lines that need changing in src\main\java\com\google\devtools\build\lib\bazel\bzlmod\ModuleExtensionId.java

meteorcloudy commented 3 months ago

@peakschris Thank you so much for catching this and digging into the root cause.

I've disabled 8.3 filenames using fsutil 8dot3name set 3 but we have no way of knowing if any special code still executes in the windows kernel when it comes across a request for a file containing ~

So we can confirm that disabling 8.3 filenames on the volume doesn't fix this bug, right?

peakschris commented 3 months ago

@meteorcloudy that's correct, disabling 8.3 filenames does not fix this bug

Wyverald commented 3 months ago

+ looks promising. @Wyverald Do you happen to remember whether we explicitly decided against + at some point?

We passed on + since it's a valid character in version strings (as the delimiter before the "BUILD" segment), although effectively nobody uses it (in fact literally no module versions include the character + on BCR right now).

Since + can only appear at most once in the version string, and it cannot be used in module names, Starlark identifier names (i.e. extension names), or user-provided repo names, I think we can safely use it to replace ~ without running into conflicts.

Technically it wouldn't be as the docs clearly state that the particular naming scheme is an implementation detail, but due to limitations of the API users started to rely on it. Since the regression is so severe and silent, I do think that we need to fix it even in 7.2.

To be clear, this is a regression in all of Bzlmod, not specifically in 7.2 (or even 7.*), right? IMO this is enough of a breaking change that we need to wait until 8.0, preferably with an incompatible flag so people could test it beforehand. (EDIT: as a concession, I think it's fair to immediately flip and graveyard the flag in 8.0, so that we don't need to maintain the ~ syntax for another LTS cycle. IMO this is acceptable since we've repeatedly stated that the scheme is not an API people should rely on.)

Wyverald commented 3 months ago

We passed on + since it's a valid character in version strings (as the delimiter before the "BUILD" segment), although effectively nobody uses it (in fact literally no module versions include the character + on BCR right now).

I just realized that there's another reason why nobody ever used + in the version... Pre-7.1.0, when we had the version in the canonical repo name, @@foo~1.0.0+build was actually not a valid repo name as + was never a legal character in repo names (code)... So if we had any module with + in its version in the final resolved dep graph, Bazel would crash when it tries to create a RepositoryName object for it.

This only makes the switch to + safer, so I suppose it's good news.

fmeum commented 2 months ago

To be clear, this is a regression in all of Bzlmod, not specifically in 7.2 (or even 7.*), right? IMO this is enough of a breaking change that we need to wait until 8.0, preferably with an incompatible flag so people could test it beforehand. (EDIT: as a concession, I think it's fair to immediately flip and graveyard the flag in 8.0, so that we don't need to maintain the ~ syntax for another LTS cycle. IMO this is acceptable since we've repeatedly stated that the scheme is not an API people should rely on.)

Yes, I agree that flipping this should be done in 8 only. But we should get the flag out soon so that Windows folks can adopt it, either in 7.3.0 or even a 7.2.2 release.

Wyverald commented 2 months ago

OK, I have a test commit that changes the ~ in repo names to +. Ironically, it passes CI on every platform, except Windows, with this failure: https://storage.googleapis.com/bazel-untrusted-buildkite-artifacts/0190d899-3bd1-4d10-acc0-b6bb788bd4c9/src%5Ctest%5Cshell%5Cbazel%5Cbazel_bootstrap_distfile_test%5Ctest.log

After hours of debugging and tearing my hair out, it seems it's due to the copy command in Windows using + in a truly weird syntax:

> copy /?
Copies one or more files to another location. 

COPY [/D] [/V] [/N] [/Y | /-Y] [/Z] [/L] [/A | /B ] source [/A | /B]
     [+ source [/A | /B] [+ ...]] [destination [/A | /B]]

  source       Specifies the file or files to be copied.
  /A           Indicates an ASCII text file.
  /B           Indicates a binary file.
  /D           Allow the destination file to be created decrypted
  destination  Specifies the directory and/or filename for the new file(s).
  /V           Verifies that new files are written correctly.

[...]

To append files, specify a single file for destination, but multiple files
for source (using wildcards or file1+file2+file3 format).

That is, despite dir a+b.txt finding the right file, copy a+b.txt c.txt is actually similar to cat a b.txt > c.txt on Unix (as in, it concats multiple files). copy "a+b.txt" c.txt works as expected.

This was only caught by our dist archive integration test because this BUILD file in zlib uses the copy command in a genrule.cmd_bat, presumably to avoid a dependency on Bash on Windows:

https://github.com/bazelbuild/bazel-central-registry/blob/abc4722b9f53b9bc58790c7e5536621b3bbf8a23/modules/zlib/1.3.1/patches/add_build_file.patch#L70

Now, this is definitely fixable (by quoting appropriately), but it does demonstrate that + is not fully safe. Luckily, this weird parsing of + seems limited to the copy command; move, for example, doesn't treat + as a special character (although, funnily enough, it does treat the comma , as a special character -- Unix mv a b c is spelled move a,b c on Windows...).


As a footnote, here's an example usage of the copy command that truly blows my mind:

Combining each file that has a .txt extension with its corresponding .ref file creates a file with the same file name, but with a .doc extension. The Copy command combines file1.txt with file1.ref to form file1.doc, and then the command combines file2.txt with file2.ref to form file2.doc, and so on. For example, type:

copy *.txt + *.ref *.doc
peakschris commented 2 months ago

Wyverald, sounds like an epic debug session! Hope you've recovered... thank you for your work on this.

peakschris commented 2 months ago

Just to say that I've tested this flag in 7.3.0rc1, and it is working for our monorepo. The performance issue is resolved. Thank you for your work on this, and the future work, much appreciated.

meteorcloudy commented 2 months ago

@peakschris Thanks for reporting and diagnose this issue!