bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0
23.06k stars 4.04k forks source link

Skyframe invalidation with disk or remote cache #22367

Open fmeum opened 4 months ago

fmeum commented 4 months ago

Description of the bug:

When using a disk cache (possibly also a remote cache), Bazel repeatedly marks output files obtained from the cache as dirty to trigger a download, but then doesn't download the files, so they are still missing on the next invocation.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  1. Patch Bazel with the diff further below.
  2. Repeatedly run baze --nosystem_rc --nohome_rc build --disk_cache=some_path //... on some project with Java targets.

See lines such as:

artifact is dirty: File:[[<execution_root>]bazel-out/darwin_arm64-fastbuild/bin]jeydtzir/libjeydtzir-hjar.jdeps RemoteFileArtifactValueWithExpiration{digest=0x7FACF6298A30C6CB373A988E3A3683CD3E8E9C46C59ECC570F089867234A3FEF, size=23, locationIndex=1, materializationExecPath=null, expireAtEpochMilli=1715713661059}
Invalidating: [ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//vkprdlnk:vkprdlnk, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=//wztejdxc:wztejdxc, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=0}, ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=@@rules_java~//toolchains:platformclasspath, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=0}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//jeydtzir:jeydtzir, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=//vkprdlnk:vkprdlnk, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=0}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//mynbiqpm:mynbiqpm, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//zjplsgqe:zjplsgqe, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=@@rules_java~//toolchains:platformclasspath, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}, ActionLookupData0{actionLookupKey=ConfiguredTargetKey{label=//jeydtzir:jeydtzir, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=0}, ActionLookupData1{actionLookupKey=ConfiguredTargetKey{label=//wztejdxc:wztejdxc, config=BuildConfigurationKey[b85621dae172f731976165fe86631f99030f401e70ed2b02d58c69595e5ea36d]}, actionIndex=1}] com.google.devtools.build.skyframe.InvalidatingNodeVisitor$DirtyingInvalidationState@104d38f9

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

diff --git a/src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java b/src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java
index 60a08ddf68..363c5291af 100644
--- a/src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java
+++ b/src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java
@@ -539,6 +539,7 @@ public class FilesystemValueChecker {
     for (Map.Entry<Artifact, FileArtifactValue> entry : actionValue.getAllFileValues().entrySet()) {
       if (artifactIsDirtyWithDirectSystemCalls(
           knownModifiedOutputFiles, remoteArtifactChecker, entry, modifiedOutputsReceiver)) {
+        System.err.println("artifact is dirty: " + entry.getKey() + " " + entry.getValue());
         isDirty = true;
       }
     }
@@ -551,6 +552,8 @@ public class FilesystemValueChecker {
           tree.getChildValues().entrySet()) {
         if (artifactIsDirtyWithDirectSystemCalls(
             knownModifiedOutputFiles, remoteArtifactChecker, childEntry, modifiedOutputsReceiver)) {
+          System.err.println(
+              "tree artifact entry is dirty: " + childEntry.getKey() + " " + childEntry.getValue());
           isDirty = true;
         }
       }
@@ -572,6 +575,7 @@ public class FilesystemValueChecker {
       if (shouldCheckTreeArtifact(sortedKnownModifiedOutputFiles.get(), treeArtifact)
           && treeArtifactIsDirty(treeArtifact, entry.getValue())) {
         // Count the changed directory as one "file".
+        System.err.println("tree artifact entry is dirty: " + treeArtifact);
         modifiedOutputsReceiver.reportModifiedOutputFile(
             getBestEffortModifiedTime(treeArtifact.getPath()), treeArtifact);
         isDirty = true;
diff --git a/src/main/java/com/google/devtools/build/skyframe/EagerInvalidator.java b/src/main/java/com/google/devtools/build/skyframe/EagerInvalidator.java
index f8f2c0d560..07988108b3 100644
--- a/src/main/java/com/google/devtools/build/skyframe/EagerInvalidator.java
+++ b/src/main/java/com/google/devtools/build/skyframe/EagerInvalidator.java
@@ -82,6 +82,7 @@ public final class EagerInvalidator {
       DirtyAndInflightTrackingProgressReceiver progressReceiver,
       InvalidationState state)
       throws InterruptedException {
+    System.err.println("Invalidating: " + diff + " " + state);
     DirtyingNodeVisitor visitor =
         createInvalidatingVisitorIfNeeded(graph, diff, progressReceiver, state);
     if (visitor != null) {
fmeum commented 4 months ago

cc @tjgq @coeuvre

coeuvre commented 4 months ago

Probably this: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/skyframe/FilesystemValueChecker.java;l=121;drc=e03de754ec15096fcd1f701ea0b2d4c09c4b97a1

Can you repro with --noexperimental_merged_skyframe_analysis_execution?

coeuvre commented 4 months ago

so they are still missing on the next invocation.

Are these outputs should be downloaded (e.g. toplevel outputs, or specified by --remote_download_regex)?

If Bazel marks an action dirty to trigger a download but then doesn't download it because it shouldn't, then it's a performance issue. Otherwise, it's a correctness issue.

fmeum commented 4 months ago

Can you repro with --noexperimental_merged_skyframe_analysis_execution?

I can't.

Are these outputs should be downloaded (e.g. toplevel outputs, or specified by --remote_download_regex)?

No, these outputs should not be downloaded (just hjars, jdeps files, ...), so I think this is just a performance issue.

fmeum commented 4 months ago

This is the root cause: https://github.com/bazelbuild/bazel/blob/43ad74bec433c1923e2ce78605ea04cac0cdb324/src/main/java/com/google/devtools/build/lib/buildtool/ExecutionTool.java#L328-L337

Note the RemoteArtifactChecker.IGNORE_ALL, which means that all remote artifacts will be marked as dirty. Happy to work on this, are there any ideas of what would need to be done to improve this? CC @joeleba

coeuvre commented 4 months ago

skymeld calls skyframeExecutor.detectModifiedOutputFiles when the first toplevel target is analyzed. However, the RemoteArtifactChecker for BwoB needs full analyze result to be able to correctly determine which action should be marked as dirty because it records the path of toplevel outputs in a trie: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/RemoteOutputChecker.java;l=128;drc=6f48f1c2b3bb73768b9ff15ce6698e21eddc503a (there are some skymeld only code, but is for clean build). Otherwise, actions for other toplevel targets will not be invalidated. So we decided to use RemoteArtifactChecker.IGNORE_ALL for now.

One potential solution I have discussed with @joeleba offline is to let skymeld calls skyframeExecutor.detectModifiedOutputFiles for each toplevel target so that the RemoteArtifactChecker.IGNORE_ALL can be replaced with the one from BowB.

coeuvre commented 4 months ago

This is the root cause:

Argh, I intended to post this link but somehow my clipboard is corrupted.

joeleba commented 4 months ago

This is a known limitation of Skymeld + BwoB. It's a tradeoff essentially: having both skymeld + bwob improves clean builds' performance, but comes with a performance penalty for incremental builds. Luckily the penalty only comes from repeating the skyframe work, since the have other layers of caches for actions.

In the current state, detectModifiedOutputFiles is called once before any action execution in the build. At this point, it's not possible to construct the full picture of what's dirty or not in the RemoteArtifactChecker (only available after the full analysis).

To resolve this, we would need to do detectModifiedOutputFiles incrementally, as the information for each top level target/aspect becomes available.

brentleyjones commented 3 months ago

We reproduced this with just the remote cache. Can this have a higher than P3 priority?

coeuvre commented 3 months ago

Raising it to P2 but we probably won't have time to work on the fix until next month or two. In the mean while, PR is welcome!

brentleyjones commented 1 month ago

Friendly ping on this.

coeuvre commented 1 month ago

I have a proper fix and should be able to submit it in a few days.

coeuvre commented 1 month ago

@bazel-io fork 7.4.0

coeuvre commented 1 month ago

This is fixed both in master and release-7.4.0 branches.

@brentleyjones Can you verify whether it improves your builds?

brentleyjones commented 1 month ago

Thanks! I'll have to wait for an RC to start validation.