Important race condition when using remote cache without remote build

mmorearty commented 7 years ago

Description of the problem / feature request / question:

Our org has about 50 engineers using Bazel, with a remote cache but without remote build. About once every 3-4 weeks, I would hear of a case of someone's build that was failing in strange ways when the remote cache was turned on, but would succeed when the cache was off.

I finally got a repro case, and was able to figure out what was happening. Bazel's remote spawn strategy does its work in this order:

Compute the hash of the action (which includes the hash of all the input files)
Check if the remote cache contains the action result matching that hash
If it does, then download it
If it doesn't, then do a local build and upload the action result

But if the user makes any changes to the sources (e.g. saves a file, or uses git to check out a different branch of their code) after step 1, but before the build in step 3, then you will end up with an incorrect (corrupt) entry in the cache: Its hash matches the input files when step 1 executed, but its contents do not match what would have been built from those input files — its contents match what was built from the modified input files.

From that point on, all other developers on the team who try to build that same code will download the incorrect build results.

Interestingly, I think this cannot cause cache problems if you are using remote cache WITH remote build. In that scenario, Bazel's SimpleBlobStoreActionCache.uploadFileContents() will first read the file into memory, then compute its hash, and upload the already in-memory blob. So the file contents in the cache are guaranteed to be accurate (the source file's hash matches the source file's contents). I haven't tried it, but I imagine that the worst that could happen — and I'm not even sure if this could happen — is that the remote build might fail because it can't find the input files it wants. No serious harm done.

If possible, provide a minimal example to reproduce the problem:

This BUILD file has a rule called :race which, when built, will first sleep for a few seconds, and then will simply cat foo.in to foo.out:

genrule(
    name = "race",
    srcs = ["foo.in"],
    outs = ["foo.out"],
    cmd = "sleep 10; cat $(SRCS) > $@"
)

Make a file foo.in: echo 1 > foo.in

Start a build with a command line similar to this one (for this test, I have nginx running on localhost acting as a remote cache; you will need to set up some sort of similar remote cache):

bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 build --spawn_strategy=remote --remote_rest_cache=http://localhost:8000/bazel-cache/race :race

While the build is happening — during the sleep 10 — in another shell, change the contents of foo.in: echo 2 > foo.in

The build will complete, and bazel-genfiles/foo.out will, not surprisingly, contain 2. That's not a bug. But then:

Do a clean: bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 clean
Reset foo.in to its old contents: echo 1 > foo.in
Do another bazel build, with the same command line as above.

At this point, bazel-genfiles/foo.out will contain 2, because that's what Bazel found in the remote cache. This is definitely wrong.

Environment info

Operating System:

OSX; probably on Linux too, I haven't tested it

Bazel version (output of bazel info release):

0.5.2

ola-rozenfeld commented 7 years ago

That's awesome, thank you for reporting it!

ola-rozenfeld commented 7 years ago

Wait, but this is a race condition in local bazel caching as well. If you compile :race without any remote_cache, chnge foo.in mid-flight, then after the build change it back and and try to recompile :race again, it will say that the target is up-to-date, but give you incorrect results. I suspect the right thing to do would be to abort the build with an error as soon as Bazel notices that one of the input files was changed mid-compile (where changed means changed timestamp, disregarding the contents), before putting the results into any cache (remote or local).

mmorearty commented 7 years ago

I see you're right, it's a local bazel caching issue as well, I hadn't noticed that but it makes sense.

I agree with your suggestion of checking timestamps. I can't think of a better way to handle it than that.

When I was thinking this was only a problem with the remote cache, I thought bazel could just silently decide not to upload the incorrect files to the remote cache, but continue the local build. But since, as you say, the local cache is wrong too, maybe the build has to be aborted. Ugh. That feels undesirable, but I'm not sure what else to do. I suppose it could requeue the action and run it again, with a limit of 3 tries or something? That might be nice, but it's getting pretty fancy.

I'm also trying to figure out if there are more complicated scenarios that need to be considered. For example, suppose you have Package A, which compiles Input A, and Package B, which compiles Input B. A build is going at the same time that a git checkout is taking place. So Package A builds correctly, no errors; then, after it's done, Input A changes; and then Package B starts building. Does this mean that the build for Package B needs to recursively check the timestamps of all ancestor inputs in order to avoid corrupting the cache? I don't know.

mmorearty commented 7 years ago

As a longer-term solution, it might be interesting to explore copy-on-write filesystems: If sandboxed builds made copy-on-write copies of the input files instead of using symlinks, then the process of creating the copies should be fast and shouldn't use much additional disk space, but I think it would protect us against this (as long as the hashes or timestamps are double-checked after the copies are made).

But this is sort of exotic, and only some newer filesystems support this, such as macOS APFS, Windows ReFS, and Linux ZFS (I'm saying this based on a quick Google search, I may be wrong about support in those filesystems).

buchgr commented 7 years ago

cc: @ulfjack

ulfjack commented 7 years ago

Thanks for the report. Well spotted.

Technically, yes, there is the possibility of a race in the local case. However, it's sufficiently unlikely to happen. We can't make it completely reliable in all cases since it runs on your local machine and most people have root access to their local machines, so can do whatever they want.

The remote cache is a bigger concern, because the bad result from one machine can poison the results for other users. Note that any user can already poison the results for any other user with a shared cache. You should only use a shared cache if you completely, fully trust everyone in your team; anyone with write access to the shared cache can own everyone else's machine. As such, using a user-writable shared cache is something that we'll warn people against. I apparently didn't add that to the documentation yet.

The compromise is to have a cache that's written by an automated process, e.g., a CI system, which very few people have access to, and configure the clients for read-only access. That is, if you don't do remote execution on machines that are locked down.

That said, we might want to double-check the checksums of the files before we upload an entry to the remote cache, as modifying source files during a build is a common-enough operation that this can easily happen by accident.

If we enable --watchfs by default, we could check for relevant events for the relevant files between starting an action and writing the cache entry. The watcher APIs don't guarantee timely delivery, so an event that happens before writing a cache entry could still be delayed until after (so there's still a race), but it would be much less likely to happen. It also won't work for network file systems, because the relevant APIs usually don't.

I think this is going to require some infrastructure to make it possible to check for file change events in the right places. We could maybe double-check the checksum before uploading the cache entry, but that could also be prohibitively expensive.

Here's what isn't going to work:

we can't stop the build on any modified file - a lot of people leave Bazel running in a loop, or realize that they still need to change something after they started the build; it usually doesn't cause any problems, and we'd get a lot of angry users if we broke the build every time that happens
we're not going to copy the input files for every action; too expensive (we tried)
we can't use hard-links for every action; too expensive (we tried)
we can't rely on the file system - most systems don't have copy-on-write file systems readily available; relying on esoteric file systems is not an option

ola-rozenfeld commented 7 years ago

Checking the checksums doesn't eliminate the race condition in general (see provided example, where the file changes from 1 to 2 and back to 1). We need to check the timestamps, which might actually be cheaper. How about if we only add this change: check the timestamps of all input files before writing the result to a remote cache, and if we see changes from after the action execution started, we skip setting the action result in the remote cache? Perhaps that won't be too expensive -- I can try and profile that and see how bad it is. I believe that the cache poisoning should not be "transitive", due to how our Merkle trees work -- the next actions will depend on inputs with particular digests, so we shouldn't worry about propagating the errors, right?

The local cache problem means that the remote cache problem will be reproducible for a particular user -- their local cache will be hit first, return an incorrect result, and the build will continue wrongly from there. So users who run Bazel in a loop will need to remember to do a bazel clean often...

ulfjack commented 7 years ago

Sure, I think using the watcher is actually a better plan. Absent that, we'd actually check the ctime, which we already record in some cases (but maybe not all?).

Uhm, what? There's no problem with the local cache. The case that you describe is something we've never observed in practice, and anyone running bazel clean is doing it wrong.

buchgr commented 7 years ago

That said, we might want to double-check the checksums of the files before we upload an entry to the remote cache, as modifying source files during a build is a common-enough operation that this can easily happen by accident.

I think before uploading we should read the whole file into memory, then compute the checksum, then do the upload. This at least guarantees that the hash and the contents match.

Note that any user can already poison the results for any other user with a shared cache.

I agree and disagree. It seems to me that anyone would want to run with a content-addressable storage cache that computes the hash server-side and doesn't trust the client's hash. So unless a malicious user manages to create a hash-collision, things should be fine. The remote cache can use the hash as a hint and return an error if its own computation and the client provided hash mismatch, but should never just accept the client's hash.

ulfjack commented 7 years ago

We aren't uploading the input files to the remote cache. We do upload the output files, but the remote worker has code to double-check that the checksum matches.

The action cache is not a pure CAS. The action results are entries from action hash to action result. You're right that it's impossible to poison a CAS (if the remote server checks the checksums), but that's not the problem. I think it's not ideal that we store CAS and non-CAS entries in exactly the same way - implementations might need to distinguish between the two and use different storage mechanisms. It's on my TODO list to change the SimpleBlobStore interface (or introduce a new one) that allows implementations to store these separately, and then also update the Ondisk implementation to put the CAS and non-CAS entries into different directories.

Also, I apologize for my last comment. Thanks to Ola for pointing out the flaws in my not-well-thought-through previous post, and I agree that we should track file modification time. I'm still worried about costs, since it requires one additional stat per input file per action, but it's probably ok.

ulfjack commented 7 years ago

The action hash is a hash of the action description, not of the action result. If we knew the action result already, we wouldn't need to ask the cache.

ulfjack commented 7 years ago

I'm convincing myself that we shouldn't make a release without a bug fix or workaround for this issue.

buchgr commented 7 years ago

The 0.5.3 release will contain a bug fix by @ola-rozenfeld that should lower the chances of hitting the remote cache race by checking the input files ctime before and after execution. However, as @ulfjack explained to me, this will not fix the bug. The problem is that we need to compute the input's ctime at the time the action's hash is computed, which is usually much before the action is executed. So this fix still doesn't address this problem, but just reduces the time window it can. We don't know whether this is significant tough. @ulfjack mentioned we need to do some internal refactoring to be able to fix it for real. This will NOT be done in time for the 0.5.3 release.

mmorearty commented 7 years ago

Thanks for your work on this. @ola-rozenfeld's fix addresses the example I gave in my original description, but it's still not difficult to run into this problem with only slightly more complicated scenarios. This is real, and I think it's likely that my team would hit this if we were using the current version of Bazel. (We are currently using a fork of Bazel with my own fix, similar to what I describe below.) Luckily, I think a fix similar to Ola's is not too difficult (although unfortunately it wouldn't be as localized as her fix was).

The problem is that the current fix gets the ctime immediately before the action runs, and then immediately after the action runs; but often, the actual digests of the inputs files were computed much earlier during Bazel's execution.

The real-world scenario is: The user starts a build (with local compilation but with remote cache) which will take "a little while" (let's say, at least 30-60 seconds). While the build is going, the user saves a file in their editor — a file that Bazel has not yet used in any action (but Bazel has already computed its digest, because that happened fairly early in Bazel's execution). The result is cache corruption.

One way to address this would be: Immediately before any file's digest is computed, get its ctime; and save that somewhere (probably alongside the digest somehow). Then, much later, immediately after an action has executed, check if the ctimes of any of its input files have changed.

Here's a repro case for this scenario, structured quite similarly to the one above, except with two actions instead of one:

Make this BUILD file. It has two targets:

lib, which waits ten seconds and then copies lib.in to lib.out; and
main, which depends on lib, and also waits ten seconds and then copies both lib.out and main.in to main.out

genrule(
    name = "main",
    srcs = [":lib", "main.in"],
    outs = ["main.out"],
    cmd = "sleep 10; cat $(SRCS) > $@",
)

genrule(
    name = "lib",
    srcs = ["lib.in"],
    outs = ["lib.out"],
    cmd = "sleep 10; cat $(SRCS) > $@",
)

Make the files lib.in and main.in:

echo lib 1 > lib.in
echo main 1 > main.in

Start a build of :main with a command line similar to this one (for this test, I have nginx running on localhost acting as a remote cache; you will need to set up some sort of similar remote cache):

bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 build -s --spawn_strategy=remote --remote_rest_cache=http://localhost:8000/bazel-cache/race :main

While the build is happening — during the first sleep 10 (in other words, while :lib is sleeping) — in another shell, change the contents of main.in: echo main 2 > main.in

The build will complete, and bazel-genfiles/main.out will, not surprisingly, contain:

lib 1
main 2

That's not a bug. But then:

Do a clean: bazel --host_jvm_args=-Dbazel.DigestFunction=SHA1 clean
Reset main.in to its old contents: echo main 1 > main.in
Do another bazel build, with the same command line as above.

At this point, bazel-genfiles/main.out will still have main 2 on its second line, because that's what Bazel found in the remote cache. This is wrong.

mmorearty commented 6 years ago

The best way I can think of to address this is to change FileSystem.getDigest() so that instead of returning just the digest (as a byte[]), it returns a small class that has two members: the digest, plus the ctime that the file had just before the digest was computed. This is unfortunate because it will ripple through a whole lot of code. I can't think of a better way to do it.

How are filesystems exposed internally in Bazel? If they are located through some sort of registry, then perhaps another way to do it would be, FileSystem.getDigest() wouldn't change, but instead it would keep track (in the FileSystem class) of the ctime just before the digest was computed. And then there can be a separate call, FileSystem.getDigestCtime() or something, to get the ctime that the file had just before the digest was computed.

buchgr commented 6 years ago

What do you think about recomputing the action hash (incl. source file hashes, affected merkle tree paths) before uploading to the remote cache?

The code changes would be limited to the remote package. It's not clear to me that this would be noticeable in performance. Furthermore, the cost of hashing could be hidden by concurrently uploading output artifacts and only uploading the action hash if it did not change.

ulfjack commented 6 years ago

@buchgr, I'm not sure what you're proposing here, but rehashing all the input files is likely prohibitively expensive. Just think of rehashing the JDK for every Javac action - not an option. The right thing is to use the ctime to check for file changes, and conservatively not cache if the ctime changed (or re-execute). Note that it's unsafe to upload a cache entry if you aren't sure that the inputs match what the compiler saw during the action.

hchauvin commented 6 years ago

Maybe I'm missing something about the race condition here, but the way we avoid corrupting the cache is to recompute the hash on the cache service for every upload and throw an error if there is a mismatch. This way, in the case the file changes locally between the computation of the hash and the call to the remote cache service, the build just fails hand has to be started again.

buchgr commented 6 years ago

Maybe I'm missing something about the race condition here, but the way we avoid corrupting the cache is to recompute the hash on the cache service for every upload and throw an error if there is a mismatch. This way, in the case the file changes locally between the computation of the hash and the call to the remote cache service, the build just fails hand has to be started again.

Ideally yes, however, when using remote caching we never upload the inputs, so the server can't verify it. Also, this would prevent people from just using nginx with webdav as a backend.

ulfjack commented 6 years ago

I have a fix and it turned out to be easier than expected (after several dead ends). You can find it here: https://bazel-review.googlesource.com/c/bazel/+/31250

I don't have any tests yet, but it might be reasonable to merge without. I'll try to merge it in the next few days.

mmorearty commented 6 years ago

Awesome, I did a quick test and your fix seems to work for me. Please note, it works when I used --experimental_remote_spawn_cache (which is what I actually use in my real build); it did not seem to have any effect when I used --spawn_strategy=remote. The latter never called RemoteSpawnCache.store(), where the check happens.

Also: I take it it's deliberate that RemoteSpawnCache.store() calls checkForConcurrentModifications() twice?

ulfjack commented 6 years ago

Thanks for the review, and no, it shouldn't call it twice. That was an accidental left-over from an intermediate state of the change. It's correct that it only works for experimental_remote_spawn_cache - I'll update this to also cover the spawn strategy.

ulfjack commented 6 years ago

@buchgr reported that he's gotten a lot of warnings with my patch.

ulfjack commented 6 years ago

The problem occurs if an action gets marked as dirty, but turns out to be an action cache hit. Then Bazel caches the old FileArtifactValue, which is incorrect.

ulfjack commented 6 years ago

Skyframe returns the old FileArtifactValue if the new value is equal to the old one, and I didn't update equals to also cover the ctime. So if the ctime changes but the digest is the same (e.g., on an action cache hit), then the SpawnActionRunner thinks the file was modified underneath.

mmorearty commented 6 years ago

0.10.0 is out now. Just double-checking, I take it this issue is not fixed in 0.10.0, correct?

ulfjack commented 6 years ago

That's right - the changes didn't make it into 0.10.0, and it's also disabled right now.

buchgr commented 6 years ago

@ulfjack with your equals method fix in, is there anything stopping us from enabling it again?

ulfjack commented 6 years ago

@buchgr not that I know of.

ulfjack commented 6 years ago

Maybe you can give it a try first?

guoshimin commented 6 years ago

Any update on this? Just ran into what seems to be the same issue as this with 0.15.2, with remote exec and cache.

beala-stripe commented 6 years ago

I noticed the bazel docs mention a useful flag that hasn't been mentioned in this thread yet:

  --[no]experimental_guard_against_concurrent_changes (a boolean; default: "false")
    Turn this off to disable checking the ctime of input files of an action
    before uploading it to a remote cache. There may be cases where the Linux
    kernel delays writing of files, which could cause false positives.

buchgr commented 6 years ago

@mmorearty are you using this flag at Asana? Is it working well for you?

jgavris commented 6 years ago

I've been rocking this flag since it was added, have yet to see the issue since.

ulfjack commented 6 years ago

Cool! Let's switch it on.

mmorearty commented 6 years ago

Yes, it’s been working perfectly for me too!

ola-rozenfeld commented 6 years ago

That flag was Ulf's fix referenced in https://github.com/bazelbuild/bazel/issues/3360#issuecomment-356528977. I guess we didn't mention the flag name because we set it to true by default at first, and then when Ulf discovered a problem with it he reverted the default to false.

buchgr commented 6 years ago

I am consistently able to reproduce the issue with a Bazel build, but haven't had time to dig deeper :. Thus I am somewhat reluctant to enable it by default. I think at the very least we should document it though?

ulfjack commented 6 years ago

@buchgr Are you able to reproduce with experimental_guard_against_concurrent_changes enabled?

uri-canva commented 5 years ago

What's the remaining issue with this flag? I'm considering running a cache locally on developer laptops instead of using the dir cache as discussed in https://github.com/bazelbuild/bazel/issues/5139.

buchgr commented 5 years ago

@uri-canva on our CI we have as of last week seen actions being marked as modified even though no files have been modified on macOS. I did not have time to look into the issue yet. I am not aware of any problems on Linux.

ob commented 5 years ago

@buchgr, do you mean with the flag or without? I am not enabling this flag and have seen the same behavior on macOS: Actions run locally when I know for a fact they are fully cached.

I opened https://github.com/bazelbuild/bazel/issues/7084 to help with debugging.

virusdave commented 5 years ago

What's the status of this?

MikhailTymchukFT commented 4 years ago

@ulfjack

Absent that, we'd actually check the ctime, which we already record in some cases (but maybe not all?).

Where do you store this info?

ulfjack commented 4 years ago

In FileContentsProxy via FileArtifactValue.

The check is already implemented, though. Just enable

--experimental_guard_against_concurrent_changes

cpsauer commented 4 years ago

Is there something blocking this from being enabled by default? Seems like a pretty clear, serious correctness issue, solved by @ulfjack's great work.

nouiz commented 4 years ago

I agree, the default should be safe. For people that can guarantee the source file doesn't change, giving them a speed up knob can be useful. But it should be safe by default.

philwo commented 3 years ago

@buchgr @ulfjack Does anyone remember why this flag isn't enabled by default? 😅 Is it because of the "There may be cases where the Linux kernel delays writing of files, which could cause false positives." part in the documentation of the flag?

uri-canva commented 3 years ago

According to the comments above the issue happens on macOS too. We disabled it regardless of that because the overhead is very high, but I assume since it's not enabled by default there haven't been many eyes on that anyway.

mmorearty commented 3 years ago

If this context helps, I can tell you that I don’t know how we would use Bazel if this flag wasn’t available. (We are a Mac shop.) Before this flag existed, corruption of our remote cache was somewhat rare, but when it did happen — perhaps once a month — we were pretty mystified and would have to spend a long time trying to figure out what happened and what to do about it.

bazelbuild / bazel