bazelbuild / bazel


Implement automatic garbage collection for the disk cache #5139

Open buchgr opened 6 years ago

buchgr commented 6 years ago

Break out from https://github.com/bazelbuild/bazel/issues/4870.

Bazel can use a local directory as a remote cache via the --disk_cache flag. We want it to also be able to automatically clean the cache after a size threshold has been reached. It probably makes sense to clean based on least recently used semantics.
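For reference, enabling the disk cache today is a single flag, typically set in .bazelrc (the path here is illustrative), with no size bound or eviction:

# ~/.bazelrc
build --disk_cache=~/.cache/bazel-disk-cache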

@RNabel would you want to work on this?

@RNabel @davido

davido commented 6 years ago

I will look into implementing this, unless someone else is faster than me.

RNabel commented 6 years ago

I don't have time to work on this right now. @davido, if you don't get around to working on this in the next 2-3 weeks, I'm happy to pick this up.

daghub commented 6 years ago

Hi, I would also very much like to see this feature implemented! @davido , @RNabel did you get anywhere with your experiments?

RNabel commented 6 years ago

Not finished, but had an initial stab: https://github.com/RNabel/bazel/compare/baseline-0.16.1...RNabel:feature/5139-implement-disk-cache-size (this is mostly plumbing and figuring out where to put the logic; it definitely doesn't work)

I figured the simplest solution is an LRU relying on the file system for access times and modification times. Unfortunately, access times are not available on Windows through Bazel's file system abstraction. One alternative would be a simple database, but that feels like overkill here. @davido, what do you think is the best solution here? Also happy to write up a brief design doc for discussion.

buchgr commented 6 years ago

What do you guys think about just running a local proxy service that has this functionality already implemented? For example: https://github.com/Asana/bazels3cache or https://github.com/buchgr/bazel-remote? One could then point Bazel to it using --remote_http_cache=http://localhost:XXX. We could even think about Bazel automatically launching such a service if it is not running already.
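For illustration, a minimal setup with bazel-remote might look like the sketch below (flag names as I recall them from the project's README; the port and paths are illustrative, so check the project for the current interface):

# Run bazel-remote locally with a size cap (in GiB) and built-in LRU eviction.
bazel-remote --dir "$HOME/.cache/bazel-remote" --max_size 10 --port 8080

# Point Bazel at it instead of using --disk_cache.
bazel build //... --remote_http_cache=http://localhost:8080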

ittaiz commented 6 years ago

I think @aehlig solved this problem for the repository cache. Maybe you can borrow his implementation here as well. @buchgr, I feel this is core Bazel functionality, and in my humble opinion outsourcing it isn't the right direction. People at my company are often amazed Bazel doesn't have this fully supported out of the box.


aehlig commented 6 years ago

I think @aehlig solved this problem for the repository cache. Maybe you can borrow his implementation here as well.

@ittaiz, what solution are you talking about? What we have so far for the repository cache is that the file gets touched on every cache hit (see e0d80356eed), so that deleting the oldest files would be a cleanup; the latter, however, is not yet implemented, for lack of time.

For the repository cache, it is also a slightly different story, as cleanup should always be manual; upstream might have disappeared, so the cache might be the last copy of the archive available to the user, and we don't want to remove that on the fly.
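A sketch of that scheme in shell terms (the actual code is Java, and the cache layout shown here is an assumption, so treat the paths as hypothetical):

# On every cache hit, refresh the entry's mtime (what e0d80356eed does).
touch "$REPO_CACHE/content_addressable/sha256/$DIGEST/file"

# A later, manual cleanup could then review the oldest entries:
find "$REPO_CACHE" -type f -mtime +180 -print   # print only; deletion stays manual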

buchgr commented 6 years ago

outsourcing it isn’t the right direction

I would be interested to learn more about why you think so.

ittaiz commented 6 years ago

@aehlig sorry, my bad. You are indeed correct. @buchgr, I think so because a disk cache is a really basic feature of Bazel, and the fact that it doesn't work like this by default is IMHO a leaky abstraction (of how exactly the caches work), influenced greatly by the fact that Googlers work mainly (almost exclusively?) with remote execution. I've explained Bazel to tens, maybe hundreds, of people. All of them were surprised the disk cache isn't out of the box (eviction-wise and also plans-wise, like we discussed).


buchgr commented 6 years ago

@ittaiz the disk cache is indeed a leaky abstraction that was mainly added because it was easy to do so. I agree that if Bazel should have a disk cache in the long term, then it should also support read/write through to a remote cache and garbage collection.

However, I am not convinced that Bazel should have a disk cache built in but instead this functionality could also be handled by another program running locally. So I am trying to better understand why this should be part of Bazel. Please note that there are no immediate plans to remove it and we will not do so without a design doc of an alternative. I am mainly interested in kicking off a discussion.

ittaiz commented 6 years ago

Thanks for the clarification, and I appreciate the discussion. I think users don't want to operate many different tools and servers locally; they want a build tool that works. The main disadvantage I see is that it sounds like you're offering a cleaner design at the user's expense.


buchgr commented 6 years ago

I think that users don’t want to operate many different tools and servers locally.

I partly agree. I'd argue that in many companies this would change, as you would typically have an IT department configuring workstations and laptops.

The main disadvantage I see is that it sounds like you’re offering a cleaner design at the user’s expense.

I think that also depends. I'd say that if one only wants to use the local disk cache, then I agree that providing two flags is as frictionless as it gets. However, I think it's possible that most disk cache users will also want remote caching/execution, and for them this might not be noteworthy additional work.

So I think there are two possible future scenarios for the disk cache:

  1. Add garbage collection to the disk cache and be done with it.
  2. Add garbage collection, remote read fallback, remote write and async remote writes.

I think 1) makes sense if we think the disk cache will be a standalone feature that a lot of people will find useful on its own, and if so, I think it's worth the effort to implement this in Bazel. For 2) I am not so sure, as I can see several challenges that might be better solved in a separate process.

So it might make sense to have a standard local caching proxy: a separate process that can be operated independently and/or that Bazel can launch automatically for improved usability.

bayareabear commented 5 years ago

Is there any plan to roll out the "virtual remote filesystem" soon? I am interested to learn more about it and can help if needed. We are hitting a network speed bottleneck.

buchgr commented 5 years ago

yep, please follow https://github.com/bazelbuild/bazel/issues/6862

thekyz commented 4 years ago

any plan of implementing the max size feature or a garbage collector for the local cache?

brentleyjones commented 4 years ago

This is a much-needed feature in order to use Remote Builds without the Bytes, since naively cleaning up the disk cache results in build failures.

nkoroste commented 4 years ago

Any updates on this?

wesleyw72 commented 4 years ago

+1 We would like to be able to set the max size for the cache. Currently we rely on users doing this manually. We could add a script to do this but it feels like it would be a good feature for Bazel to have.

mr-salty commented 4 years ago

+1 on this, I had to write a script to keep my local disk from filling up. (by doing this I also discovered that something creates non-writable directories in .cache/bazel which seems bad in general)

tkbrex commented 4 years ago

+1 on this feature request. I need it so I can run it inside a Docker container.

mihaigalos commented 4 years ago

+1.

TamaMcGlinn commented 3 years ago

Some of you have mentioned that you have implemented your own workarounds; it would be great to post them in this thread, because mine is just terrible: when my OS complains that it has 0 bytes left, I delete ~/.cache/bazel and the next build is very slow.

nouiz commented 3 years ago

On Linux, I was using the find command to delete the oldest files, something like: find /PATH_TO_DIRECTORY -type f -mtime +60 -delete

The +60 means to delete files not changed in the last 60 days, so adjust this value depending on how quickly the cache fills.

Take care with that command, it is dangerous! It can easily delete too many files.

GMNGeoffrey commented 3 years ago

I had a workaround similar to @nouiz's, but on a crontab:

@daily find /usr/local/google/home/gcmn/.cache/bazel* -mtime +12 -type f -delete

but it ended up causing really hard to debug issues (see https://github.com/bazelbuild/bazel/issues/12630).

Note that there are two Bazel caches here.

The one stored in ~/.cache/bazel by default is not the disk cache referenced in this bug. It's the output directory for builds (see https://docs.bazel.build/versions/master/output_directories.html). This will contain the install base, a second directory (I have no idea what it's for), and one output base per workspace root.

Probably you don't want to delete the first two directories (well, as I said, I have no idea what the second one is for, but best not to touch it). They don't seem to grow in size over time either.

Based on my experience in https://github.com/bazelbuild/bazel/issues/12630, the cache entries for the individual workspace roots are not at the file level, however. That is, you can't just delete a single file in this "cache" and expect a correct build. They're at some directory level that is more granular than the whole directory, but I'm honestly not sure how much more. To make things more interesting, the timestamps of files are stubbed out in some places, so mtime is going to behave poorly on them.

So I think the thing to do here is to look at the mtime of $OUTPUT_BASE/lock. This should reflect the last time the entire directory was actually used and would help you clean up old directories. I'm pretty sure you could delete things in a more granular fashion, but it would require more investigation to see how to do so safely. For example, some of these are fetches of entire external repositories that Bazel will refetch if they're not present (but will get very upset if only part of them is present).
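A hedged sketch of that lock-mtime cleanup, assuming the default Linux output root (~/.cache/bazel/_bazel_$USER) and that no Bazel server is currently running; the layout on your machine may differ:

# Remove entire output bases whose lock hasn't been touched in 30+ days.
# Never delete individual files inside an output base (see above).
for lock in "$HOME"/.cache/bazel/_bazel_"$USER"/*/lock; do
  [ -e "$lock" ] || continue
  if [ -n "$(find "$lock" -mtime +30)" ]; then
    output_base=$(dirname "$lock")
    echo "removing stale output base: $output_base"
    chmod -R u+w "$output_base" && rm -rf "$output_base"   # some dirs are non-writable
  fi
done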

Now moving to the Bazel disk cache, which is actually what's referenced in this bug. You determine the location of this directory based on --disk_cache. Personally, I set build --disk_cache=~/.cache/bazel-disk-cache in ~/.bazelrc so it always goes there. I think my aforementioned cron was behaving fine with this cache, for which individual files are entire cache entries (at least I didn't notice anything like the other issue). For now, I've disabled my cron and will reinvestigate it the next time Bazel brings my machine to a screeching halt by using all my disk space.

The general theme here is that Bazel has caches, but they're missing a pretty key feature of caches: eviction. Without them, users are left implementing weird and hacky workarounds. I wish someone from the Bazel team could at least endorse some workaround (like a safe script to run on a cron).

mr-salty commented 3 years ago

~/.cache/bazel was definitely growing (apparently without bound?) for me.

I changed jobs so I'm not using bazel anymore, but this is the script I was running from cron: https://gist.github.com/mr-salty/a66119941e797d9eb49b15ea211ea968

It's mostly just find but takes care of some subtle issues. I never did track down how I ended up with non-writable directories in my cache, maybe something involving Docker, so most people may not need that part. Feel free to use it as needed, but Bazel should really take care of this itself.

TamaMcGlinn commented 3 years ago

@nouiz thank you for the valiant effort, but at least on my machine, I can see from this command that some, if not all, of my bazel files were made on January 1st, 1970.

find ~/.cache/bazel/ -type f -mtime +12000 | xargs ls -la

If anyone wants to use find mtime to get old files, run something like that first to check that you are not just going to delete everything.

nouiz commented 3 years ago

Thanks for the warning. I do not recall having such problems with the dates/times. It is good to know!

nkoroste commented 3 years ago

I think you want to use access time (-atime) instead of modification time (-mtime); there could be older files that don't change frequently but are still used during each build.

nouiz commented 3 years ago

Good point, but atime doesn't always work. On NFS servers, atime updates are disabled most of the time. I do not recall all the details of why I used mtime; I found something that works well enough for me and used it ;)

TF frequently forces a rebuild of everything, so for me it wouldn't have helped.

GMNGeoffrey commented 3 years ago

at least on my machine, I can see from this command that some, if not all, of my bazel files were made on January 1st, 1970.

Yeah this is what I mentioned above

the timestamps of files are stubbed out in some places

apparently they do this to avoid build dependence on the timestamp of the file (and therefore loss of hermeticity/caching)

chandlerc commented 3 years ago

Pulling back to the --disk_cache aspect itself...

I find the boring use of a disk cache without any remote caching pretty awesome. It is essentially ccache but substantially better and more powerful. I would really love to see some way to bound the size / GC the old entries.

I'll note that GitHub's new, fancy action system has a built-in and very nice way to persist caches from run-to-run for things like CI. Using this in conjunction with the disk cache of Bazel results in a very clean way to have very fast CI builds with very minimal risk of corruption by keeping the cached state extremely small and focused. However, it needs some way to GC things.

Currently, for CI I use a terrible hack: manually setting all the atimes back by several years for the entire disk cache before running the build, and then deleting any parts of the disk cache whose atime isn't updated during the build. This has the rough effect of working even with Linux-style relatime. It of course won't work on filesystems without atime or, I suspect, on platforms like Windows. Something from Bazel itself would be fantastic.
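Roughly, the hack amounts to the following (a sketch assuming GNU find/touch and an atime-capable filesystem; the paths are illustrative, not the exact CI script):

# 1. Rewind every cache entry's atime far into the past.
find "$BAZEL_DISK_CACHE" -type f -exec touch -a -t 200001010000 {} +
touch /tmp/gc_marker                  # records when the build starts
# 2. Build; entries that are read get fresh atimes (this works even under
#    relatime, because the rewound atime is now older than the mtime).
bazel build //...
# 3. Delete anything whose atime was not refreshed during the build.
find "$BAZEL_DISK_CACHE" -type f ! -anewer /tmp/gc_marker -delete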

j3parker commented 3 years ago

I'll note that GitHub's new, fancy action system has a built-in and very nice way to persist caches from run-to-run for things like CI. Using this in conjunction with the disk cache of Bazel results in a very clean way to have very fast CI builds with very minimal risk of corruption by keeping the cached state extremely small and focused. However, it needs some way to GC things.

💯

That cache is specifically 5GB per repo:

A repository can have up to 5GB of caches. Once the 5GB limit is reached, older caches will be evicted based on when the cache was last accessed. Caches that are not accessed within the last week will also be evicted.

nickbreen commented 3 years ago

A reasonably straightforward command (based on this SO question):

# find files; sort by last accessed time [%A@]; accumulate file size in 512B blocks [%b]; print path [%p] when capacity exceeded; delete
find $BAZEL_DISK_CACHE -type f -printf '%A@ %b %p\0' |
    sort --numeric-sort --reverse --zero-terminated |
    awk --assign RS='\0' --assign ORS='\0' --assign CAPACITY=$((1 * 1024 ** 3 / 512)) '{du += $2}; du > CAPACITY { print $3 }' |
    xargs -r0 rm
nkoroste commented 3 years ago

A reasonably straightforward command (based on this SO question):

Is this better than some of the solutions already mentioned above? For example:

find "$CACHE_DIR" -type f -atime +$DAYS_OF_CACHE_TO_KEEP -delete >/dev/null 2>/dev/null

(can be replaced with mtime depending on the use case)

That being said, the main issue with any of these solutions is that, depending on the cache size, deletion can take a very long time. That means you have to background the task, which in turn introduces more complexity, because now you want to block future Bazel calls while deletion is still in progress.

GMNGeoffrey commented 3 years ago

is this better than some of the solutions already mentioned above?

It allows you to specify a max size instead of a max age, which is pretty nice. Users doing their own cleanup with random scripts is definitely not ideal in general.

nkoroste commented 3 years ago

Ah right, that is handy. P.S. If anyone is looking for the fastest possible way to delete a large number of files, I found that rsync or perl is fastest. More info here: https://unix.stackexchange.com/questions/37329/efficiently-delete-large-directory-containing-thousands-of-files

jscheid-ventana commented 2 years ago

I ran into issues with partial cache content deletions, as we have potentially continuous use of the cache with no downtime. We now just rename the cache directory and then delete all of it. This is unfortunate.

See issues like https://github.com/bazelbuild/bazel/issues/8508#issuecomment-511664292 and https://github.com/bazelbuild/bazel/issues/8250

limdor commented 1 year ago

I came across this issue when trying to find out whether there is a bazel clean parameter to clean the disk cache. I find the proposal in this ticket great, but wouldn't it make sense to also implement it on bazel clean? At least additionally; it does not have to be instead. The reason is that you would have more control over what and when to delete. You could decide, for example, that every night you do a bazel clean of the disk cache and afterwards execute a build. I could see a command like this being very handy: bazel clean --disk_cache_last_access_older_than=<N days>. For example, bazel clean --disk_cache_last_access_older_than=30 would delete the artifacts in the cache that have not been accessed in the last 30 days.

If you find this proposal interesting just let me know if we should continue the discussion here or just create another ticket to not mix topics. I find it relevant to post the initial thoughts here because I think it is closely related.

chandlerc commented 1 year ago

What do you guys think about just running a local proxy service that has this functionality already implemented? For example: Asana/bazels3cache or buchgr/bazel-remote? One could then point Bazel to it using --remote_http_cache=http://localhost:XXX. We could even think about Bazel automatically launching such a service if it is not running already.

I think this is a fine implementation strategy, but it should really be an implementation detail IMO -- something Bazel does transparently behind the scenes.

I think the real goal for user experience should be a configured max size for the disk cache, just like ccache and other tools provide that users set and "forget".

brentleyjones commented 1 year ago

Seems that with https://github.com/bazelbuild/bazel/commit/97f64817472737960841c255baf00bc18df7c6e6 implemented, this can be done pretty easily by Bazel now?

Ryang20718 commented 1 year ago

Even with a remote cache, someone running bazel test locally would continue to generate runfiles, right?

These runfiles would then accumulate, and eventually Bazel would halt? If anyone has a workaround besides pruning based on mtime and using a remote cache, I would greatly appreciate the suggestion!

snakethatlovesstaticlibs commented 1 year ago

+1 to this issue as well. I'm trying to POC Bazel as a new build system on my team, and setting up a cache server for a prototype seems like overkill.

coeuvre commented 1 year ago

@tjgq and I had several brainstorming sessions about garbage collection for the disk cache over the past few days, and we now have a solid design! We don't have a timeline for the implementation yet, but Q3 looks like a reasonable slot.

coeuvre commented 1 year ago

There were two important questions for our design of the garbage collector:

  1. How to decide which blobs in the disk cache should be removed?
  2. How/when to run garbage collection for a shared disk cache?

The answer to the first question is similar to the workarounds posted here: use mtime. However, since we now have the lease service, which extends the leases of blobs required by Bazel during the invocation, we can have the disk cache update the mtime of referenced blobs during lease extension. This allows us to do garbage collection based on the real access pattern. Combined with --experimental_remote_cache_ttl, we won't delete blobs that are still needed by Bazel. On the other hand, even if blobs were accidentally deleted, causing a remote cache eviction error, the lease service can help Bazel recover from it.

We don't want garbage collection to impact build performance, nor should users be aware of it, so it happens in the background between invocations. The garbage collection should be interruptible and resumable, so that upon a new invocation we can cancel it immediately without blocking the invocation and continue it afterwards. Since the CAS is sharded by the first byte of the blobs' digest, we define the unit of work for garbage collection as the shard, i.e. we can resume garbage collection on a per-shard basis.

We don't want to run garbage collection after every invocation, so we store global state (e.g. when the last garbage collection finished) in the disk cache and schedule garbage collection based on that (e.g. once per day).

We want to continue supporting the use case where multiple Bazel instances share the same disk cache. To prevent concurrent garbage collections, we use a lockfile created inside the disk cache by open(O_CREAT | O_EXCL) (and an equivalent on Windows). However, this might not work on some NFSs. We do have a plan B for that, but we don't want to complicate things for now; consider using a real remote cache server if you want to share a cache across machines. The mtime of the lockfile is updated continuously during garbage collection to indicate that it is still in progress, in case the Bazel server crashes (or is Ctrl-C'ed).
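Not the actual implementation, but for illustration: in shell, a noclobber redirect maps to the same open(O_CREAT | O_EXCL) pattern, so an external GC script could take the lock the same way (paths and the cas/ layout are assumptions):

# Take the GC lock; fail fast if another GC already holds it.
if ( set -o noclobber; echo "$$" > "$DISK_CACHE/gc.lock" ) 2>/dev/null; then
  trap 'rm -f "$DISK_CACHE/gc.lock"' EXIT
  for shard in "$DISK_CACHE"/cas/*/; do     # CAS sharded by digest prefix
    touch "$DISK_CACHE/gc.lock"             # heartbeat: GC still in progress
    # ... collect garbage in this shard; safe to stop/resume between shards ...
  done
else
  echo "garbage collection already running" >&2
fi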

cc @tjgq in case I missed something.

smolkaj commented 1 year ago

It's really awesome to see progress on this issue :)

We don't want garbage collection to impact build performance, nor should users be aware of it, so it happens in the background between invocations.

Would that work in the context of continuous integration / GitHub Actions? In that context, there isn't really much time "between invocations": a VM comes up, downloads the cache, builds a bunch of stuff, and uploads the updated cache. Example workflow.

This seems to be a common use case cited in this thread, e.g. see @chandlerc's comment, which I linked.

EDIT: Perhaps there would still be enough time for GC for that use case, or perhaps there could be a flag to explicitly request GC.

coeuvre commented 1 year ago

EDIT: Perhaps there would still be enough time for GC for that use case, or perhaps there could be a flag to explicitly request GC.

I guess a flag that forces a GC run after invocation could work for this case.

meisterT commented 1 year ago

Alternatively, we could provide a simple script that does disk cache GC following the same algorithm (based on mtime) outside of the Bazel process. Then your CI run could call it before or after the build.
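Such a script could be a small variant of the size-capped pipeline posted earlier in this thread, switched from atime to mtime (the path and budget here are illustrative, not an official script):

CACHE=~/.cache/bazel-disk-cache
BUDGET=$((5 * 1024 ** 3 / 512))   # 5 GiB, counted in 512-byte blocks
# Keep the most recently modified entries; drop the rest once over budget.
find "$CACHE" -type f -printf '%T@ %b %p\0' |
    sort --numeric-sort --reverse --zero-terminated |
    awk -v RS='\0' -v ORS='\0' -v budget="$BUDGET" '{du += $2}; du > budget { print $3 }' |
    xargs -r0 rm -f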

limdor commented 1 year ago

A parameter on bazel clean would already be helpful for a lot of users. A lot of people care about disk space only when it is full, but with the current clean you cannot delete only what is not being used.

meisterT commented 1 year ago

@limdor we also discussed this but don't really like it because we do not want to teach users to run bazel clean - this is something they should not need to do.

limdor commented 1 year ago

@limdor we also discussed this but don't really like it because we do not want to teach users to run bazel clean - this is something they should not need to do.

That is a very good point. I understand and I agree; until now, this was the only reason I had to run bazel clean.