buchgr / bazel-remote

A remote cache for Bazel
https://bazel.build
Apache License 2.0

S3 Proxy with bucket expiration #395

Open · djmarcin opened this issue 3 years ago

djmarcin commented 3 years ago

We are using bazel-remote with a proxy to an s3 bucket that has a 30 day expiration policy. I believe we've started to encounter a situation where the CI machine's bazel-remote cache retains an artifact that has been dropped from s3. Because the artifact exists in the bazel-remote disk cache, bazel-remote never checks whether it still exists in the s3 bucket, so it is never refreshed after it expires from s3. This means that developer builds (which have read-only access to the s3 bucket) cannot find the artifacts, and CI never re-uploads them.

Is there a way to configure bazel-remote for this sort of expiration? If not, is there an API that could be used to purge the cache periodically, or some other way to ensure that the presence of artifacts in the backing store is regularly checked?

mostynb commented 3 years ago

Hi, I don't think bazel-remote currently has a good solution for this. I suspect that most users of bazel-remote's s3 backend simply have deep pockets and never delete s3 objects, or maybe they clear bazel-remote's disk cache at the same time, but these are just guesses.

There has been talk of possibly implementing a proxy-only mode (i.e. no local disk cache; proxy all requests to s3), but I have some performance concerns, and I don't have an AWS account to test with.

Another option would be to implement asynchronous uploads for every local disk cache hit (checking whether the blob still exists on s3 before uploading).
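For illustration, a rough sketch of that second idea using the aws-sdk-go-v2 client (bazel-remote's S3 proxy may use a different client internally; the bucket name, key layout and local path below are hypothetical): after a disk cache hit, cheaply check whether the object still exists upstream and re-upload it if it has expired.

```go
// Sketch only: re-upload a blob to S3 after a local disk cache hit if the
// object has expired upstream. Not bazel-remote's actual implementation.
package main

import (
	"context"
	"errors"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// ensureInS3 would be launched on a goroutine after each local disk cache
// hit: check cheaply whether the blob still exists upstream, and re-upload
// it from the local cache file if it has expired.
func ensureInS3(ctx context.Context, client *s3.Client, bucket, key, localPath string) error {
	_, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err == nil {
		return nil // still present upstream, nothing to do
	}
	var notFound *types.NotFound
	if !errors.As(err, &notFound) {
		return err // some other failure (auth, network, ...)
	}

	// The object expired upstream: re-upload it from the disk cache.
	f, err := os.Open(localPath)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   f,
	})
	return err
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)
	// Hypothetical bucket name and key/path layout.
	if err := ensureInS3(context.Background(), client,
		"my-cache-bucket", "cas/<sha256>", "/data/cas/<sha256>"); err != nil {
		log.Printf("re-upload failed: %v", err)
	}
}
```

The HeadObject call keeps the common case (blob still present) cheap; only expired blobs pay for a re-upload.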

MaciejKucia commented 3 years ago

Would some kind of periodic bucket scan solve this?

mostynb commented 3 years ago

I don't think S3 keeps track of the last access time for a blob, so I'm not sure what we would scan for.

MaciejKucia commented 3 years ago

You can with CloudWatch, but I merely meant scanning for deleted objects. I want to expire S3 objects and re-populate them if needed.

mostynb commented 3 years ago

Re-reading the original issue description: @djmarcin has CI talking to bazel-remote, which proxies blobs to S3 (where blobs expire 30 days after they're created), while developers have direct read-only access to S3. The problem is that blobs which stay in bazel-remote's disk cache are never repopulated in S3 after they're deleted there.

One potential solution (which https://github.com/znly/bazel-cache/ uses with GCS) is to make bazel-remote update the S3 blob's metadata every time the blob is served from its disk cache. For S3, I guess this would mean copying the blob onto itself, and re-uploading it if it no longer exists. I think that self-copy can be done without transferring the data again.
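A hedged sketch of that "touch" approach with aws-sdk-go-v2 (the bucket name and key layout are assumptions, and a real implementation would inspect error types rather than re-uploading on any copy failure): a self-copy with a replaced metadata directive refreshes the object's LastModified date, which a creation-based lifecycle rule uses, without moving the object data.

```go
// Sketch only: "touch" an S3 object by copying it onto itself so a
// creation-based lifecycle rule sees a fresh LastModified date, and
// re-upload the blob from the local disk cache if it has already expired.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

func touchOrReupload(ctx context.Context, client *s3.Client, bucket, key, localPath string) error {
	// S3 only accepts a self-copy if something changes, so replace the
	// metadata with a trivial marker. No object data is transferred.
	_, err := client.CopyObject(ctx, &s3.CopyObjectInput{
		Bucket:            aws.String(bucket),
		Key:               aws.String(key),
		CopySource:        aws.String(fmt.Sprintf("%s/%s", bucket, key)),
		MetadataDirective: types.MetadataDirectiveReplace,
		Metadata:          map[string]string{"touched-at": time.Now().UTC().Format(time.RFC3339)},
	})
	if err == nil {
		return nil // expiration clock refreshed without re-uploading the bytes
	}

	// The copy failed (e.g. the object has already expired): re-upload it
	// from the local disk cache.
	f, openErr := os.Open(localPath)
	if openErr != nil {
		return openErr
	}
	defer f.Close()
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   f,
	})
	return err
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical bucket and cas key layout.
	if err := touchOrReupload(context.Background(), s3.NewFromConfig(cfg),
		"my-cache-bucket", "cas/<sha256>", "/data/cas/<sha256>"); err != nil {
		log.Fatal(err)
	}
}
```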

Another, simpler solution might be to run a separate bazel-remote instance for developers, which they have read-write access to, and which has read-only access to the CI bazel-remote instance (using the http proxy backend). However, this might also require modifications to bazel-remote if you're using compressed storage, which I haven't gotten around to yet.

carpenterjc commented 2 years ago

I have been considering this problem for S3 too, as we want a more sophisticated solution which allows us to keep build graphs referenced while using --remote_download_toplevel or --remote_download_minimal. The problem with those options is that bazel avoids downloading object files from the cas unless it needs to relink that part of the build. If your cache expiration policy deletes files based on last access time irrespective of type, then in the rare case where you do need to relink, the required files can no longer be pulled from the cas.

I wondered about adapting bazel-remote so that, when it validates the ac records, it records all references to cas items in a linked redis instance. We could then write another simple container that queries all items in S3 and expires records which are no longer referenced.
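A sketch of the bookkeeping half of that idea, using the go-redis client (the hook point inside bazel-remote and the "ac/<hash>" / "cas/<hash>" key scheme are assumptions): whenever an ac record is validated, stamp a last-access time on it and on every cas blob it references.

```go
// Sketch only: record last-access times for ac and cas entries in Redis so
// a separate sweeper can expire unreferenced S3 objects.
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// recordAccess is intended to be called when an ac record is validated,
// with the hashes of all cas blobs that the action result references.
func recordAccess(ctx context.Context, rdb *redis.Client, acHash string, casHashes []string) error {
	now := time.Now().UTC().Format(time.RFC3339)
	pipe := rdb.Pipeline()
	pipe.Set(ctx, "ac/"+acHash, now, 0)
	for _, h := range casHashes {
		// One ac access keeps all of its referenced cas blobs alive.
		pipe.Set(ctx, "cas/"+h, now, 0)
	}
	_, err := pipe.Exec(ctx)
	return err
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	err := recordAccess(context.Background(), rdb,
		"<ac sha256>", []string{"<cas sha256 1>", "<cas sha256 2>"})
	if err != nil {
		log.Fatal(err)
	}
}
```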

Our current solution parses http server access logs, which is error prone and slow due to the sheer quantity of data in the access logs.

If the data stored in redis is a simple map from (cas|ac)/checksum to the date last accessed, then you don't need to exchange much information with redis, as each item changes at most once per day. The S3 sweep could be as simple as: query all items in S3, look up the last access date in redis, and schedule an item for removal if it has not been accessed recently.

This works with reproducible builds, where many ac records may reference one cas record, because it only takes one ac access to keep its cas records alive.
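And a sketch of the sweep described above, under the same assumptions (aws-sdk-go-v2 plus go-redis, hypothetical bucket name, key scheme and retention window): list every object in the bucket, look up its last access in redis, and delete anything not touched within the window.

```go
// Sketch only: sweep an S3 cache bucket, deleting objects whose Redis
// last-access entry is missing or older than the retention window.
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/redis/go-redis/v9"
)

const retention = 30 * 24 * time.Hour // assumed retention window

func sweep(ctx context.Context, s3c *s3.Client, rdb *redis.Client, bucket string) error {
	cutoff := time.Now().UTC().Add(-retention)
	paginator := s3.NewListObjectsV2Paginator(s3c, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
	})
	for paginator.HasMorePages() {
		page, err := paginator.NextPage(ctx)
		if err != nil {
			return err
		}
		for _, obj := range page.Contents {
			key := aws.ToString(obj.Key) // expected to look like "cas/<hash>" or "ac/<hash>"
			last, err := rdb.Get(ctx, key).Result()
			stale := err == redis.Nil // never recorded as accessed
			if err == nil {
				if t, perr := time.Parse(time.RFC3339, last); perr == nil && t.Before(cutoff) {
					stale = true
				}
			}
			// Any other redis error leaves the object untouched (safe default).
			if !stale {
				continue
			}
			if _, err := s3c.DeleteObject(ctx, &s3.DeleteObjectInput{
				Bucket: aws.String(bucket),
				Key:    obj.Key,
			}); err != nil {
				return err
			}
			log.Printf("expired %s", key)
		}
	}
	return nil
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	if err := sweep(context.Background(), s3.NewFromConfig(cfg), rdb, "my-cache-bucket"); err != nil {
		log.Fatal(err)
	}
}
```

Erring on the side of keeping objects when redis is unreachable (as above) avoids accidentally emptying the bucket during an outage.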