
Use fast bazel settings (e.g. remote cache) whenever possible, and make it difficult to use non-fast settings #293

CodingCanuck opened 2 years ago

CodingCanuck commented 2 years ago

The current eventuals build instructions run all builds locally. This is often slower (and ~never faster) than using the remote build cache that we've set up for this repo.

Users currently have to perform some undocumented steps in order to use the remote bazel cache for local builds. We should figure out how to give more users better build performance. Things to think about:

1. We probably can't make all build optimizations work for all users (e.g. we can't turn on world/anonymous write access to the GCP buckets we use for bazel remote caches).
2. Builds should work without errors for users outside our organizations who check out our repositories.
3. Frequent contributors to our codebases shouldn't waste time on slow builds (meaning they should probably use build features like the bazel remote cache).
4. Users should have a hard time accidentally using non-optimal settings (e.g. if you can use the cache, you should use the cache).

The best way to go about this is unclear. Maybe we could offer a one-time setup that users perform (e.g. via a setup script) to configure whether caches are used, though we'd have to make sure that setup happens before the first bazel build.
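For illustration only, such a one-time setup could be a small script that writes a user.bazelrc which the checked-in .bazelrc pulls in via `try-import %workspace%/user.bazelrc`; the script name, bucket URL, and credential check below are all hypothetical:

```sh
#!/usr/bin/env bash
# Hypothetical one-time setup script (e.g. ./setup-bazel.sh), run once
# before the first bazel build. The bucket name is a placeholder.
set -euo pipefail

if gcloud auth application-default print-access-token >/dev/null 2>&1; then
  # User has GCP credentials: point bazel at the remote cache.
  cat > user.bazelrc <<'EOF'
build --remote_cache=https://storage.googleapis.com/example-eventuals-cache
build --google_default_credentials
EOF
else
  # No credentials: build purely locally.
  echo "# No remote cache configured." > user.bazelrc
fi
```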

A big open question: how do other open source projects that use bazel handle this sort of bazel config, where some users use build features (e.g. bazel features like remote caching and remote execution) that are only available to an authorized subset of authenticated users?

CodingCanuck commented 2 years ago

See longer discussion on https://github.com/3rdparty/eventuals/pull/287

CodingCanuck commented 2 years ago

Some Googling leaves me empty-handed: I can't find any examples of open-source codebases that use bazel with a remote cache.

Some projects use it in what I think are CI environments (example)

One tiny project makes remote cache usage a --config (like --config=asan) and instructs users to manually add that to their personal .bazelrc file (example)
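For reference, that pattern is roughly the following shape (the config name and cache URL here are made up, not something we've set up):

```
# Project .bazelrc: define an opt-in config.
build:remote-cache --remote_cache=https://storage.googleapis.com/example-cache
build:remote-cache --google_default_credentials

# A user's personal .bazelrc opts in:
build --config=remote-cache
```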

We talked about TensorFlow offline: I don't think they instruct external users on how to use a remote build cache. They do, however, instruct users to run a one-time ./configure step before building from source, which prompts users to create a locally appropriate build config before building.

This BazelCon video (which I found via a GitLab post) makes me think that another option might be a GCP bucket (or some other bazel cache implementation) that's world-readable, but only writable from CI servers. That might be the best of all worlds: we're not exposing world-writable storage, but all users get fast read performance. The main downside might be the financial cost of the traffic incurred by users reading from our cache; maybe that's not too concerning, especially if we have per-user quotas?
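Concretely, the checked-in .bazelrc for that approach could be as small as the sketch below (the bucket name is a placeholder; the HTTPS endpoint is how bazel talks to a GCS-backed HTTP cache):

```
# Sketch: world-readable cache, read-only for everyone by default.
build --remote_cache=https://storage.googleapis.com/example-eventuals-cache
build --remote_upload_local_results=false
```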

@rjhuijsman WDYT about this world-readable, write-from-CI-only approach?

rjhuijsman commented 2 years ago

> @rjhuijsman WDYT about this world-readable, write-from-CI-only approach?

Seems like an excellent compromise, in that (1) it will likely take very little time, (2) it should be a huge speedup for everyone with no additional setup, and (3) bandwidth is pretty cheap compared to developer time.

From https://bazel.build/docs/remote-caching#read-write-remote-cache it does seem we need to explicitly specify build --remote_upload_local_results=false for those with read-only access. How would we do that?

CodingCanuck commented 2 years ago

> > @rjhuijsman WDYT about this world-readable, write-from-CI-only approach?
>
> Seems like an excellent compromise, in that (1) it will likely take very little time, (2) it should be a huge speedup for everyone with no additional setup, and (3) bandwidth is pretty cheap compared to developer time.
>
> From https://bazel.build/docs/remote-caching#read-write-remote-cache it does seem we need to explicitly specify build --remote_upload_local_results=false for those with read-only access. How would we do that?

The proposal I got from the BazelCon talk is that only the CI machine would upload results, so we'd just add --remote_upload_local_results=false to the .bazelrc and have the CI machines add --remote_upload_local_results=true.
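As a sketch of the CI side (the target pattern and key path are placeholders), the CI job would just override the flag and supply writer credentials on the command line:

```
bazel test //... \
  --remote_upload_local_results=true \
  --google_credentials=/path/to/ci-writer-key.json
```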

CodingCanuck commented 2 years ago

An addendum: a world-readable remote build cache is likely only acceptable for our open-source repos, since a world-readable build cache for a closed-source repo could result in leaks.

rjhuijsman commented 2 years ago

> > @rjhuijsman WDYT about this world-readable, write-from-CI-only approach?
> >
> > Seems like an excellent compromise, in that (1) it will likely take very little time, (2) it should be a huge speedup for everyone with no additional setup, and (3) bandwidth is pretty cheap compared to developer time. From https://bazel.build/docs/remote-caching#read-write-remote-cache it does seem we need to explicitly specify build --remote_upload_local_results=false for those with read-only access. How would we do that?
>
> The proposal I got from the BazelCon talk is that only the CI machine would upload results, so we'd just add --remote_upload_local_results=false to the .bazelrc and have the CI machines add --remote_upload_local_results=true.

That makes total sense!

rjhuijsman commented 2 years ago

> An addendum: a world-readable remote build cache is likely only acceptable for our open-source repos, since a world-readable build cache for a closed-source repo could result in leaks.

Agreed. We can use almost the same approach for our closed-source repos, but give all of our codespaces an environment variable carrying read-only credentials for the build cache. That also hugely reduces the risk of that credential leaking.
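As a sketch of how that could be wired up (the environment variable name and paths are invented for illustration), the codespace could materialize the read-only credential and point bazel at it:

```sh
# BUILD_CACHE_READER_KEY is a hypothetical codespace secret holding a
# read-only service-account key; write it to disk for bazel to use.
echo "${BUILD_CACHE_READER_KEY}" > /tmp/cache-reader-key.json

# Read from the cache with that credential, still never uploading.
bazel build //... \
  --google_credentials=/tmp/cache-reader-key.json \
  --remote_upload_local_results=false
```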

CodingCanuck commented 2 years ago

> > An addendum: a world-readable remote build cache is likely only acceptable for our open-source repos, since a world-readable build cache for a closed-source repo could result in leaks.
>
> Agreed. We can use almost the same approach for our closed-source repos, but give all of our codespaces an environment variable carrying read-only credentials for the build cache. That also hugely reduces the risk of that credential leaking.

Yep. Though to emphasize, we'll want different buckets (1+ world-readable, 1+ only readable by authn'd users) for different repos (open vs. closed source).

Credential leak: if CI machines are GitHub Action Runners, then users authorized to run CI builds (pre-merge checks) will still have access to those write-access secrets.

rjhuijsman commented 2 years ago

> Credential leak: if CI machines are GitHub Action Runners, then users authorized to run CI builds (pre-merge checks) will still have access to those write-access secrets.

This makes me wonder whether at least our open source projects might eventually have two kinds of checks: pre-merge (uses read-only credentials) and post-merge (uses read-write credentials). But I can live with that being a TODO for the foreseeable future.
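In terms of the bazel invocations, that split might look roughly like this (credential paths are placeholders; the GitHub Actions wiring is omitted):

```
# Pre-merge check: read-only credential, never writes to the cache.
bazel test //... \
  --google_credentials=/secrets/cache-reader-key.json \
  --remote_upload_local_results=false

# Post-merge check: writer credential, populates the cache for everyone.
bazel test //... \
  --google_credentials=/secrets/cache-writer-key.json \
  --remote_upload_local_results=true
```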

CodingCanuck commented 2 years ago

> > Credential leak: if CI machines are GitHub Action Runners, then users authorized to run CI builds (pre-merge checks) will still have access to those write-access secrets.
>
> This makes me wonder whether at least our open source projects might eventually have two kinds of checks: pre-merge (uses read-only credentials) and post-merge (uses read-write credentials). But I can live with that being a TODO for the foreseeable future.

This makes sense to me!

CodingCanuck commented 2 years ago

I spoke with the https://buildbuddy.io folks this week; they pointed out that Elastic's open-source Kibana codebase is configured to use a bazel remote cache: https://github.com/elastic/kibana/blob/main/.bazelrc#L5-L11

The TL;DR is that they use a world-readable cache where local users have cache writes turned off (which is what we've been discussing).

CodingCanuck commented 2 years ago

Sadly, I saw a remote cache download timeout on a codespace today, even after we switched to a multi-region bucket:

WARNING: Remote Cache: 
com.google.devtools.build.lib.remote.http.DownloadTimeoutException: Download of '/reboot-dev-bazel-remote-cache-eventuals-us/ac/ba5d575626e12b8280d43576f37b7d8d2492a163eb012e84f93f47a9705f0eee' timed out. Received 0 bytes.
        at com.google.devtools.build.lib.remote.http.HttpDownloadHandler.exceptionCaught(HttpDownloadHandler.java:156)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
        at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424)
        at io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:92)
        at io.netty.channel.CombinedChannelDuplexHandler$1.fireExceptionCaught(CombinedChannelDuplexHandler.java:145)
        at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
        at io.netty.channel.CombinedChannelDuplexHandler.exceptionCaught(CombinedChannelDuplexHandler.java:231)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
        at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
        at io.netty.handler.ssl.SslHandler.exceptionCaught(SslHandler.java:1104)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
        at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
        at com.google.devtools.build.lib.remote.http.IdleTimeoutHandler.channelIdle(IdleTimeoutHandler.java:43)
        at io.netty.handler.timeout.IdleStateHandler$AllIdleTimeoutTask.run(IdleStateHandler.java:579)
        at io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:478)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)

I'm not sure what ~SLA to expect here.