TraceMachina / nativelink

NativeLink is an open source high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and specialized hardware.
https://nativelink.com
Apache License 2.0

Add AWS-native k8s Deployment #116

Open aaronmondal opened 1 year ago

aaronmondal commented 1 year ago

I'd like to run turbo-cache in a k8s cluster deployed with Pulumi so that we can automatically set it up for users as part of rules_ll. Simple YAML manifests would be usable for users of raw k8s, Terraform, and Pulumi.

I'd be willing to work on this :relaxed:

allada commented 1 year ago

I thought about how to deal with Kubernetes, and I think the best way to handle this is to make a new scheduler implementation and have the JSON config file pick which implementation to use.

I don't remember all the details, but I do remember that a special scheduler (and possibly a modified worker impl) is likely to give the best experience, since you'd be able to talk natively to the k8s API, especially when it comes to auto-scaling.
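To make that concrete, here is a minimal sketch (not the actual turbo-cache config schema; the variant and field names are made up) of how a serde-tagged enum could let the JSON config pick a scheduler implementation:

```rust
// Hypothetical config sketch: the JSON config selects which scheduler
// implementation to construct. Variant and field names are illustrative only.
use serde::Deserialize;

#[derive(Deserialize)]
#[serde(rename_all = "snake_case")]
enum SchedulerConfig {
    // The existing in-process scheduler.
    Simple { max_job_retries: u32 },
    // A hypothetical k8s-aware scheduler that talks to the cluster API,
    // e.g. for auto-scaling decisions.
    Kubernetes { namespace: String, max_workers: u32 },
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = r#"{ "kubernetes": { "namespace": "turbo-cache", "max_workers": 64 } }"#;
    let config: SchedulerConfig = serde_json::from_str(raw)?;
    match config {
        SchedulerConfig::Simple { .. } => println!("using the simple in-process scheduler"),
        SchedulerConfig::Kubernetes { namespace, max_workers } => {
            println!("k8s-aware scheduler in namespace {namespace}, up to {max_workers} workers");
        }
    }
    Ok(())
}
```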

aaronmondal commented 1 year ago

After looking into this a bit more I think the configuration part should be fairly straightforward.

Autoscaling that's actually good is probably a bit harder, but that seems doable as well.

allada commented 1 year ago

I'd be interested to see what you are thinking for implementing it. I think the best way to proceed here is to write a proof-of-concept as throw-away code, then send it off to reviewers for an overview of how it will work. This kind of task is going to be tricky to do in stages because there is a high chance of late-stage design issues that are impossible to see early on.

So my recommendation is to make a really hacky version that fully works (not high-quality code, just whatever it takes to make it work), and then we can discuss how to best put it into production.

I'm not sure I have a full view in my mind of how this would work with k8s. For example, should the scheduler talk directly to the cluster controller, or should an external scaling agent monitor the scheduler for scale-ups/downs? It gets even more complicated if lambda support is added.

aaronmondal commented 1 year ago

Fortunately @rsenghaas expressed interest in helping me out, and we'll go over the codebase over the weekend and brainstorm a bit :relaxed: Also cc @jaroeichler, hit me up if you're interested :kissing:

@allada A few random thoughts we're considering:


(*) I'm not sure autoscaling is actually the best way to go. At least naive autoscaling probably won't do; we've had bad experiences with it in the past with remote execution. The RBE providers we tested seemed to optimize for ease of use and ease of scaling, but not for raw performance.

Dynamically scaling worker nodes means that artifacts need to be transferred from cache nodes to those worker nodes. This can become a huge waste of power and resources if it's unclear what the underlying hardware actually looks like. E.g. splitting 2 threads of a 50-thread job out to a new machine could trigger gigantic data transfers. Anything that doesn't run a worker process on the same physical machine as the cache could cause a significant latency increase. Sure, clouds have their own backbone networks and such, but that's still orders of magnitude slower than having the data on the same physical machine.

However, scaling nodes instead of pods could be a different story. But in any case, I'm fairly certain that we won't be able to abstract away the hardware in any way and it could be quite tricky to get this right. I love it :laughing:


Additionally, somewhat unrelated to turbo-cache itself, but relevant for @rsenghaas (and @jaroeichler?) for our tests: we should add an S3-compatible storage backend to the ll up cluster. For cloud tests we can use whatever S3-compatible storage backend cloud providers offer, but I'd be quite interested in testing a CubeFS deployment on bare-metal k8s.

allada commented 1 year ago

Where to start...

  1. For the command lines, I'm not against it, but I would like to make sure that anything that can be done in the CLI can also be done in the config file. I don't expect complicated mappings to be needed for this, and clap probably works with serde (but I haven't looked; see the sketch after this list). I would be surprised if enterprise deployments used the CLI, but I can see how it could be useful for experimenting and testing.
  2. Eventually the CAS binary would likely need to be broken off into different binaries. But I have worked with projects in the past that separated things like this, and it made experimenting and testing so complicated. What might work is to make a couple of binaries, plus one binary that contains all the services. This would possibly give a happy medium and would likely result in very little additional code.
  3. If you are referring to publishing "official" container images, I don't think this project is quite ready for that. There already is a Dockerfile in the deployment-examples that is probably close to what you are thinking. Specializing one to k8s is not a bad idea at all and is likely the best way to develop k8s support.
  4. I really like the idea of targeting wasm, but I would really like to hear an end-to-end use case and how wasm would help.
  5. Do you have any alternatives to auto-scaling? The design I was playing around with last year involved having the workers periodically publish information to the scheduler about which artifacts they had cached, then letting the scheduler find the most optimal worker based on that cache. Obviously this would add a bit of complexity, but most of the difficult parts would live in an implementation of find_worker_for_action_mut() and the storage of the available items on the scheduler. The big problem I was having with this idea was that I could not think of a good, efficient algorithm to use, but now, after working on another project recently, I think the best approach is one of the approximate-nearest-neighbor (ANN) algorithms, since they are quite fast and some of them can be interrupted and still give a very good result (for example, while searching for the best node for an incoming job, if we find the queue is growing too fast, we can just send the job to the best worker found so far). I do have a bit of experience dealing with these scaling problems in enterprise environments; we found that optimizing the CAS to xfer the data onto the node faster had the biggest impact (which is why I thought making the scheduler smarter might be the best general solution for people).
  6. I think using redis makes sense as a possible backend. I started down the path of making a redis store implementation, but abandoned it because in my past experience redis was much more expensive than just using s3. But then again, our artifacts were likely 100x to 1000x larger than most use cases (each job used about 2-4k files @ ~8G-10G total, and the top 5% of files rarely had a cache hit, resulting in ~5-8G needing to be downloaded per job). The only concern I have at this time about making the scheduler use redis is the speed and quantity of locks needed. The thing I'd be super excited about with this idea is that it would finally make the scheduler no longer the single point of failure.
  7. Can you explain your thinking of node taints and node affinities in more detail?
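Regarding point 1, the sketch referenced above: a minimal example of how one struct could back both the CLI and the config file, assuming clap with the derive feature plus serde/serde_json; the field names are illustrative, not the real config schema.

```rust
// Sketch: derive both clap::Parser and serde::Deserialize on one struct so
// every CLI flag automatically has a config-file equivalent.
use clap::Parser;
use serde::Deserialize;

#[derive(Parser, Deserialize, Debug)]
#[command(name = "cas")]
struct Settings {
    /// Address the gRPC services listen on (illustrative field).
    #[arg(long, default_value = "0.0.0.0:50051")]
    #[serde(default = "default_listen")]
    listen_address: String,

    /// Optional path to a JSON config file; when given, this sketch uses it
    /// instead of the flags.
    #[arg(long)]
    #[serde(skip)]
    config: Option<String>,
}

fn default_listen() -> String {
    "0.0.0.0:50051".to_string()
}

fn main() {
    // Parse CLI flags first...
    let cli = Settings::parse();
    // ...then, if a config file was given, load the same schema from JSON.
    let from_file: Option<Settings> = cli
        .config
        .as_ref()
        .and_then(|path| std::fs::read_to_string(path).ok())
        .and_then(|text| serde_json::from_str(&text).ok());
    println!("{:?}", from_file.unwrap_or(cli));
}
```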

FYI, at one point I experimented with using an S3 filesystem, but realized that to do this efficiently I'd need to write the FUSE filesystem myself. The idea is that you could send FUSE a serialized version of the digest tree, and when it is mounted, it would make the files visible in that specific layout in the mounted directory. The cool part is that the FUSE driver would only download the parts of the files that are actually needed, upon request. I found most jobs only use <10% of the files (and often only parts of those files) they say they need. The last big advantage is that you could easily gather stats on which parts of files are most frequently used and xfer those onto nodes at boot time (or as a shared persistent multi-mount EBS).
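The gist of that lazy-download idea, stripped of all the actual FUSE plumbing, might look something like the sketch below. Everything here is hypothetical; `read_range` just stands in for a ranged CAS read (e.g. a ByteStream read with an offset and limit).

```rust
use std::collections::HashMap;

// Stand-in for a CAS client with ranged reads.
trait Cas {
    fn read_range(&self, digest: &str, offset: u64, len: usize) -> Vec<u8>;
}

// The "serialized digest tree" handed to the driver: path -> (digest, size).
struct LazyTree {
    entries: HashMap<String, (String, u64)>,
}

impl LazyTree {
    // What the FUSE `read` callback would do: resolve the path to a digest
    // and download only the byte range that was actually requested.
    fn read(&self, cas: &impl Cas, path: &str, offset: u64, len: usize) -> Option<Vec<u8>> {
        let (digest, _size) = self.entries.get(path)?;
        Some(cas.read_range(digest, offset, len))
    }
}

struct FakeCas;
impl Cas for FakeCas {
    fn read_range(&self, digest: &str, offset: u64, len: usize) -> Vec<u8> {
        println!("fetching {len} bytes of {digest} at offset {offset}");
        vec![0; len]
    }
}

fn main() {
    let mut entries = HashMap::new();
    entries.insert(
        "include/llvm/IR/Module.h".to_string(),
        ("made-up-digest".to_string(), 1 << 20),
    );
    let tree = LazyTree { entries };
    // Only 4 KiB of the 1 MiB file is ever transferred.
    let _chunk = tree.read(&FakeCas, "include/llvm/IR/Module.h", 0, 4096);
}
```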

aaronmondal commented 1 year ago

Hahaha this conversation is escalating quickly :rofl:

Where to start...

> 1. For the command lines, I'm not against it, but I would like to make sure that anything that can be done in the CLI can also be done in the config file. I don't expect complicated mappings to be needed for this, and clap probably works with serde (but I haven't looked). I would be surprised if enterprise deployments used the CLI, but I can see how it could be useful for experimenting and testing.

Agreed. A CLI is only useful for experimenting and config files should be the "default" usage pattern.

> 2. Eventually the CAS binary would likely need to be broken off into different binaries. But I have worked with projects in the past that separated things like this, and it made experimenting and testing so complicated. What might work is to make a couple of binaries, plus one binary that contains all the services. This would possibly give a happy medium and would likely result in very little additional code.

This is also what I was thinking. If I'm not mistaken it should be possible to more-or-less leave the cas target as it is and just add additional "finer grained" rust_binary targets that use the service impls.

> 3. If you are referring to publishing "official" container images, I don't think this project is quite ready for that. There already is a Dockerfile in the deployment-examples that is probably close to what you are thinking. Specializing one to k8s is not a bad idea at all and is likely the best way to develop k8s support.

Actually I don't care about final images, just about a CI-friendly way to build a k8s-friendly image. We usually build containers with nix directly from upstream sources and use incremental builds for the image layers, similar to how it works in Bazel. E.g. our remote execution image is built like that. No need to distribute a prebuilt image if you can fully reproduce it with the exact same final hash :smile: This might be a bit too incompatible with "most" setups though, as it requires tooling to handle such image builds.

I do think k8s-specialized images in general seem like the lowest-hanging fruit at the moment though and I think it makes sense to focus on that first.

> 4. I really like the idea of targeting wasm, but I would really like to hear an end-to-end use case and how wasm would help.

First, there is the convenience that wasm basically obsoletes the entire container build pipeline. You can build the wasm module with a regular Rust toolchain, and then you're done and can deploy it immediately.

Second, wasm modules start up orders of magnitude faster than a container, which enables a flow like: user makes a request to the server -> server starts the "final mile" cache service wasm module -> the wasm module serves the cache -> as soon as the user closes the connection, the wasm module is shut down. Storage, including in-memory stores, would persist across invocations and be handled "separately" from the scheduler. With that, one would only need to care about replicating storage (e.g. Redis/Dragonfly autoreplication) and could pretty much blindly deploy the actual scheduler to whatever edge network.
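As a rough illustration of the per-request part of that flow, here's a tiny sketch assuming the wasmtime and anyhow crates; the "final mile" cache module and its `serve` export are entirely hypothetical.

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // Compile the module once at startup; instantiation afterwards is cheap
    // enough to do per incoming connection.
    let engine = Engine::default();
    let module = Module::from_file(&engine, "final_mile_cache.wasm")?;

    // Pretend this block runs for every incoming user connection.
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Hypothetical export that serves the cache until the client disconnects.
    let serve = instance.get_typed_func::<(), ()>(&mut store, "serve")?;
    serve.call(&mut store, ())?;

    // Dropping the store tears everything down once the connection closes.
    Ok(())
}
```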

> 5. Do you have any alternatives to auto-scaling? The design I was playing around with last year involved having the workers periodically publish information to the scheduler about which artifacts they had cached, then letting the scheduler find the most optimal worker based on that cache. Obviously this would add a bit of complexity, but most of the difficult parts would live in an implementation of find_worker_for_action_mut() and the storage of the available items on the scheduler. The big problem I was having with this idea was that I could not think of a good, efficient algorithm to use, but now, after working on another project recently, I think the best approach is one of the approximate-nearest-neighbor (ANN) algorithms, since they are quite fast and some of them can be interrupted and still give a very good result (for example, while searching for the best node for an incoming job, if we find the queue is growing too fast, we can just send the job to the best worker found so far). I do have a bit of experience dealing with these scaling problems in enterprise environments; we found that optimizing the CAS to xfer the data onto the node faster had the biggest impact (which is why I thought making the scheduler smarter might be the best general solution for people).

Hmm, that does sound reasonable. I don't have an issue with autoscaling per se, I just feel like current implementations are too inefficient and don't take { hardware | latency | physical distance | usage patterns } into account well. This is a really hard thing to get right though. It might actually be an interesting research topic to use ML for something like that, and apparently a lot of research has already gone into optimizing ANN search like this. But this also seems like more of a "luxury" problem, and first we need to get unoptimized distribution working at all :sweat_smile:

Just in case you didn't stumble upon it already: https://github.com/erikbern/ann-benchmarks :blush:
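For what it's worth, here's a toy sketch of that cache-affinity idea (nothing like the real find_worker_for_action_mut(); it only shows "score workers by how many input digests they already have cached, and allow the search to be cut short when the queue grows too fast"):

```rust
use std::collections::HashSet;
use std::time::Instant;

struct Worker {
    id: String,
    // Digests this worker reported as locally cached.
    cached_digests: HashSet<String>,
}

// Returns the id of the best worker examined before the deadline expired.
fn pick_worker<'a>(
    workers: &'a [Worker],
    inputs: &HashSet<String>,
    deadline: Instant,
) -> Option<&'a str> {
    let mut best: Option<(&str, usize)> = None;
    for worker in workers {
        let overlap = worker.cached_digests.intersection(inputs).count();
        if best.map_or(true, |(_, score)| overlap > score) {
            best = Some((worker.id.as_str(), overlap));
        }
        // "Interruptible" search: if the queue is growing too fast, settle
        // for the best candidate found so far.
        if Instant::now() >= deadline {
            break;
        }
    }
    best.map(|(id, _)| id)
}

fn main() {
    let workers = vec![Worker {
        id: "worker-0".into(),
        cached_digests: ["a".to_string(), "b".to_string()].into_iter().collect(),
    }];
    let inputs: HashSet<String> = ["a".to_string(), "c".to_string()].into_iter().collect();
    let deadline = Instant::now() + std::time::Duration::from_millis(5);
    println!("{:?}", pick_worker(&workers, &inputs, deadline));
}
```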

> 6. I think using redis makes sense as a possible backend. I started down the path of making a redis store implementation, but abandoned it because in my past experience redis was much more expensive than just using s3. But then again, our artifacts were likely 100x to 1000x larger than most use cases (each job used about 2-4k files @ ~8G-10G total, and the top 5% of files rarely had a cache hit, resulting in ~5-8G needing to be downloaded per job). The only concern I have at this time about making the scheduler use redis is the speed and quantity of locks needed. The thing I'd be super excited about with this idea is that it would finally make the scheduler no longer the single point of failure.

Fortunately we have a bunch of cloud credits that we need to burn within the next 9 months, so I think testing something like this should be fine for us :innocent: This also seems like something really fun to implement, so I'll play around with this :relaxed:
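In case it's useful as a starting point, a tiny sketch of the redis-as-a-store-backend idea using the redis crate; the key scheme (blobs keyed by digest under a `cas:` prefix) is made up.

```rust
use redis::Commands;

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    // Store a CAS blob under its digest...
    let digest = "0a1b2c3d-1234"; // hash-size pair, made up for the example
    let _: () = con.set(format!("cas:{digest}"), b"object-file-bytes".to_vec())?;

    // ...and fetch it back by digest.
    let blob: Vec<u8> = con.get(format!("cas:{digest}"))?;
    assert_eq!(blob, b"object-file-bytes".to_vec());
    Ok(())
}
```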

Our primary use case at the moment is building LLVM, and I'd like to build it as frequently as possible. However, while LLVM gets a commit every ~10 min or so and commits tend to cause mass rebuilds, we'll only have a few build jobs running for that.

I believe one full build is ~10k actions; not sure how many files, but it's probably a lot smaller than your use case. If I remember correctly, inefficient cache/remote-exec usage brought us to ~200GB of network transfer across ~50 jobs for from-scratch builds, but that should've been at most ~20 gigs, and it should've happened on a single machine without any network transfer. Unfortunately the cloud config for those builds was outside of our control.

We've now pivoted to a different approach: the ll toolchain can reuse remote artifacts during local execution. So our use case is something like "1 CI runner on a 128-core machine builds 24/7 and everyone else downloads those baseline artifacts to their laptop and builds things locally". We need artifacts to be distributed as quickly as possible across the world, but for now we won't run exec jobs on different machines (and if we do, we can manage those manually). Here is a visualization.

> 7. Can you explain your thinking of node taints and node affinities in more detail?

Node affinity/label: Encourage deployments to go to the same node. For instance, enforce that a remote execution node always comes with a cache deployment that's potentially pre-populated before the remote execution service starts getting requests.

Node taint: Prevent deployments from going to a tainted node, for instance to prevent a remote execution job from being run on an edge node, or to prevent an in-memory cache from being deployed to a "slow" storage node, or to prevent S3 storage from being deployed to an in-memory cache node. Endless possibilities :rofl:

AFAIK affinities and labels can both be set and changed dynamically via e.g. Pulumi, Terraform or via operators. I'll have to look into whether this is actually useful though, and even if it is, it's probably a rather late-stage optimization.

> FYI, at one point I experimented with using an S3 filesystem, but realized that to do this efficiently I'd need to write the FUSE filesystem myself. The idea is that you could send FUSE a serialized version of the digest tree, and when it is mounted, it would make the files visible in that specific layout in the mounted directory. The cool part is that the FUSE driver would only download the parts of the files that are actually needed, upon request. I found most jobs only use <10% of the files (and often only parts of those files) they say they need. The last big advantage is that you could easily gather stats on which parts of files are most frequently used and xfer those onto nodes at boot time (or as a shared persistent multi-mount EBS).

Sounds very interesting! It also sounds very custom-tailored to the remote-caching use case, which I'm not sure is a good or a bad thing. This sounds like it could improve performance quite significantly, but I wonder whether that's really worth the effort and complexity compared to just "brute-force storing in memory". Just regarding monitoring, could this be a good use case for an eBPF/XDP module? I've always wanted to play with that but never had a good use case for it (until now?) :smile:

allada commented 1 year ago
  1. I'm all for creating a k8s-specialized image as you described :smile:
  2. Yeah, thanks, I did see the ANN benchmark repo. We both agree this is a luxury and there are bigger fruits to pick right now.
  3. As for your 1 CI runner on a 128-core machine building 24/7, this is very similar to the very first iteration of the design at a previous job. We had 2 user roles, "CI-main" and "end user". Anything published onto the main branch had the CI flagged in a special way so that it would publish the artifacts to the global ActionCache. All other CI jobs did not have remote caching at all. End users would have a unique folder in a shared s3 bucket that they would publish their ActionCache to. All ContentAddressableStorage objects were stored together in a single s3 folder for all users and CI. So, to put it all together, when users invoked bazel, it would automatically start up a local proxy that bazel was connected to. This proxy would be configured to first attempt to download artifacts from the shared s3 folder (the one CI-main would publish to), then, if that failed, try the directory tied to the user (see the sketch after this list). This gave us a high degree of confidence that end users would not poison each other by accident.
  4. Yeah, I'd rather defer the topic of affinity and taints. I'm pretty sure the already-implemented node properties might solve the specific problem anyway.
  5. As for "I wonder whether that's really worth the effort and complexity compared to just brute-force storing in memory": the real problem is that it takes a lot of time to download the artifacts needed to remote execute. A CAS node can already store stuff in memory/on disk without any issues to service requests faster. Remote execution, on the other hand, requires (by spec) all files to be accessible at invocation time of the job, but many of these files are never actually read, or only parts of them are read.
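A compact sketch of the lookup order described in point 3, as referenced there; the `fetch` helper and the key layout are placeholders, not a real S3 client call.

```rust
// Placeholder for an actual S3 GetObject call.
fn fetch(bucket: &str, key: &str) -> Option<Vec<u8>> {
    let _ = (bucket, key);
    None
}

fn lookup_action_result(bucket: &str, user: &str, action_digest: &str) -> Option<Vec<u8>> {
    // 1. The shared ActionCache prefix that only CI-main may write to.
    fetch(bucket, &format!("ac/shared/{action_digest}"))
        // 2. Fall back to the user's private prefix, so end users can't
        //    poison each other's results by accident.
        .or_else(|| fetch(bucket, &format!("ac/users/{user}/{action_digest}")))
}

fn main() {
    match lookup_action_result("shared-build-cache", "some-user", "deadbeef-123") {
        Some(_) => println!("remote cache hit"),
        None => println!("cache miss, fall back to building"),
    }
}
```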

To expand on #5, you may find some of these ideas useful for the rules_ll project. As I'm sure you are aware, bazel generates a .d file before executing the actual compile job so it can prune out all the unused files during execution, which dramatically improves the cache hit rate. The really interesting part is that what bazel could truly benefit from is an AST of only the reachable code, and compiling only that. Parsing all the files and mapping out which symbols are actually needed would allow bazel to get a cache hit when a header file is changed but the .cc file that includes that header doesn't use the touched lines.

The other issue I've noticed is the process compilers use to generate debug symbols. Bazel usually requires the entire file to be recompiled if a variable name or a comment is changed, even though the stripped binary is actually the same. A much higher cache hit rate could be achieved by generating the debug symbols in a separate task (similar to split DWARF) and then letting the linker (which already takes forever) deal with combining everything.

C0mbatwombat commented 11 months ago

Sorry for reopening the discussion on this old(er) thread, but I think it would help adoption of TurboCache a lot if a Helm chart were provided. From the AWS example I can tell that the topology for a multi-node cluster is somewhat complex, so setting up the services, deployments, and storage would be much simpler with a Helm chart. I understand it is not trivial to do.

aaronmondal commented 11 months ago

I agree that a Helm chart is essentially a hard requirement for k8s users and would help adoption in that space a lot. It's very much planned to implement this. We're working towards this already, but it'll likely take some time until it's ready. Things that I'd consider "blockers" until we can implement a Helm chart:

This is only for the cache portion though. Note that I didn't include any form of user accounts etc., since I think it's better to figure that out once we have actual deployments running. We also probably want to split the current all-in-one binary into smaller binaries to reduce image size, but I wouldn't consider this a "hard blocker" (#394 works towards making this easier). We're working on support for OpenTelemetry, but I wouldn't consider it a blocker for an initial Helm release (#387 works towards that, as it lets us export metrics via an otel subscriber).

Release-wise, I think our informal "roadmap" until the release of a helm chart looks roughly like this:

  1. Have an actual release (as in "a git tag") at all. This is currently the highest priority and will be available soon.
  2. Make the container setup reproducible and distribute "official" images.
  3. Create a Helm chart for cache-only deployments. There will probably be some unforeseen issues that we need to figure out, and it seems easier to do this if we ignore remote execution initially.
  4. Add remote exec capabilities to the chart.

@C0mbatwombat This is more of a sketch at the moment. If you see anything that strikes you as unreasonable or "could be improved" or "not as important", please let me know. Feedback from K8s users is very valuable to us :heart:

C0mbatwombat commented 11 months ago

Makes sense, happy to see that you have a plan, thanks for all the hard work!

MarcusSorealheis commented 7 months ago

@blakehatch is this ticket closed by: https://github.com/TraceMachina/nativelink/issues/116

I think we could open another ticket for tracking GKE and/or Azure's Kubernetes, but this one will be first.

MarcusSorealheis commented 7 months ago

> Sorry for reopening the discussion on this old(er) thread, but I think it would help adoption of NativeLink a lot if a Helm chart were provided. From the AWS example I can tell that the topology for a multi-node cluster is somewhat complex, so setting up the services, deployments, and storage would be much simpler with a Helm chart. I understand it is not trivial to do.

@kubevalet (an alter GitHub ego) and @blakehatch are working on this one. Edited the name.