aaronmondal opened this issue 1 year ago
I thought about how to deal with Kubernetes, and I think the best way to handle this is to make a new implementation of the scheduler and have the json config file pick up that implementation and use it.
I don't remember all the details, but I do remember that a special scheduler (and possibly a modified worker impl) is likely to give the best experience, since you'd be able to talk natively to k8s, especially when it comes to auto-scaling.
After looking into this a bit more, I think the configuration part should be fairly straightforward.
Actually good autoscaling is probably a bit harder, but it seems doable as well.
I'd be interested to see what you are thinking for implementing it. I think the best way to proceed here is to write a proof-of-concept that is throwaway code, then hand it off to reviewers for an overview of how it will work. This kind of task is going to be tricky to do in stages because there is a high chance of late-stage design issues that are impossible to see early on.
So my recommendation is to make a really hacky version that fully works, but is not high-quality code and does whatever it takes to work, then we can discuss how to best put it in production.
I'm not sure I have a full view in my mind of how this would work with k8s. For example, should the scheduler talk directly to the cluster controller, or should an external scaling agent monitor the scheduler for scale-ups/downs? It gets even more complicated if lambda support is added.
Fortunately @rsenghaas expressed interest to help me out and we'll go over the codebase over the weekend and brainstorm a bit :relaxed: Also cc @jaroeichler hit me up if you're interested :kissing:
@allada A few random thoughts we're considering:
- The `cas` config files are fairly self-explanatory, but it might still be nice to bundle documentation in the CLI via `clap` and support command-line configuration rather than only the json config file.
- The `cas` executable currently starts all services. Investigate whether it makes sense to split off these services into standalone executables.
- (*) I'm not sure autoscaling is actually the best way to go. At least naive autoscaling probably won't do. We've had bad experiences with that in the past with remote execution: the RBE providers we tested seemed to optimize for ease of use and ease of scale, but not for raw performance.
Dynamically scaling worker nodes means that artifacts need to be allocated from cache nodes to those worker nodes. This can become a huge waste of power and resources if it's unclear what the underlying hardware actually looks like. E.g. splitting 2 threads from a 50-thread job out to a new machine could trigger gigantic data transfers. Anything that doesn't run a worker process on the same physical machine as the cache could cause a significant latency increase. Sure, clouds have their own backbone networks and stuff, but that's still orders of magnitude slower than having the data on the same physical machine.
However, scaling nodes instead of pods could be a different story. But in any case, I'm fairly certain that we won't be able to abstract away the hardware in any way and it could be quite tricky to get this right. I love it :laughing:
Additionally, somewhat unrelated to turbo-cache itself, but relevant for @rsenghaas (and @jaroeichler ?) for our tests: we should add an S3-compatible storage backend to the `ll up` cluster. For cloud tests we can use whatever S3-compatible storage backend cloud providers offer, but I'd be quite interested in testing a CubeFS deployment on bare-metal k8s.
Where to start...

- For the command lines, I'm not against it, but I would like to make sure that anything that can be done in the CLI can also be done in the config file. I don't expect complicated mappings being needed to do this, and `clap` probably works with `serde` (but I haven't looked). I would be surprised if enterprise deployments would use the CLI, but I can see that for experimenting and testing it could be useful.
- Eventually the CAS binary would likely need to be broken off into different binaries. But I have worked with projects in the past that separated things and it made experimenting and testing so complicated. What might work is to make a couple of binaries, plus one binary that contains all the services. This would possibly give a happy medium and would likely result in very little additional code.
- If you are referring to publishing "official" container images, I don't think this project is quite ready for that. There already is a `Dockerfile` in the `deployment-examples` that is probably close to what you are thinking. Specializing one to k8s is not a bad idea at all and is likely the best way to develop k8s support.
- I really like the idea of targeting `wasm`, but I would really like to hear an end-to-end use case and how `wasm` would help.
- Do you have any alternatives to auto-scaling? The design I was playing around with last year involved having the workers periodically publish information to the scheduler about which artifacts they had cached, then letting the scheduler find the most optimal worker based on the cache. Obviously this would add a bit of complexity, but most of the difficult complexity would be in an implementation of `find_worker_for_action_mut()` and the storage of the available items on the scheduler. The big problem I was having with this idea was that I could not think of a good, efficient algorithm to use, but after working on another project recently I think the best approach is one of the Approximate Nearest Neighbor algorithms, since they are quite fast and some of them can be interrupted and still give very good results (for example, we could look at the requests coming in to find a node for the job and, while searching for the best node, if we find the queue is growing too fast, just send the job off to whatever worker the search was on when it was paused). I do have a bit of experience dealing with these scaling problems in enterprise environments; we found optimizing the CAS to xfer the data onto the node faster had the best impact (which is why I thought making the scheduler smarter might be the best general solution for people).
- I think using `redis` makes sense as a possible backend. I started down the path of making a redis store implementation, but abandoned it because in my past experience redis was much more expensive than just using `s3`. But then again, our artifacts were likely 100x to 1000x larger than most use cases (each job used about 2-4k files @ ~8G-10G total, and the top 5% of files rarely had a cache hit, resulting in ~5-8G needing to be downloaded per job). The only concern I have at this time with making the scheduler use `redis` is the speed and quantity of locks needed. The thing I'd be super excited about with this idea is that it would finally make the scheduler no longer the single point of failure.
- Can you explain your thinking of `node taints` and `node affinities` in more detail?

FYI, at one point I experimented with using an S3 filesystem, but realized that to do this efficiently I'd need to write the fuse filesystem myself. The idea is that you could send `fuse` a serialized version of the digest tree and, when it is mounted, it would make the files visible in that specific layout in the mounted directory. The cool part is the `fuse` driver would only download the parts of the files that are actually needed, upon request. I found most jobs only use <10% of the actual files (even parts of the files) they say they need. The last big advantage is you could easily gather stats on which parts of files are most frequently used and xfer those onto nodes at boot time (or as a shared persistent multi-mount EBS).
Hahaha this conversation is escalating quickly :rofl:
Where to start...
> - For the command lines, I'm not against it, but I would like to make sure that anything that can be done in the CLI can also be done in the config file. I don't expect complicated mappings being needed to do this, and `clap` probably works with `serde` (but I haven't looked). I would be surprised if enterprise deployments would use the CLI, but I can see that for experimenting and testing it could be useful.
Agreed. A CLI is only useful for experimenting and config files should be the "default" usage pattern.
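To make the "anything in the CLI can also be done in the config file" idea concrete, here is a minimal sketch assuming `clap`'s derive API and `serde_json`; the `CasSettings` struct and its flags are invented for illustration and are not the actual config schema.

```rust
// Sketch only: `CasSettings` and its flags are hypothetical, not the real
// turbo-cache config schema. Both derives share one source of truth, so any
// field added here is automatically available via CLI *and* JSON config.
use clap::Parser;
use serde::Deserialize;

#[derive(Debug, Parser, Deserialize)]
#[command(name = "cas", about = "Hypothetical combined CLI/config example")]
struct CasSettings {
    /// Address the gRPC services listen on.
    #[arg(long, default_value = "0.0.0.0:50051")]
    listen_address: String,

    /// Optional JSON config file; values from it could be merged with flags.
    #[arg(long)]
    #[serde(default)]
    config: Option<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = CasSettings::parse();

    // If a config file is given, parse the same struct from JSON. A real
    // implementation would need a defined precedence (e.g. CLI over file).
    if let Some(path) = &cli.config {
        let file: CasSettings = serde_json::from_str(&std::fs::read_to_string(path)?)?;
        println!("listen_address from file: {}", file.listen_address);
    }
    println!("listen_address from CLI: {}", cli.listen_address);
    Ok(())
}
```

Because both derives hang off the same struct, adding a field exposes it in both places; the open question is just the merge/precedence strategy.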
> - Eventually the CAS binary would likely need to be broken off into different binaries. But I have worked with projects in the past that separated things and it made experimenting and testing so complicated. What might work is to make a couple of binaries, plus one binary that contains all the services. This would possibly give a happy medium and would likely result in very little additional code.
This is also what I was thinking. If I'm not mistaken, it should be possible to more-or-less leave the `cas` target as it is and just add additional "finer grained" `rust_binary` targets that use the service impls.
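As a rough illustration of that direction (every name below is a placeholder, not something that exists in the repo), a finer-grained binary would be little more than a thin `main` that wires up one of the existing service impls:

```rust
// Hypothetical `scheduler_main.rs`: everything below except tokio itself is a
// placeholder. The idea is that a standalone binary only parses its slice of
// the config and delegates to the same service impls the all-in-one `cas`
// binary uses today.
mod scheduler_service {
    // Stand-in for the existing scheduler service implementation.
    pub async fn spawn(listen_address: &str) -> Result<(), Box<dyn std::error::Error>> {
        println!("scheduler listening on {listen_address}");
        Ok(())
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A real binary would read only the scheduler section of the json config here.
    scheduler_service::spawn("0.0.0.0:50052").await
}
```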
> - If you are referring to publishing "official" container images, I don't think this project is quite ready for that. There already is a `Dockerfile` in the `deployment-examples` that is probably close to what you are thinking. Specializing one to k8s is not a bad idea at all and is likely the best way to develop k8s support.
Actually I don't care about final images, just about a CI-friendly way to build a k8s-friendly image. We usually build containers with nix directly from upstream sources and use incremental builds for the image layers similarly to how it works in Bazel. E.g. our remote execution image is built like that. No need to distribute a prebuilt image if you can fully reproduce it with the exact same final hash :smile: This might be a bit too incompatible with "most" setups though as it requires tooling to handle such image builds.
I do think k8s-specialized images in general seem like the lowest-hanging fruit at the moment though and I think it makes sense to focus on that first.
> - I really like the idea of targeting `wasm`, but I would really like to hear an end-to-end use case and how `wasm` would help.
First, there is the convenience that wasm basically obsoletes the entire container build pipeline. You can build the wasm module with a regular Rust toolchain and you're done and can immediately deploy it.
Second, wasm modules start up orders of magnitude faster than a container: user makes a request to the server -> server starts a "final mile" cache service wasm module -> wasm module serves the cache -> as soon as the user closes the connection, the wasm module is shut down. Storage, including in-memory stores, would persist across invocations and be handled "separately" from the scheduler. With that, one would only need to care about replicating storage (e.g. Redis/Dragonfly autoreplication) and could pretty much blindly deploy the actual scheduler to whatever edge network.
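To make that slightly more concrete, here is a tiny hypothetical sketch of what such a module could look like when built with a regular Rust toolchain; the exported function and the `cache_host`/`host_lookup` import are invented to show the shape, not an existing ABI.

```rust
// Hypothetical wasm module sketch (compile with e.g. --target wasm32-unknown-unknown).
// The host runtime and its `host_lookup` import are made up for illustration.

#[link(wasm_import_module = "cache_host")]
extern "C" {
    // Host-provided: returns 1 if the digest is present in the backing store.
    fn host_lookup(digest_ptr: *const u8, digest_len: usize) -> u32;
}

/// Exported entry point the host calls per request; the module itself stays
/// stateless so it can be started per connection and torn down right after.
#[no_mangle]
pub extern "C" fn has_blob(digest_ptr: *const u8, digest_len: usize) -> u32 {
    unsafe { host_lookup(digest_ptr, digest_len) }
}
```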
> - Do you have any alternatives to auto-scaling? The design I was playing around with last year involved having the workers periodically publish information to the scheduler about which artifacts they had cached, then letting the scheduler find the most optimal worker based on the cache. Obviously this would add a bit of complexity, but most of the difficult complexity would be in an implementation of `find_worker_for_action_mut()` and the storage of the available items on the scheduler. The big problem I was having with this idea was that I could not think of a good, efficient algorithm to use, but after working on another project recently I think the best approach is one of the Approximate Nearest Neighbor algorithms, since they are quite fast and some of them can be interrupted and still give very good results (for example, we could look at the requests coming in to find a node for the job and, while searching for the best node, if we find the queue is growing too fast, just send the job off to whatever worker the search was on when it was paused). I do have a bit of experience dealing with these scaling problems in enterprise environments; we found optimizing the CAS to xfer the data onto the node faster had the best impact (which is why I thought making the scheduler smarter might be the best general solution for people).
Hmm that does sound reasonable. I don't have an issue with autoscaling per-se, I just feel like current implementations are too inefficient and don't take { hardware | latency | physical distance | usage patterns } into account well. This is a really hard thing to get right though. It might actually be an interesting research topic to use ML for something like that, and apparently a lot of research has already gone into optimizing ANN search like this. But this also seems like more of a "luxury" problem and first we need to get unoptimized distribution working at all :sweat_smile:
Just in case you didn't stumble upon it already: https://github.com/erikbern/ann-benchmarks :blush:
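To pin down the "interruptible search" idea without claiming anything about the real `find_worker_for_action_mut()`, here is a sketch with made-up `Worker` types: score workers by how many of the action's input digests they already hold and, if the time budget runs out, return the best candidate found so far.

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

// Hypothetical type: the real scheduler's worker/action structs look different.
struct Worker {
    id: String,
    cached_digests: HashSet<String>,
}

/// Pick the worker with the largest overlap between its cached digests and the
/// action's inputs. If the search budget runs out (queue growing too fast),
/// return the best candidate found so far instead of finishing the scan.
fn find_worker_for_action<'a>(
    workers: &'a [Worker],
    action_inputs: &HashSet<String>,
    budget: Duration,
) -> Option<&'a Worker> {
    let deadline = Instant::now() + budget;
    let mut best: Option<(&Worker, usize)> = None;

    for worker in workers {
        let overlap = worker.cached_digests.intersection(action_inputs).count();
        if best.map_or(true, |(_, score)| overlap > score) {
            best = Some((worker, overlap));
        }
        // Interruptible: bail out with the current best once the budget is spent.
        if Instant::now() >= deadline {
            break;
        }
    }
    best.map(|(worker, _)| worker)
}
```

A real ANN index would replace the linear scan; the part worth keeping here is the pause-and-take-the-best-so-far behavior.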
> - I think using `redis` makes sense as a possible backend. I started down the path of making a redis store implementation, but abandoned it because in my past experience redis was much more expensive than just using `s3`. But then again, our artifacts were likely 100x to 1000x larger than most use cases (each job used about 2-4k files @ ~8G-10G total, and the top 5% of files rarely had a cache hit, resulting in ~5-8G needing to be downloaded per job). The only concern I have at this time with making the scheduler use `redis` is the speed and quantity of locks needed. The thing I'd be super excited about with this idea is that it would finally make the scheduler no longer the single point of failure.
Fortunately we have a bunch of cloud credits that we need to burn within the next 9 months, so I think testing something like this should be fine for us :innocent: This also seems like something really fun to implement, so I'll play around with this :relaxed:
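On the "speed and quantity of locks" concern, a per-action lock in redis would boil down to something like the following sketch (key names and TTL are invented, using the `redis` crate): one `SET NX PX` round trip per action, which is exactly why lock volume matters.

```rust
/// Try to take a short-lived lock on one action. The key layout and TTL are
/// made up for illustration; this is not an existing turbo-cache API.
fn try_lock_action(
    con: &mut redis::Connection,
    action_digest: &str,
    scheduler_id: &str,
) -> redis::RedisResult<bool> {
    let key = format!("action_lock:{action_digest}");
    // SET key value NX PX 5000 -> Some("OK") if we got the lock, None otherwise.
    let reply: Option<String> = redis::cmd("SET")
        .arg(&key)
        .arg(scheduler_id)
        .arg("NX")
        .arg("PX")
        .arg(5000)
        .query(con)?;
    Ok(reply.is_some())
}

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    let got_it = try_lock_action(&mut con, "abc123", "scheduler-0")?;
    println!("lock acquired: {got_it}");
    Ok(())
}
```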
Our primary use case at the moment is building LLVM, and I'd like to build it as frequently as possible. However, while LLVM gets a commit every ~10 min or so and commits tend to cause mass rebuilds, we'll just have very few build jobs for that running.
I believe one full build is ~10k actions; not sure how many files, but it's probably a lot smaller than your use case. If I remember correctly, inefficient cache/remote-exec usage brought us to ~200GB network transfer across ~50 jobs for from-scratch builds, but that should've been at most ~20 gigs and it should've happened on a single machine without any network transfer. Unfortunately the cloud config for those builds was outside of our control.
We've now pivoted to a different approach: the `ll` toolchain can reuse remote artifacts during local execution. So our use case is something like "1 CI runner on a 128 core machine builds 24/7 and everyone else downloads those baseline artifacts to their laptop and builds things locally". We need artifacts to be distributed as quickly as possible across the world, but for now we won't run exec jobs on different machines (and if we do, we can manage those manually). Here is a visualization.
> - Can you explain your thinking of `node taints` and `node affinities` in more detail?
Node affinity/label: Encourage deployments to go to the same node. For instance, enforce that a remote execution node always comes with a cache deployment that's potentially pre-populated before the remote execution service starts getting requests.
Node taint: Prevent a deployment from landing on a tainted node, for instance to prevent a remote execution job from being run on an edge node. Or to prevent an in-memory cache from being deployed to a "slow" storage node. Or to prevent S3 storage from being deployed to an in-memory cache node. Endless possibilities :rofl:
AFAIK affinities and labels can both be set and changed dynamically via e.g. Pulumi, Terraform or via operators. I'll have to look into whether this is actually useful though, and even if it is, it's probably a rather late-stage optimization.
> FYI, at one point I experimented with using an S3 filesystem, but realized that to do this efficiently I'd need to write the fuse filesystem myself. The idea is that you could send `fuse` a serialized version of the digest tree and, when it is mounted, it would make the files visible in that specific layout in the mounted directory. The cool part is the `fuse` driver would only download the parts of the files that are actually needed, upon request. I found most jobs only use <10% of the actual files (even parts of the files) they say they need. The last big advantage is you could easily gather stats on which parts of files are most frequently used and xfer those onto nodes at boot time (or as a shared persistent multi-mount EBS).
Sounds very interesting! Also sounds very custom tailored to the use case of remote caching which I'm not sure whether it's a good or bad thing. This sounds like it could improve performance quite significantly, but I wonder whether that's really worth the effort and complexity as compared to just "brute-force storing in memory". Just regarding monitoring, could this be a good use-case for an eBPF/XDP module? I've always wanted to play with that but never had a good usecase for it (until now?) :smile:
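Not a real FUSE driver, but a sketch of the data that would drive one: a serialized digest tree describing the mounted layout, plus a read path that fetches only the byte range actually requested. `fetch_range` is an invented stand-in for a ranged CAS/S3 read.

```rust
use std::collections::HashMap;

// Hypothetical digest tree: the mounted layout is described up front, but no
// file content is transferred until a read actually happens.
enum Node {
    File { digest: String, size: u64 },
    Dir { children: HashMap<String, Node> },
}

/// Placeholder for a ranged fetch from the CAS (e.g. an HTTP `Range` request
/// or a partial S3 GET). Only `len` bytes starting at `offset` move over the
/// network, which is the whole point of the lazy filesystem idea.
fn fetch_range(digest: &str, offset: u64, len: u64) -> Vec<u8> {
    println!("fetching {len} bytes at offset {offset} of {digest}");
    vec![0; len as usize]
}

/// What a fuse `read()` callback would boil down to: resolve the path in the
/// digest tree, then fetch only the requested slice.
fn read(root: &Node, path: &[&str], offset: u64, len: u64) -> Option<Vec<u8>> {
    let mut node = root;
    for part in path {
        match node {
            Node::Dir { children } => node = children.get(*part)?,
            Node::File { .. } => return None, // path descends into a file
        }
    }
    match node {
        Node::File { digest, size } => {
            let len = len.min(size.saturating_sub(offset));
            Some(fetch_range(digest, offset, len))
        }
        Node::Dir { .. } => None,
    }
}
```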
> 1 CI runner on a 128 core machine building 24/7

This is very similar to the very first iteration of the design at a previous job. We had 2 user roles, "CI-main" and "end user". Anything published onto the `main` branch had the CI flagged in a special way so that it would publish the artifacts to the global ActionCache. All other CI jobs did not have remote caching at all. End users would have a unique folder in a shared s3 bucket that they would publish their ActionCache to. All ContentAddressableStorage objects were stored together in a single s3 folder for all users and CI. So, to put it all together: when users invoked bazel, it would automatically start up a local proxy that bazel was connected to. This proxy would be configured to first attempt to download artifacts from the shared s3 folder (the one `CI-main` would publish to), then, if that failed, try the directory tied to the user. This gave us a high degree of confidence that end-users would not poison each other by accident.
`node properties` might solve the specific problem anyway.

> I wonder whether that's really worth the effort and complexity as compared to just "brute-force storing in memory"

The real problem is that it takes a lot of time to download the artifacts to be able to remote execute. Just running a CAS node can already store stuff in memory/disk without any issues to service requests faster. Remote execution, on the other hand, requires (by spec) all files to be accessible at invocation time of the job, but many of these files are never actually read, or when they are, only parts of them are read.

To further #5, you may find some ideas for the `rules_ll` project. As I'm sure you are aware, bazel generates a `.d` file first before executing the actual compile job. This is so it can then prune out all the unused files during execution, which dramatically improves cache hits. The really interesting thing is that what bazel could benefit from is an AST of only reachable code, and to compile only that. Parsing all the files and mapping out which symbols are actually needed would allow bazel to get a cache hit when a header file is changed but the `.cc` file that imports that header does not use the touched lines.
The other issue I've noticed is the process that compilers use to generate debug symbols. Bazel usually requires the entire file to be re-compiled if a variable name or a comment is changed, even though the stripped binary is actually the same. A much higher cache hit rate could be achieved by generating the debug symbols in a separate task, similar to DWARF, and then letting the linker (which already takes forever) deal with combining everything.
Sorry for opening the discussion on this old(er) thread, but I think it would help a lot for adoption of TurboCache if a Helm chart would be provided. From the AWS example I can tell that the topology for a multi-node cluster is somewhat complex, so setting up the services, deployments, storage would be much simpler with a Helm chart. I understand it is not trivial to do.
I agree that a Helm chart is essentially a hard requirement for k8s users and would help adoption in that space a lot. It's very much planned to implement this. We're working towards this already, but it'll likely take some time until it's ready. Things that I'd consider "blockers" until we can implement a Helm chart:
This is only for the cache portion though. Note that I didn't include any form of user accounts etc., since I think it's better to figure that out once we have actual deployments running. We also probably want to split the current all-in-one binary into smaller binaries to reduce image size, but I wouldn't consider this a "hard blocker" (#394 works towards making this easier). We're working on support for OpenTelemetry, but I wouldn't consider it a blocker for an initial helm release (#387 works towards that, as it lets us export metrics via an otel subscriber).
Release-wise, I think our informal "roadmap" until the release of a helm chart looks roughly like this:
@C0mbatwombat This is more of a sketch at the moment. If you see anything that strikes you as unreasonable or "could be improved" or "not as important", please let me know. Feedback from K8s users is very valuable to us :heart:
Makes sense, happy to see that you have a plan, thanks for all the hard work!
@blakehatch is this ticket closed by: https://github.com/TraceMachina/nativelink/issues/116
I think we could open another ticket for tracking GKE and/or Azure's Kubernetes, but this will be first.
> Sorry for opening the discussion on this old(er) thread, but I think it would help a lot for adoption of NativeLink if a Helm chart would be provided. From the AWS example I can tell that the topology for a multi-node cluster is somewhat complex, so setting up the services, deployments, storage would be much simpler with a Helm chart. I understand it is not trivial to do.
@kubevalet (an alter GitHub ego) and @blakehatch are working on this one. Edited the name.
I'd like to run turbo-cache in a k8s cluster deployed with Pulumi so that we can automatically set it up for users as part of rules_ll. Simple YAML manifests would be usable for users of raw k8s, Terraform, and Pulumi.
I'd be willing to work on this :relaxed: