ccache / ccache

ccache – a fast compiler cache
https://ccache.dev

S3 storage backend #1201

Open m-ildefons opened 2 years ago

m-ildefons commented 2 years ago

Dear Mr. Rosdahl,

A few weeks ago, I implemented a secondary storage backend for S3. The code is currently in proof-of-concept status and lives here. The implementation is based on the aws-cpp-sdk at the moment and has been tested manually against a local S3 endpoint. I would very much like to contribute this feature upstream and would appreciate any feedback.
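
In essence (a heavily simplified sketch, not the actual PoC code; bucket, key, and endpoint are placeholders), the backend boils down to a put/get pair on top of the SDK's S3 client:

```cpp
// Heavily simplified sketch of the backend's put/get path with aws-cpp-sdk;
// bucket, key, and the local endpoint are placeholder values.
#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/core/http/Scheme.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <iostream>

int main() {
  Aws::SDKOptions options;
  Aws::InitAPI(options);
  {
    Aws::Client::ClientConfiguration config;
    config.endpointOverride = "localhost:9000"; // e.g. a local S3 endpoint
    config.scheme = Aws::Http::Scheme::HTTP;    // for local testing only
    Aws::S3::S3Client client(config);

    // put(key, value): store one cache entry as an S3 object.
    Aws::S3::Model::PutObjectRequest put;
    put.SetBucket("ccache");
    put.SetKey("4e1243bd22c66e76c2ba9eddc1f91394"); // placeholder cache key
    auto body = Aws::MakeShared<Aws::StringStream>("ccache-s3");
    *body << "serialized cache entry";
    put.SetBody(body);
    if (!client.PutObject(put).IsSuccess()) {
      // A failed store is not fatal for ccache; just fall through.
    }

    // get(key): a NoSuchKey error outcome is simply a cache miss.
    Aws::S3::Model::GetObjectRequest get;
    get.SetBucket("ccache");
    get.SetKey("4e1243bd22c66e76c2ba9eddc1f91394");
    auto outcome = client.GetObject(get);
    if (outcome.IsSuccess()) {
      std::cout << outcome.GetResult().GetBody().rdbuf();
    }
  }
  Aws::ShutdownAPI(options);
  return 0;
}
```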

I can think of several use cases where this feature would come in handy. For example, when using GitHub-hosted runners to build large code bases, the S3 backend would provide an economical way to keep the cached data longer than GitHub's built-in storage allows. This matters when builds happen only sporadically, as GitHub's built-in storage has rather short retention policies. According to my napkin math, it is also more economical than using the existing HTTP or Redis backends, because S3 is much cheaper than an EC2 instance or a managed Redis instance with the same amount of storage - at least within AWS.
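
To make that concrete with illustrative list prices (these change, so treat it as an order-of-magnitude estimate only): 100 GB in S3 Standard at roughly $0.023/GB-month is about $2.30/month, while the same 100 GB on a gp3 EBS volume behind an HTTP or Redis server already costs around $8/month for the volume alone, before adding the cost of the EC2 instance serving it.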

Best regards.

afbjorklund commented 2 years ago

There were some similar discussions regarding Azure Blob storage and azure-sdk-for-cpp here:

Something that would be nice to have is a proxy that could talk to the cloud storage backends.

This exists for the Redis backend today, but it would be nice also for HTTP - and maybe even for File...?

This proxy would set up the connection to the backend and handle the TLS overhead, the authentication, etc.

Then the local communication can use an efficient Unix socket and just worry about get/put (and remove).

It could also use some "plugin" system (or different servers?) to handle the bloat of these cloud SDK libraries.

Currently ccache doesn't do any SSL because of this overhead (both in startup time and in code size), even though using SSL might be mandatory in these environments.
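
To make the idea concrete, the ccache side of such a proxy conversation could be as small as this (the socket path and the line-based get/put protocol are invented for illustration):

```cpp
// Sketch of the ccache-side client for a hypothetical local proxy: connect
// to a Unix socket, send a one-line request, read the object back. The
// proxy would own the TLS session and credentials for the cloud backend.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <string>

int main() {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) { perror("socket"); return 1; }

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, "/tmp/ccache-proxy.sock", sizeof(addr.sun_path) - 1);
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    perror("connect"); // no proxy running; treat as a cache miss
    return 1;
  }

  // Invented wire format: "get <key>\n"; the proxy streams the object back.
  const std::string request = "get 4e1243bd22c66e76c2ba9eddc1f91394\n";
  write(fd, request.data(), request.size());

  char buf[4096];
  ssize_t n;
  while ((n = read(fd, buf, sizeof(buf))) > 0) {
    fwrite(buf, 1, static_cast<size_t>(n), stdout);
  }
  close(fd);
  return 0;
}
```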

afbjorklund commented 2 years ago

> According to my napkin math, it is also more economical than using the existing HTTP or Redis backends,

A performance vs. pricing comparison between the different AWS alternatives would also be great to have.

Compare https://github.com/ccache/ccache/issues/1152#issuecomment-1240300915

Like a blog post or some such?

afbjorklund commented 2 years ago

You should also compare with s3-fuse (and FileStorage)

m-ildefons commented 2 years ago

Hey Andreas,

Thanks for the input. I've seen the Azure blob storage discussion, but AFAIK Azure blob storage uses a different protocol, which is why I created a separate discussion for S3. That discussion also somewhat changed course, away from adding support natively and towards a setup using blobfuse. I'm no fan of layering a file abstraction in between, because I don't think it's going to help performance, and the additional complexity would probably drive users away from actually putting it into production. Keep in mind that this should ideally be useful in situations like CI pipelines, where there are only limited possibilities for running additional daemons, mounting filesystems, etc.

And not to forget: the FileStorage backend is basically just implementing a local object store on top of a file system, exposing a simple get/put/delete API internally to ccache. Cutting that out of the equation when there already is object storage available would be preferable, IMO.

The two advantages I see for the proxy/s3-fuse solutions are a) they might allow reuse of a single TLS connection, eliminating the handshake overhead, and b) they would separate the code bases so that ccache itself doesn't have to pull in so much protocol-specific code. I'd quite like to see a performance comparison.

In this implementation, TLS is handled by the aws-cpp-sdk. If needed, the part of the S3 protocol we use is simple enough to implement in ccache directly, but that would require adding TLS support there. The overhead of the TLS handshake is unfortunate, but it often pales in comparison to the compilation time some sources require, and this is especially true inside limited CI environments like GitHub Actions. So even if it's not as fast as reusing a single connection, there are gains to be made. I'll admit that pulling in large parts of the aws-cpp-sdk, of which we only need a very small subset, isn't really nice.
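
For a sense of how small that subset is, here is a minimal sketch using cpp-httplib (the library ccache's HTTP backend is built on), assuming an endpoint that allows anonymous access - i.e. it leaves out the SigV4 request signing that AWS itself requires:

```cpp
// Sketch: the S3 subset ccache needs, spoken as plain HTTP verbs against
// an anonymous local endpoint (no SigV4 signing, no TLS). Endpoint,
// bucket, and key are placeholders.
#include <httplib.h>
#include <iostream>
#include <string>

int main() {
  httplib::Client client("http://localhost:9000"); // placeholder endpoint

  const std::string object = "/ccache-bucket/4e1243bd22c66e76c2ba9eddc1f91394";

  // PUT /<bucket>/<key>: store a cache entry.
  client.Put(object, "serialized cache entry", "application/octet-stream");

  // GET /<bucket>/<key>: 200 is a hit, 404 (NoSuchKey) is a miss.
  if (auto res = client.Get(object); res && res->status == 200) {
    std::cout << res->body;
  }

  // DELETE /<bucket>/<key>: evict the entry.
  client.Delete(object);
  return 0;
}
```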

The proxy daemon with a Unix socket to ccache is something I haven't thought about yet. It sounds like it would provide the best performance, at the cost of having to configure an additional daemon. Can you provide me with a link to that Redis proxy? I failed to find it myself.

I might do a blog post about this feature once there is a clear path to upstreaming it; the pricing argument would definitely be discussed there in more detail. Keep in mind that so far I've just done some napkin math.

One thing to keep in mind, too, is that S3 doesn't necessarily mean AWS. There are multiple solutions for providing self-hosted S3 endpoints, and these don't necessarily need to use TLS. Using such an S3 endpoint as a secondary cache can be very advantageous when the endpoint is already there, and on a private network TLS may not be a requirement.
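
If this were wired into the existing secondary storage configuration, one could imagine something like the following (purely hypothetical syntax - an s3:// scheme does not exist in ccache today):

```
# Hypothetical: an s3:// scheme next to the existing http:// and redis://
# ones, pointing at a self-hosted endpoint on a private network (no TLS).
secondary_storage = s3://minio.internal:9000/ccache-bucket
```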

afbjorklund commented 2 years ago

A native solution would be "nice", I'm just saying that there are alternatives... (for completeness, you could also "just use a disk" and mount it over NFS etc.*)

* as in EFS: https://docs.aws.amazon.com/efs/

And the solution could handle both, with the daemon being a performance add-on. That is how it worked with the proxies for memcached and Redis: they are optional.

The original one for memcached was Couchbase's "moxi".

> Can you provide me with a link to that Redis proxy? I failed to find it myself.

https://github.com/twitter/twemproxy (nutcracker)

https://github.com/ccache/ccache/wiki/Redis-storage

afbjorklund commented 2 years ago

> I'll admit that pulling in large parts of the aws-cpp-sdk, of which we only need a very small subset, isn't really nice.

To be honest, I don't really know what these SDKs provide in addition to the existing HTTP/HTTPS storage?

m-ildefons commented 2 years ago

Thanks for pulling up that link. I can only speak for the aws-sdk, as I haven't used the Azure one. It provides a convenient way to access the various HTTP endpoints from C++, with the right headers set, the right request types, TLS handled, and so on. To do that, it provides lots of classes modelling the data returned by various API calls, plus lots of utility functions and classes for creating the objects that you send to the API. So naturally there's a lot more in there than just the GET/PUT/DELETE part that we need, e.g. lots of classes for handling ACLs, lifecycles, quotas, etc.

In any case, I just stumbled across https://github.com/mozilla/sccache, which has storage backends for both Azure and S3.

afbjorklund commented 2 years ago

I think it will need a plugin system before it can depend on external libraries like SSL or an SDK (without being turned OFF).

jrosdahl commented 2 years ago

Hi @m-ildefons, thanks for the feature request. It would certainly be good to support S3 storage.

I am unfortunately not interested in adding an S3 storage backend to the current set of backends if the backend depends on AWS-SDK. I have now written some background and thoughts about it here: #1214

m-ildefons commented 2 years ago

Hi @jrosdahl, thanks for getting back to me. The proposal for the long-lived backend service sounds very reasonable to me. I've also taken a quick look at the https://github.com/mozilla/sccache implementation, and it seems they are doing exactly that - except they use a TCP socket instead. Off the top of my head, there are several things we should think of when implementing that backend daemon (a rough skeleton is sketched below):

1. We should avoid situations where user/group permissions can cause trouble, e.g. in directories that are shared between multiple users.
2. Parallel compilations will result in multiple ccache processes accessing the same backend process at the same time; this should not cause contention.
3. We might want to consider shared memory instead of sockets for performance reasons, although that requires a carefully designed API for the backend daemon.
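
For illustration only, a minimal sketch of such a daemon's accept loop (the socket path, the wire protocol, and the permission handling are all assumptions, not a design):

```cpp
// Minimal sketch of a long-lived backend daemon: accepts parallel ccache
// clients over a Unix socket. The socket path and the per-connection
// protocol are hypothetical placeholders.
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>
#include <thread>

static void handle_client(int fd) {
  char buf[4096];
  ssize_t n;
  while ((n = read(fd, buf, sizeof(buf))) > 0) {
    // Parse get/put/remove requests here and forward them to the cloud
    // backend over a pooled, already-authenticated TLS connection.
    write(fd, buf, static_cast<size_t>(n)); // placeholder: echo back
  }
  close(fd);
}

int main() {
  const char* path = "/tmp/ccache-proxy.sock"; // hypothetical path
  int listener = socket(AF_UNIX, SOCK_STREAM, 0);
  if (listener < 0) return 1;

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  unlink(path); // remove a stale socket from a previous run
  if (bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0)
    return 1;
  chmod(path, 0600); // point 1: restrict the socket to the owning user
  listen(listener, SOMAXCONN);

  for (;;) {
    int client = accept(listener, nullptr, nullptr);
    if (client < 0) continue;
    // Point 2: one thread per connection, so parallel compilations
    // don't serialize behind each other.
    std::thread(handle_client, client).detach();
  }
}
```

Shared memory (point 3) would avoid the extra copy through the socket, but as noted it needs a much more carefully designed API.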