SuperSandro2000 opened 1 month ago
The latest progress I know of is my and Valentin's notes. @edef has been the primary person spending time on the data analysis and understanding the costs. I think the next steps would be to copy the historical cache into long-term, infrequently accessed (cheap) storage; that data would still be available to researchers, and commonly needed portions could be restored. This reduces costs, provides a fallback, and gets us comfortable working with the scope of the problem. After that, we garbage collect unreachable and old paths.
Note: there is also an ongoing investigation into Tigris as an alternative storage provider; it remains to be seen whether this is viable.
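As a rough illustration of the cold-storage step above, assuming the cache stays on S3, an object lifecycle rule could move NAR payloads to cheaper storage classes after a fixed age. This is only a sketch: the bucket name, prefix, and day thresholds below are placeholders, not the real cache.nixos.org configuration.

```python
# Hypothetical sketch: age-based lifecycle transitions for cache objects.
# Bucket name, prefix, and thresholds are illustrative placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-nix-cache",  # placeholder, not the real bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-nars",
                "Filter": {"Prefix": "nar/"},  # only NAR payloads; keep .narinfo metadata hot
                "Status": "Enabled",
                "Transitions": [
                    # Days are counted from object creation, not last access.
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```

Note that lifecycle transitions are age-based; if we want "infrequently accessed" rather than simply "old", access-pattern-based tiering (S3 Intelligent-Tiering) would be the alternative. Retrieval from deep archive tiers takes hours and carries per-request costs, so any commonly restored portions would need to stay in a hotter tier.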
Then we turn our attention to the problem of the rate of growth. Here we can split the problem into two caches: a CI/build cache and a user-facing cache. The CI/build cache serves PRs, staging, and so on, and can be cleaned out more aggressively. The user-facing cache would be longer-lived. This comes with additional complexity, but can be used to slow the rate of growth.
Luckily, AWS has extended its sponsorship of our cache, but we do need to continue efforts to reduce costs. This sort of work is high-impact, requires a budget, and requires commitment, and thus should be coordinated. Related to this need is a proposal to establish the role of an Executive Director. That is a larger topic, but such a role would be responsible for coordinating and enabling the people working on this problem.
A specific effort I would like to pursue is to identify the high-volume users of the cache and either ask them for funding or work with them to mitigate the costs they incur.
Another solution is to encourage enough Nix adoption as critical infrastructure so that we can obtain more partnerships, sponsors, grants, and funding to cover the cache costs.
I can mostly agree with tomberek here.
From a purely financial perspective, moving less-accessed NARs to something like S3 Glacier is a good solution for our currently cached objects, and in the long term we should seek to collaborate with more organizations and companies for continued funding (see my previous answers on how I believe these should be handled).
Now, regarding actual garbage collection: more "transitive" items in the cache (such as those from staging and PRs) should be some of the first to go, as it is highly unlikely they will ever be used again (this could also be done on a regular basis for builds older than X years). I would also not be against GCing "leaf" packages, as in many cases they have very little effect on builds for consumers and (especially as nixpkgs grows larger) can account for a good chunk of packages.
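A minimal sketch of what such an age-plus-protection policy could look like, assuming we already have (a) a listing of cached store path hashes with timestamps and (b) a protected set derived from release channels and everything merged to master; every name and threshold here is illustrative, not existing tooling:

```python
# Illustrative policy sketch, not actual cache.nixos.org tooling.
# Inputs are assumed to exist: a listing of (store path hash, last-modified)
# pairs for the bucket, and a "protected" set of hashes we never collect.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=5 * 365)  # "builds older than X years"; X is a policy knob

def gc_candidates(cache_listing, protected_hashes, now=None):
    """Yield store path hashes that are both old and outside the protected set."""
    now = now or datetime.now(timezone.utc)
    for store_hash, last_modified in cache_listing:
        if store_hash in protected_hashes:
            continue  # pinned: reachable from a channel, a release, or master
        if now - last_modified < MAX_AGE:
            continue  # too young to collect under this policy
        yield store_hash

# Example usage with made-up data:
listing = [
    ("abc123...", datetime(2017, 1, 1, tzinfo=timezone.utc)),
    ("def456...", datetime(2024, 6, 1, tzinfo=timezone.utc)),
]
print(list(gc_candidates(listing, protected_hashes={"def456..."})))
```

The interesting work is of course in building the inputs, i.e. deciding what goes into the protected set and getting reliable access statistics out of S3; the policy itself is the easy part.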
If the worst comes to the worst and we find ourselves in a position of requiring mass garbage collection, I believe we should set aside packages we deem critical (similar to what we do in releases, but obviously a somewhat larger set) that won't be garbage collected, based on community feedback and S3 statistics. We should also prioritize keeping as many cached sources as possible, since an old URL going 404 is much more likely than a given build not being reproducible.
Disk space is cheap these days, but caching the whole history of the world can still be wasteful; we must balance availability with utility, and a good start to doing so would be to survey stakeholders in the ecosystem to find out what people are doing that would require all these historical build artifacts. As discussed in my answer to #16, I believe achieving full reproducibility is an important step towards addressing this, since we could readily empty most of the cache, confident that we could rebuild anything we needed in the future. In the meantime, we can address cache bloat by other means, such as distributing artifacts among more parties (no longer relying exclusively on S3), or by moving more of the cache to "cold" storage.
I think it would be great if we could at least keep everything that was ever merged into master.
I am not that worried about store paths that were only ever used in CI and staging and never merged into master. I think it would be fine to delete those if we manage to identify them and it saves us a significant chunk of the storage costs. I do think we should keep everything we ever fetched from the web, if at all possible.
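One way to approximate "everything we ever fetched from the web" mechanically would be to look for fixed-output paths in the cache metadata. The sketch below assumes the optional "CA: fixed:..." field in .narinfo files marks such paths; that assumption should be verified against the actual narinfo format before relying on it, and the example blob is entirely made up.

```python
# Sketch: classify a .narinfo blob as a fixed-output ("fetched") path.
# Assumes the optional "CA: fixed:..." field marks fixed-output paths;
# verify against the real narinfo format before using.

def parse_narinfo(text: str) -> dict:
    """Parse the simple 'Key: value' lines of a .narinfo file into a dict."""
    fields = {}
    for line in text.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            fields[key] = value
    return fields

def is_fetched_source(narinfo_text: str) -> bool:
    """True if the path looks like a fixed-output derivation output (e.g. a source tarball)."""
    ca = parse_narinfo(narinfo_text).get("CA", "")
    return ca.startswith("fixed:")

example = """StorePath: /nix/store/abc123...-hello-2.12.tar.gz
URL: nar/xyz.nar.xz
NarHash: sha256:0000000000000000000000000000000000000000000000000000
NarSize: 12345
CA: fixed:sha256:1111111111111111111111111111111111111111111111111111
"""
print(is_fetched_source(example))  # True under the assumed format
```

Paths flagged this way, together with the closure of everything reachable from master, would form the "never collect" set; everything else, in particular paths only ever referenced from PRs and staging, becomes eligible for the more aggressive policies discussed above.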
I hope that content-addressed derivations and increased reproducibility, if we manage to keep track of it, could work in our favor to slow the rate of growth.
My first and foremost concern about garbage collecting the cache is the possibility of deleting data that is otherwise inaccessible online, whether through services shutting down or authors deleting their work. Base sources should be kept, and everything else should be reproducible from derivations in Nixpkgs commits across time. Anything that is infrequently accessed but kept in this manner should be archived to cheaper, slower storage. I also agree with the others who mentioned maintaining a separation between the cache facing users and the cache used by CI. Packages from long-dead PRs don't need to be kept cached forever and should be trimmed more aggressively, especially given the high turnover those packages might see as they get refined before merging.
Question
The cache is too big and it can't keep growing at the current rate. What is your opinion on whether it should be garbage collected, and if so, how and what?
Candidates I'd like to get an answer from
No response
Reminder of the Q&A rules
Please adhere to the Q&A guidelines and rules