Open delroth opened 5 months ago
The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases (see #397). Self-hosting sounds very risky to me. Releases are not in fact easy to reconstruct. If the release server were to die entirely, there is no way we can feasibly reconstruct it.
Self-hosting could be an option for releases that have been removed (or glaciered) from releases.nixos.org.
> The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases
But... this does nothing for data transfer costs, which are 85% of the S3 bill? Am I missing something?
> Releases are not in fact easy to reconstruct. If the release server were to die entirely, there is no way we can feasibly reconstruct it.
Can you be more clear here? What cannot be reconstructed? Channel scripts just fetch from Hydra + cache (from where AFAICT the data is not being removed) and run two data extractor programs (nix-index + nix-generate-debuginfo) which, while slightly annoying, don't seem like they would be majorly difficult to re-run on old data.
Also, do you realize the contradiction in claiming this as a problem:
> If the release server were to die entirely, there is no way we can feasibly reconstruct it.
But then also suggesting:
> The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases
What data loss risk do you actually care about if you're suggesting deleting 75% of the data?
Reconstruction cannot depend on cache.nixos.org, since we're going to GC that too. In particular ISOs etc. will be deleted.
Also, our disaster recovery cannot involve running some script that doesn't exist and that would take days to run.
> What data loss risk do you actually care about if you're suggesting deleting 75% of the data?
I care about the releases that we don't expunge (and the ones that we do would be on Glacier, so we can always bring them back).
The bandwidth increase in eu-west-1 is weird: in March 2023 it was 1832 GB ($54.96), and as recently as October 2023 it was just 1865 GB ($167.93). So the increase to 34204 GB ($2958.59) is hard to explain. Maybe there is a Fastly misconfiguration that is causing the CDN to be less effective?
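To make the jump easier to reason about, here is a rough sketch that derives the effective per-GB price and the volume growth from the three figures quoted above. The labels and variable names are only illustrative, and the month of the 34204 GB figure is not stated, so it is labelled "latest". Taken at face value, the numbers suggest two separate changes: the effective rate roughly tripled between March and October 2023 while volume stayed flat, and volume then grew by roughly 18x at about the same rate.

```python
# Figures quoted in the comment above: (label, GB transferred, USD billed).
# "latest" is a placeholder label; the month of that figure is not stated.
samples = [
    ("2023-03", 1832, 54.96),
    ("2023-10", 1865, 167.93),
    ("latest", 34204, 2958.59),
]

# Effective price per GB for each sample.
rates = {label: usd / gb for label, gb, usd in samples}

# How much the transferred volume grew between October 2023 and the latest figure.
volume_growth = samples[2][1] / samples[1][1]

# How much the effective per-GB rate changed between March and October 2023.
rate_change = rates["2023-10"] / rates["2023-03"]

for label, rate in rates.items():
    print(f"{label}: ${rate:.4f}/GB")
print(f"volume growth (2023-10 -> latest): {volume_growth:.1f}x")
print(f"rate change (2023-03 -> 2023-10): {rate_change:.1f}x")
```

This only restates the billing figures; it says nothing about why either change happened, which is what the Athena logs would need to answer.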
> Reconstruction cannot depend on cache.nixos.org, since we're going to GC that too. In particular ISOs etc. will be deleted.
OK, but then you also don't care about reconstruction, since you're deleting the original data. Note that I still think that's a terrible idea for stuff that's linked to a channel bump (and only stuff that was a channel version at some point would be on releases.nixos.org). cc @edef1c because I was not under the impression that this was the plan.
> Also, our disaster recovery cannot involve running some script that doesn't exist and that would take days to run.
I really don't see why not. Unlike the cache, releases.nixos.org is not in much of a critical path, and the only stuff that really needs to recover quickly and have high availability would be the latest version for each channel.
> Maybe there is a Fastly misconfiguration that is causing the CDN to be less effective?
Not that I can tell, and there have been no configuration changes since Sept 2023. I'm waiting for @zimbatm to provision me the right AWS access to look through the Athena logs.
@edef1c is this still relevant?
Assuming that 100% of the eu-west-1 S3 bill is releases.nixos.org (there are a few other minor buckets, e.g. tarballs.nixos.org), these mostly static ~32TB of data are costing us between $1.5K and $3.5K/month right now[^1].
This is IMO a perfect opportunity to start ramping up S3 self-hosting for the NixOS infra. Unlike cache.nixos.org:
2x SX134 at Hetzner with 10Gbps uplinks would cost us ~$659/month for 2x {2 x 960 GB flash + 10 x 16 TB hard drives (128TB usable with 2-disk failure tolerance)}. We could then also use the extra capacity in the future to consider self-hosting parts of cache.nixos.org to offset bandwidth costs.
We would still CDN this via Fastly.
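As a back-of-the-envelope check on the numbers above, the sketch below recomputes the usable capacity per server and the monthly difference against the current S3 bill. It assumes decimal terabytes, ignores filesystem and redundancy-scheme overhead beyond the two parity disks, and leaves out Fastly costs, migration effort and operations time. All constant names are made up for illustration.

```python
# Hardware and cost figures taken from the proposal above.
DRIVES_PER_SERVER = 10
DRIVE_TB = 16
PARITY_DRIVES = 2             # "2 disks failure tolerance"
HETZNER_USD_PER_MONTH = 659   # total for both SX134 servers
S3_USD_PER_MONTH = (1500, 3500)  # current low/high estimate for eu-west-1
DATASET_TB = 32               # approximate size of releases.nixos.org

# Usable capacity per server once two drives' worth of space goes to parity.
usable_tb_per_server = (DRIVES_PER_SERVER - PARITY_DRIVES) * DRIVE_TB

# Spare capacity per server after storing the current dataset once.
headroom_tb_per_server = usable_tb_per_server - DATASET_TB

# Monthly difference versus S3 at the low and high end of the current bill.
monthly_savings = tuple(s3 - HETZNER_USD_PER_MONTH for s3 in S3_USD_PER_MONTH)

print(f"usable per server: {usable_tb_per_server} TB")
print(f"headroom per server: {headroom_tb_per_server} TB")
print(f"monthly difference vs S3: ${monthly_savings[0]} to ${monthly_savings[1]}")
```

Whether the two servers would each hold a full copy or share the data is not specified here, so the capacity is reported per server rather than combined.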
[^1]: The cost is rapidly increasing due to factors that look organic in nature, but more analysis might be able to find artificial sources that are increasing the costs on S3. In any case, $1.5K looks like the minimal baseline cost we could get down to.