NixOS / infra

NixOS configurations for nixos.org and its servers

Self-hosted releases.nixos.org #408

Open delroth opened 5 months ago

delroth commented 5 months ago

Assuming that 100% of the eu-west-1 S3 bill is releases.nixos.org (there are a few other minor buckets, e.g. tarballs.nixos.org), this mostly static ~32TB of data is costing us between $1.5K and $3.5K/month right now[^1].

This is IMO a perfect opportunity to start ramping up S3 self-hosting for the NixOS infra. Unlike cache.nixos.org:

2x SX134 at Hetzner with 10Gbps uplinks would cost us ~$659/month for 2x {2 x 960 GB flash + 10 x 16 TB hard drive (128TB with 2-disk failure tolerance)}. We could then also use the extra capacity in the future to consider self-hosting parts of cache.nixos.org to offset bandwidth costs.

We would still CDN this via Fastly.

[^1]: The cost is rapidly increasing from factors that look organic in nature, but more analysis might be able to find artificial sources that are increasing the costs on S3. In any case, $1.5K looks like the minimal baseline costs we could get down to.
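
For readers checking the numbers, here is a minimal sketch of the arithmetic behind this proposal. It assumes that "2 disks failure tolerance" means two of the ten drives per server hold parity (a RAID 6 or raidz2 style layout, which the comment does not spell out), and it ignores Fastly, operations time, and any other buckets on the same bill. Whether the two servers would mirror each other or pool their capacity is also not stated. All function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the comment above.
# Assumption: two drives' worth of capacity per server is used for parity.

DISKS_PER_SERVER = 10      # "10 x 16 TB hard drive"
DISK_TB = 16
PARITY_DISKS = 2           # assumed meaning of "2 disks failure tolerance"

SELF_HOSTED_USD = 659      # quoted monthly price for 2x SX134
S3_USD_RANGE = (1500, 3500)  # quoted monthly S3 bill range
DATA_TB = 32               # quoted size of releases.nixos.org


def usable_tb(disks: int, disk_tb: int, parity: int) -> int:
    """Usable capacity of one server when `parity` drives hold redundancy."""
    return (disks - parity) * disk_tb


def monthly_difference(s3_usd: int, self_hosted_usd: int) -> int:
    """Difference between the S3 bill and the hosting price, in USD per month."""
    return s3_usd - self_hosted_usd


if __name__ == "__main__":
    capacity = usable_tb(DISKS_PER_SERVER, DISK_TB, PARITY_DISKS)
    print(f"usable per server: {capacity} TB for {DATA_TB} TB of data")
    for s3_usd in S3_USD_RANGE:
        diff = monthly_difference(s3_usd, SELF_HOSTED_USD)
        print(f"S3 at ${s3_usd}/month: ${diff}/month above the hosting price")
```

With the quoted figures, each server comes out at 128 TB usable against ~32 TB of data, and the hosting price sits $841 to $2,841 per month below the quoted S3 range.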

edolstra commented 5 months ago

The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases (see #397). Self-hosting sounds very risky to me. Releases are not in fact easy to reconstruct. If the release server were to die entirely, there is no way we can feasibly reconstruct it.

Self-hosting could be an option for releases that have been removed (or glaciered) from releases.nixos.org.

delroth commented 5 months ago

> The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases

But... this does nothing for data transfer costs, which are 85% of the S3 bill? Am I missing something?


> Releases are not in fact easy to reconstruct. If the release server were to die entirely, there is no way we can feasibly reconstruct it.

Can you be more specific here? What cannot be reconstructed? Channel scripts just fetch from Hydra + cache (from where AFAICT the data is not being removed) and run two data extractor programs (nix-index + nix-generate-debuginfo) which, while slightly annoying, don't seem like they would be majorly difficult to re-run on old data.

delroth commented 5 months ago

Also, do you realize the contradiction in claiming this as a problem:

> If the release server were to die entirely, there is no way we can feasibly reconstruct it.

But then also suggesting:

> The current plan is to reduce the S3 bill for releases.nixos.org by expunging old releases

What data loss risk do you actually care about if you're suggesting deleting 75% of the data?
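
As a rough illustration of why expunging alone has a limited effect on the bill, the sketch below combines the two figures quoted in this thread: data transfer at 85% of the S3 bill and 75% of the data being deleted. It assumes that everything which is not transfer is storage, that storage cost scales linearly with bytes stored, and that transfer volume stays the same after the deletion. None of these assumptions is confirmed in the thread, and the function name is hypothetical.

```python
# Upper bound on the bill reduction from deleting old releases.
# Assumptions (not confirmed in the thread): the non-transfer share of the
# bill is all storage, storage cost is linear in bytes, transfer is unchanged.

TRANSFER_SHARE = 0.85    # quoted share of the bill that is data transfer
DELETED_FRACTION = 0.75  # quoted fraction of data that would be expunged


def max_bill_reduction(transfer_share: float, deleted_fraction: float) -> float:
    """Largest fraction of the total bill that deleting data could save."""
    storage_share = 1.0 - transfer_share
    return storage_share * deleted_fraction


if __name__ == "__main__":
    saved = max_bill_reduction(TRANSFER_SHARE, DELETED_FRACTION)
    print(f"at most {saved:.2%} of the bill")
```

Under these assumptions the reduction is capped at roughly 11% of the total bill, and moving the expunged data to Glacier instead of deleting it outright would lower that further.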

edolstra commented 5 months ago

Reconstruction cannot depend on cache.nixos.org, since we're going to GC that too. In particular ISOs etc. will be deleted.

Also, our disaster recovery cannot involve running some script that doesn't exist and that would take days to run.

> What data loss risk do you actually care about if you're suggesting deleting 75% of the data?

I care about the releases that we don't expunge (and the ones that we do would be on Glacier, so we can always bring them back).

The bandwidth increase in eu-west-1 is weird, since in March 2023 it was 1832 GB ($54.96) and as recently as October 2023 it was just 1865 GB ($167.93). So the increase to 34204 GB ($2958.59) is hard to explain. Maybe there is a Fastly misconfiguration that is causing the CDN to be less effective?
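
To make the jump easier to see, here is a small sketch that derives the effective price per GB and the growth factor from the three figures quoted above. The month label for the last sample is not given in the thread, so it is labelled "latest" here, and the helper names are made up for illustration.

```python
# Effective $/GB and volume growth from the transfer figures quoted above.
# The label "latest" is a placeholder; the thread does not name that month.

SAMPLES = [
    ("March 2023", 1832, 54.96),
    ("October 2023", 1865, 167.93),
    ("latest", 34204, 2958.59),
]


def usd_per_gb(gb: float, usd: float) -> float:
    """Effective price per GB transferred."""
    return usd / gb


def growth_factor(old_gb: float, new_gb: float) -> float:
    """How many times larger the new volume is than the old one."""
    return new_gb / old_gb


if __name__ == "__main__":
    for label, gb, usd in SAMPLES:
        print(f"{label}: {gb} GB, ${usd_per_gb(gb, usd):.4f}/GB")
    october_gb, latest_gb = SAMPLES[1][1], SAMPLES[2][1]
    print(f"growth since October 2023: {growth_factor(october_gb, latest_gb):.1f}x")
```

On these numbers the volume grew about 18x after October 2023, while the effective rate went from about $0.03/GB in March to about $0.09/GB in October at nearly the same volume. The latter may mean the dollar figures do not all cover the same line items, which would be worth checking before drawing conclusions.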

delroth commented 5 months ago

> Reconstruction cannot depend on cache.nixos.org, since we're going to GC that too. In particular ISOs etc. will be deleted.

OK, but then you also don't care about reconstruction, since you're deleting the original data. Note that I still think that's a terrible idea for stuff that's linked to a channel bump (and only stuff that was a channel version at some point would be on releases.nixos.org). cc @edef1c because I was not under the impression that this was the plan

> Also, our disaster recovery cannot involve running some script that doesn't exist and that would take days to run.

I really don't see why not. Unlike the cache, releases.nixos.org is not in much of a critical path, and the only stuff that really needs to recover quickly and have high availability would be the latest version for each channel.

> Maybe there is a Fastly misconfiguration that is causing the CDN to be less effective?

Not that I can tell, and there have been no configuration changes since Sept 2023. I'm waiting for @zimbatm to provision the right AWS access for me so I can look through the Athena logs.

zimbatm commented 4 months ago

@edef1c is this still relevant?