NixOS / foundation

This is the home of the NixOS Foundation

[Short Term Strategy and Priorities] Migration of S3 Bucket Payments to Foundation #82

Closed refroni closed 4 months ago

refroni commented 1 year ago

We might have to move the cache.nixos.org S3 bucket payment to the Foundation in the near future. We need to find a sustainable way to keep it running.

zimbatm commented 1 year ago

Related discussion: https://discourse.nixos.org/t/the-nixos-foundations-call-to-action-s3-costs-require-community-support/28672/74

refroni commented 1 year ago

Below is a list of all the options that have been brought up or are being explored. Please add anything else worth discussing on this topic, including alternative options.

Thank you to joepie91 and raitobezarius for helping put this initial list together from the matrix/discourse discussions:

  1. S3 with partial/full sponsorship from AWS (sponsor dependency)
  2. S3 with "intelligent tiering" (cost reduction by automatically moving 'cold' data to glacier, AIUI), exact savings unknown with current data but likely significant
  3. Cloudflare R2: $15/TB storage plus ‘operation fees’, free traffic; possibly sponsorable
  4. Backblaze B2: $5/TB storage, $10/TB traffic, no minimum storage, supposedly free migration from S3
  5. Wasabi: $6/TB storage, free traffic up to 100%-of-data egress, 90 days minimum storage
  6. Storj: $4/TB storage plus ‘segment fee’, $7/TB traffic, no minimum storage, supposedly free migration from S3, unknown reliability of underlying 'decentralized' storage suppliers
  7. Telnyx: $2.30/TB storage plus 'operation fees', free traffic

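
To make the per-TB prices in the list above easier to compare, here is a rough sketch that turns them into a monthly bill. The stored volume and the monthly egress are placeholder assumptions, not measured figures for the bucket, and the 'operation fees' and 'segment fee' mentioned above are not modelled because no numbers are given for them. The sponsorship-dependent S3 options (1) and (2) are left out for the same reason.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    storage_usd_per_tb: float  # USD per TB-month, copied from the list above
    egress_usd_per_tb: float   # USD per TB of traffic, copied from the list above


# Prices as listed above. Operation/segment fees are NOT included.
# Wasabi's free traffic only holds while egress stays below the stored volume.
PROVIDERS = [
    Provider("Cloudflare R2", 15.0, 0.0),
    Provider("Backblaze B2", 5.0, 10.0),
    Provider("Wasabi", 6.0, 0.0),
    Provider("Storj", 4.0, 7.0),
    Provider("Telnyx", 2.30, 0.0),
]


def monthly_cost(p: Provider, stored_tb: float, egress_tb: float) -> float:
    """Storage plus traffic cost for one month, in USD."""
    return p.storage_usd_per_tb * stored_tb + p.egress_usd_per_tb * egress_tb


if __name__ == "__main__":
    STORED_TB = 500.0  # assumption: placeholder bucket size
    EGRESS_TB = 100.0  # assumption: placeholder origin traffic per month
    ranked = sorted(PROVIDERS, key=lambda p: monthly_cost(p, STORED_TB, EGRESS_TB))
    for p in ranked:
        print(f"{p.name:14s} ${monthly_cost(p, STORED_TB, EGRESS_TB):,.2f}/month")
```

The ranking is sensitive to the egress figure, so it is worth rerunning with real traffic numbers before drawing conclusions.
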
nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/the-nixos-foundations-call-to-action-s3-costs-require-community-support/28672/96

zhaofengli commented 1 year ago

Note that with Backblaze B2, there are no egress fees if the data is proxied through Cloudflare [1].

dinvlad commented 1 year ago

And also no fees via Fastly, I believe, since they're all part of the same alliance.

zimbatm commented 1 year ago

My impression is that only (1) and (3) are realistic short-term. Since egress is free in (3), we can then move laterally to a better long-term solution.

(2) can be removed as Intelligent Tiering is already turned on. EDIT: See https://github.com/NixOS/nixos-org-configurations/blob/f27bdec45066d828dba681cc0d2655a4ad8edb0e/terraform/cache.tf#L9-L16

(4) I believe B2 is mainly designed with backup scenarios in mind. It's optimized for low storage costs, for data that only occasionally needs to become available. I know that Domen tried it out for Cachix, and it wasn't reliable enough then. It might work if the narinfos are stored separately, but that requires more design.

(5), (6) and (7): we don't know how reliable those are.

fleaz commented 1 year ago

2. S3 with "intelligent tiering" (cost reduction by automatically moving 'cold' data to glacier, AIUI), exact savings unknown with current data but likely significant

Just a heads-up regarding "Intelligent tiering": simply activating it creates costs: $0.0025 per 1,000 objects per month for monitoring access frequency, and $0.01 per 1,000 objects that get moved to a different class. So depending on the access pattern of the files, this could even increase your costs compared to leaving everything in the default storage class.
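
As a rough illustration of how those two fees add up, here is a minimal sketch that takes the two figures quoted above at face value. The object counts in the example are made up; the function only computes the overhead and says nothing about the storage savings it would have to be weighed against.

```python
MONITORING_USD_PER_1K_OBJECTS = 0.0025  # per month, figure quoted above
TRANSITION_USD_PER_1K_OBJECTS = 0.01    # per move, figure quoted above


def tiering_overhead(object_count: int, objects_moved: int) -> float:
    """Monthly fee overhead in USD: monitoring for every object,
    plus a one-off charge for each object moved that month."""
    monitoring = object_count / 1000 * MONITORING_USD_PER_1K_OBJECTS
    transitions = objects_moved / 1000 * TRANSITION_USD_PER_1K_OBJECTS
    return monitoring + transitions


if __name__ == "__main__":
    # Hypothetical bucket: 1M objects, a quarter of them moved this month.
    print(round(tiering_overhead(1_000_000, 250_000), 2))
```

Because the fees scale with object count rather than bytes, a bucket with many small objects pays proportionally more overhead than one with a few large objects.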

fleaz commented 1 year ago

and it wasn't reliable enough then

What exactly do you mean by "not reliable enough"? I assume they did not lose data. If it's about performance, that can probably be ignored to a certain degree thanks to the heavy use of Fastly in front of it?

zimbatm commented 1 year ago

When building your own config, the derivations most likely don't exist in the cache, but Nix will still ask the cache, as it can't tell the difference in advance. Because it's a new hash/path, the request will always go through the CDN and hit upstream. Because Nix waits for the cache reply before deciding to build locally, the latency SLA has an impact on how fast the build happens. Because Nix will hang/retry/fail if the cache returns a 5xx response, uptime is also important. So while 90+% of the requests are cached, there is a small percentage that can never be cached and that is also important for the user experience.

If we had two backends, one for the narinfos and one for the NAR files, then we could store the NAR files in B2/Storj/... while still providing a better uptime and latency SLA for the narinfo files. It's an interesting avenue, but I don't know if we can pull that off in the short term.
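
A minimal sketch of what that split could look like at the routing layer, assuming the usual binary cache layout where narinfos live at `/<hash>.narinfo` and NAR files under `/nar/`. Both backend URLs are hypothetical placeholders, and in practice this logic would more likely live in the CDN configuration than in Python.

```python
# Hypothetical origins: a low-latency, high-uptime store for the small
# narinfo files and a cheaper bulk store for the large NAR files.
NARINFO_BACKEND = "https://narinfo-origin.example"
NAR_BACKEND = "https://nar-origin.example"


def pick_backend(path: str) -> str:
    """Choose the origin for a request path on the cache."""
    if path.startswith("/nar/"):
        # Bulk data: only fetched after a narinfo lookup has succeeded.
        return NAR_BACKEND
    # Narinfo lookups (including misses for never-built paths) and other
    # small metadata go to the backend with the stricter SLA.
    return NARINFO_BACKEND


if __name__ == "__main__":
    for path in ("/abc123.narinfo", "/nar/def456.nar.xz", "/nix-cache-info"):
        print(path, "->", pick_backend(path))
```

The point of the split is that the uncacheable misses described above only ever touch the narinfo backend, so the bulk store's latency and uptime matter much less.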

nh2 commented 1 year ago

Please add in anything that might be of interest

@refroni Since I cannot edit your post https://github.com/NixOS/foundation/issues/82#issuecomment-1575617301 directly, making edit suggestions here, perhaps you could add them:

(Edit: Discourse link for this suggestion.)

Telnyx: $2.30/TB storage plus 'operation fees', free traffic;

Also, I think there should be a post that summarises our options of making a transfer-out of the S3 data cheaper than $32k, if needed:

RaitoBezarius commented 1 year ago

Please add in anything that might be of interest

@refroni Since I cannot edit your post #82 (comment) directly, making edit suggestions here, perhaps you could add them:

  • 8. Self-host on Hetzner-dedicated+Ceph: $2.3/TB storage, $0.15/TB traffic, run by community infra team

I'd dare to say this is not very short-term actionable :P.

AmineChikhaoui commented 1 year ago

(2) can be removed as Intelligent Tiering is already turned on. EDIT: See https://github.com/NixOS/nixos-org-configurations/blob/f27bdec45066d828dba681cc0d2655a4ad8edb0e/terraform/cache.tf#L9-L16

@zimbatm That's just a lifecycle and not intelligent tiering. Intelligent tiering would monitor and automatically move across storage classes without needing a lifecycle afaik.

7c6f434c commented 1 year ago

I'd dare to say this is not very short-term actionable.

Well, there is a multi-node Ceph test in NixOS tests — is doing the same thing naively going to lose performance or reliability too? (But the very first question is to figure out the level of redundancy, sure)
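
To show why the redundancy level comes first, here is a small sketch of how much raw disk a given amount of data needs under plain replication versus erasure coding. The 500 TB figure and the specific schemes (3 replicas, 4+2 erasure coding) are illustrative assumptions, not something proposed in this thread, and the sketch ignores the free-space headroom a real cluster needs.

```python
def raw_needed_replicated(usable_tb: float, replicas: int) -> float:
    """Raw capacity needed when every object is stored `replicas` times."""
    return usable_tb * replicas


def raw_needed_erasure_coded(usable_tb: float, k: int, m: int) -> float:
    """Raw capacity needed with k data chunks plus m coding chunks."""
    return usable_tb * (k + m) / k


if __name__ == "__main__":
    data_tb = 500.0  # assumption: illustrative data volume
    print("3x replication:", raw_needed_replicated(data_tb, 3), "TB raw")
    print("EC 4+2:        ", raw_needed_erasure_coded(data_tb, 4, 2), "TB raw")
```

The chosen scheme also constrains the minimum number of machines, since chunks or replicas are normally spread across separate hosts.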

nh2 commented 1 year ago

I'd dare to say this is not very short-term actionable :P.

@RaitoBezarius Why not?

My understanding is that "short-term" means in line with the "Deadline - Aiming for July 1st" from the Discourse post.

As mentioned on Discourse, the company I co-founded uses Ceph-on-NixOS for hosting our production data.

It is very feasible to buy 500 TB as 3 Hetzner SX servers right now, enable the corresponding NixOS modules, and start transferring data to it e.g. tomorrow.

When I posted this suggestion, I had in mind that this setup, plus the transfer of the ~500 TB from S3 to this cluster, would be finished before the above deadline -- thus short-term.

So I think it makes sense to add it to the list of approaches to discuss.

RaitoBezarius commented 1 year ago

I'd dare to say this is not very short-term actionable :P.

@RaitoBezarius Why not?

My understanding is that "short-term" means in line with the "Deadline - Aiming for July 1st" from the Discourse post.

As mentioned on Discourse, the company I co-founded uses Ceph-on-NixOS for hosting our production data.

It is very feasible to buy 500 TB as 3 Hetzner SX servers right now, enable the corresponding NixOS modules, and start transferring data to it e.g. tomorrow.

When I posted this suggestion, I had in mind that this setup, plus the transfer of the ~500 TB from S3 to this cluster, would be finished before the above deadline -- thus short-term.

So I think it makes sense to add it to the list of approaches to discuss.

I mean, would you explicitly join the sysadmin efforts to maintain such a cluster on the long term? If so, yes, this is a valid short term proposal.

But the current infra team cannot take this load.

nh2 commented 1 year ago

would you explicitly join the sysadmin efforts to maintain such a cluster on the long term? If so, yes, this is a valid short term proposal.

But the current infra team cannot take this load.

Yes, I would join those efforts, provided that there is a reasonable number of co-sysadmins to share maintenance with, so that the load on each individual is low. I would also be happy to share my existing knowledge regarding setup and Ceph operations.

Of course there could also be the option to spend some of the current $9k/month to pay a sysadmin for e.g. a few hours per month, or the twice-yearly NixOS upgrades.

refroni commented 1 year ago

Adding this to the general options for review as well. I think it is fair to assume that this is also something we would want to consider as a longer-term option, which can be explored further with the Infra team.

endgame commented 1 year ago

(2) can be removed as Intelligent Tiering is already turned on. EDIT: See https://github.com/NixOS/nixos-org-configurations/blob/f27bdec45066d828dba681cc0d2655a4ad8edb0e/terraform/cache.tf#L9-L16

That lifecycle rule is not moving objects into Intelligent Tiering, it's moving them to the Standard - Infrequent Access storage class. But it is probably getting most of the benefit of Intelligent Tiering without the automation charges. As I said on Discourse:

Every time I've looked at S3 Intelligent Tiering in my own work, the $0.0025 per 1,000 objects automation fee makes me nervous. According to @edolstra, there are 667M objects in the cache.nixos.org bucket, so you'd be paying $1667.50/month in automation fees, and 3/4 of the bucket is already in the Infrequent Access tier by some mechanism or other. So Intelligent Tiering needs to move a lot of stuff to smarter storage classes to come out ahead (or we only turn it on for large NARs, or something).

It is possible that we'd get some cost savings by moving even older stuff into Glacier Instant Retrieval, but I'm not sure whether that's going to immediately bite us when we want to move the entire bucket somewhere else.
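
For a sense of scale, here is a minimal break-even sketch: how much data Intelligent Tiering would have to shift into a cheaper class just to pay for its own automation fee. The object count and fee are the figures quoted above; the per-GB saving is a made-up placeholder, since the real value depends on which classes the data moves between.

```python
AUTOMATION_USD_PER_1K_OBJECTS = 0.0025  # per month, figure quoted above


def automation_fee(object_count: int) -> float:
    """Monthly Intelligent Tiering automation fee in USD."""
    return object_count / 1000 * AUTOMATION_USD_PER_1K_OBJECTS


def break_even_tb(object_count: int, saving_usd_per_gb_month: float) -> float:
    """TB (1 TB = 1000 GB) that must sit in the cheaper class for the
    storage saving to equal the automation fee."""
    gb = automation_fee(object_count) / saving_usd_per_gb_month
    return gb / 1000


if __name__ == "__main__":
    objects = 667_000_000       # object count quoted above
    saving = 0.01               # assumption: hypothetical USD saved per GB-month
    print(f"fee:        ${automation_fee(objects):,.2f}/month")
    print(f"break-even: {break_even_tb(objects, saving):,.2f} TB moved")
```

Anything already covered by the existing lifecycle rule would not count towards that break-even volume, which is why the margin looks thin.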

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/s3-update-and-recap-of-community-call/28942/1

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-s3-short-term-resolution/29413/1

thufschmitt commented 4 months ago

Closing in favor of https://github.com/NixOS/foundation/issues/86 since the “short term” is taken care of