NixOS / infra

NixOS configurations for nixos.org and its servers

cache.nixos.org: fastly<->s3 throttled? #212

Open srhb opened 2 years ago

srhb commented 2 years ago

Affected service cache.nixos.org

Describe the issue For, I think, a few weeks now, I've noticed very poor download speeds (~1-2MB/s max) on what I will describe as "fresh paths" in the NixOS cache -- paths that I expect have not recently been fetched by anyone else (in my region?). After a download, an immediate redownload of the same path is extremely fast, like it used to be (~100MB/s). This makes me suspect the problem is "behind" the Fastly cache, which I assume is S3.

My test to verify it wasn't a Nix issue was a direct wget (below) of openjdk11, but because of the nature of the problem, you may have to find another arbitrary huge path to test with that hasn't been warmed recently (in your region?):

wget https://cache.nixos.org/nar/1bd1ji8wghy39rnk5xsb56vw81nbq961rcm42lxid5j4iga9pivc.nar.xz
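For convenience, here is a rough sketch of the same test that fetches the URL twice, to contrast the cold ("fresh path") fetch against an immediate warm re-fetch (exact speeds will of course vary; the URL is just the openjdk11 example above):

```bash
# Fetch the same NAR twice and compare the speeds wget reports.
url="https://cache.nixos.org/nar/1bd1ji8wghy39rnk5xsb56vw81nbq961rcm42lxid5j4iga9pivc.nar.xz"
wget -O /dev/null "$url"   # cold fetch: ~1-2 MB/s in the bad case described above
wget -O /dev/null "$url"   # immediate re-fetch: should be fast (~100 MB/s) from the edge cache
```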
SuperSandro2000 commented 2 years ago

Noticed the same lately

mweinelt commented 2 years ago

The firefox-unwrapped package is well cached for a comparison:

wget https://cache.nixos.org/nar/17vh4y9w3xnwibw6a11bbpjd8zhdfkfndvjl3g8bx92pwrbsd25h.nar.xz

Getting 4-5 MB/s uncached and 24 MB/s cached.

nixos-discourse commented 2 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/slow-cache-nixos-org/20131/7

domenkozar commented 2 years ago

cc @thoughtpolice

edolstra commented 2 years ago

Maybe https://github.com/NixOS/nixos-org-configurations/commit/5e0fb0146b6ba4f053d452c1c3e0e28ac06ee443 had some unintended performance consequences? That was on June 23rd.

Edit: that may well be the reason because previously initial downloads were done from a location close to us-east-1. I can try reverting that change...

vcunat commented 2 years ago

At a quick glance, shielding could also ~increase~ decrease traffic to S3 significantly.

edolstra commented 2 years ago

My understanding is that it only increases traffic in the case where two users start downloading a file at the same time, which is probably not extremely common.

wahjava commented 2 years ago

the speeds are still terrible:

https://asciinema.org/a/u9gzwyK1JG1rh9fnwDiLymsZ6 (from 2 days earlier)

https://asciinema.org/a/y58Gulk1W1PEswPSEjSCCEJSP (now)

Could someone maybe file a ticket with Fastly? It's possible they have some network congestion on the path to the origin (S3?). My 0.02 XDR.

Thanks in advance

thoughtpolice commented 2 years ago

I no longer work for Fastly, and have not for quite a while, but I can provide some advice:

@edolstra Shielding is absolutely the reason the latency is good, and removing it makes the latency bad. I know this, because I personally advised Graham long ago to enable shielding when I was on a Zoom call with him — at the time people were having other issues but I encouraged it as a quick performance boost, which it was — and we saw the effects, and I measured them. I know for a definitive fact this will not change anytime soon, either.

Shielding does not add latency; it significantly reduces it due to Fastly's network architecture. This is not well understood, to be fair, but it's common knowledge for advanced users of the platform (and it is public knowledge, too, just not well documented or explained).

There are two things happening when you enable shielding:

Let's assume that the S3 bucket is in us-east-1, and that the shield POP is nearby in IAD or whatever. The round-trip time for a single packet is, let's say, 15ms. Then a TCP+TLS handshake, which is roughly 3x RTT, comes to 45ms. So you can assume the latency of every connection (OK, we'll ignore HTTP/2 support on the origin here for simplicity) from the shield to the origin is 45ms, at minimum.

So let's say a user is in China. They connect via Singapore ("edge POP"). Packet RTT of ~20ms from client -> edge. So a single handshake for a single narinfo, without taking anything else into account, is at minimum ~60ms, starting from China. Add the 45ms from the shield-origin link, and we get 105ms minimum latency. Finally, take the time between the edge and the shield: let's say it's 200ms RTT to cross the ocean. Then the total time would be 3x15 + 3x20 + 3x200 = 45 + 60 + 600 = 705ms/req latency. This is extremely bad already, but it is magnified in the case of a Nix binary cache, because a binary cache fundamentally requires head-of-line blocking as it traverses the dependency graph looking at each .narinfo file in order to find the needed dependencies. The only download parallelism that is possible is through the "fan out" that comes from traversing the References field. It is also the worst case of a very small file being fetched for a comparatively large amount of overhead; last time I ran numbers, the ratio of "average narinfo filesize" vs "overhead of average TCP+TLS handshake" was surprising, if I remember right.

Now let's say there was no shield; the system instead just connected from Singapore directly to the origin; maybe with extra added latency for the extra few hundred miles, but no extra set of 3x roundtrips; say the last hop is 220ms RTT instead. Then you have 3x20 + 3x220 = 720ms/req. Which is no better, right? So why bother with shielding? But there are two advantages to the shielding setup, one of which is obvious, and the other less obvious: the cache in front of S3 means you hit the S3 origin less often, so you pay less money, which is always good. There is also the advantage that a single cache supplies cache hits for the entire globe over time; so if one person in Singapore and one person in London both download the same object from the cache, 5 hours apart — only one of those goes to the origin, even though they were completely separate requests in space and time. It's significant for systems like S3 which also tend to rate limit clients rather aggressively, in my experience. Combined, these two facts mean you significantly boost your real, honest-to-God cache hit ratio. You simply talk to S3 less, over time, from everywhere on the globe.
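As a rough way to see where a given request was actually served from, you can look at the response headers; Fastly services typically expose an X-Cache header (with two comma-separated values when a shield is involved) and an X-Served-By header naming the POP(s). Header names and exact semantics depend on the service configuration, so treat this purely as a sketch:

```bash
# Inspect cache-related response headers for a NAR (example path from the first comment).
url="https://cache.nixos.org/nar/1bd1ji8wghy39rnk5xsb56vw81nbq961rcm42lxid5j4iga9pivc.nar.xz"
curl -sI "$url" | grep -iE '^(x-cache|x-served-by|age):'
```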

The reductions in origin traffic can, without exaggeration, be absolutely massive in some cases. Think about something like GitHub Actions: the runners are continuously running from potentially all over the globe. They are constantly requesting many similar artifacts, things like nixpkgs-unstable, and those derivations in turn will have some extremely common and stable ancestor derivations that are widely needed. Think glibc + glibcLocales, which CI systems all over the world re-download copies of every single time they start up; but it's a derivation that rarely changes until a staging merge occurs. (If this CI system pins the nixpkgs version through a locked flake, it will change even less frequently than that.) You could in theory shove that object into the cache on first miss, and set a 40+ day expiry time. You'd literally only do an origin fetch once every 40 days. That's a very, very big amount of traffic savings when taken in aggregate. And there are probably dozens of "stable" but very commonly requested expressions like this (linux-headers, aws-sdk, etc...)
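If one wanted to experiment with something like that, one knob is the Cache-Control header on the origin object, since CDN caches generally respect max-age/s-maxage from the origin unless overridden. A hedged sketch only; the bucket name and key here are made up, and the real cache.nixos.org upload pipeline may set headers differently:

```bash
# Upload a NAR with a ~40-day max-age so downstream caches are allowed to hold it that long.
aws s3 cp ./example.nar.xz s3://example-nix-cache/nar/example.nar.xz \
  --cache-control "public, max-age=3456000" \
  --content-type "application/x-xz"
```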

So where does the latency part come in? Because the network architecture makes an extremely big optimization: the "TCP round trip" between the edge POP and the shield POP I described does not exist. Every data center keeps prewarmed TCP connections to every other data center — precisely for the shielding case, where a Fastly server is simply going to talk to another Fastly server somewhere far away. This means that shielding not only improves cache hit ratio dramatically, it also slashes latency for globally distant users to a large degree. (It might not surprise you to learn that "I need a server to sit in front of an S3 bucket, to reduce costs and serve files faster" is an extremely, extremely common use case for Fastly's clientele; it's a very well understood problem.)

So the actual latency for a request in the above scenario, from China, to Singapore, to IAD, to us-east-1, is only 3x15ms + 3x20ms + 1x200ms = 305ms latency in this hypothetical scenario (the 1x200ms is because a packet still has to actually cross the ocean.) It will potentially be less than this in many cases, thanks to HTTP/2 multiplexing between the origin and the shield eliminating the 3x15ms RTT, and client->edge HTTP/2 multiplexing eliminating the 3x20ms after opening. So in theory you'll get as low as 20ms + 15ms + 200ms = 235ms for cross-globe communications, after warm up. This is literally faster than any alternative scenario you can think up, outside of, I don't know, weird 0RTT tricks with TLS 1.3 or whatever (maybe, if that's safe, don't ask me, I don't know?), and it's probably approaching the realistic limits of c trying to race across the globe if I had to guess. Someone else can do the math.
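To put rough numbers on the handshake vs. transfer components from your own vantage point, curl's timing variables are handy; comparing a cold path against an immediate re-request shows how much of the total is connection setup versus waiting on the shield/origin. Just a measurement sketch, nothing authoritative:

```bash
# Break one request's latency into DNS / TCP / TLS / first-byte / total.
url="https://cache.nixos.org/nar/1bd1ji8wghy39rnk5xsb56vw81nbq961rcm42lxid5j4iga9pivc.nar.xz"
curl -o /dev/null -s -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' "$url"
```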

I don't know what other problems might be arising; the global internet backbone is hostile, fickle, and poorly understood — even for well-connected engineers at major service providers. Again: I do not work for Fastly anymore, and am not privy to anything in that sphere any longer. If they have information, though, I know they'd provide it; the customer support is genuinely very good, even for Open Source projects. Don't be afraid to ask them a bunch of questions if you have reason to believe they can help (and the more evidence you can provide, the better.)

Moral of the story: do not ever disable shielding for the Nix cache. It will always be a win and it will never be a loss[^1]. In fact shielding alone is a good enough reason to continue using Varnish Configuration Language; at the time I worked there the Serverless WebAssembly offering did not offer an equivalent to shielding (you'd have to build it yourself.)

[^1]: There is exactly 1 scenario that is screwed up by shielding, and that's the calculation of the global cache hit ratio, because it will potentially count a single request as both a miss (on edge) and a hit (on shield). But let's be honest, nobody here is sitting around optimizing the hit ratio on cache.nixos.org, and I'm not interested in spending too much time doing it myself, so it's a moot point. And besides, you can fix this one yourself by using Fastly's log pipeline to ship custom logs to a BI tool of your choice, like ClickHouse, where you can slice and dice the hit ratio in whatever way you desire.

thoughtpolice commented 2 years ago

Also, the other Giant Elephant in the room that needs checking is whether Streaming Miss is enabled: https://docs.fastly.com/en/guides/streaming-miss

This is another extremely important feature that is much, much more self-explanatory, and it will dramatically reduce the TTFB for large objects; for instance, a 500MB nar will begin streaming bytes to the client instantly once the origin responds, instead of first waiting to populate the shield cache by downloading all 500MB and then moving forward. It's effectively another way of improving latency and reducing HOL blocking.

I honestly can't remember if streaming miss is enabled in the cache.nixos.org configuration, though. It's a huge boost in my experience for Nix caches with moderately sized objects; even things in the 20MB range have a perceptible latency reduction, and it's visible just from watching the terminal UX progress.
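A rough way to sanity-check this from the outside: on a cold, large NAR, time-to-first-byte should be far smaller than the total transfer time if streaming miss is working, whereas a shield that buffers the whole object first would push TTFB up towards the full download time. Sketch only; pick a path you believe is cold (the URL below is just the example from earlier comments):

```bash
# Compare time-to-first-byte with total transfer time on a (hopefully) cold large object.
url="https://cache.nixos.org/nar/1bd1ji8wghy39rnk5xsb56vw81nbq961rcm42lxid5j4iga9pivc.nar.xz"
curl -o /dev/null -s -w 'ttfb=%{time_starttransfer}s total=%{time_total}s size=%{size_download} bytes\n' "$url"
```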

zimbatm commented 2 years ago

Thanks, I loved reading the detailed explanation. Also deployed 36d64bc5ab5e3953ccfbb6f85dc1228bdafed127, which enables streaming-miss on cache.nixos.org.

delroth commented 1 year ago

I encountered this issue again today while downloading the latest installer image (I had the bad luck of doing so about 5min after a channel update, and I think I was the first person to download the minimal ISO...).

I strongly suspect there's some artificial throttling / bandwidth limiting going on between Fastly and S3. The streaming-miss download speed was exactly 1MB/s and aligned to 1MB-sized chunks; it's very visible in e.g. wget output. It took me 10min to download the install ISO on a connection that should be able to do so in like 10s.

See the attached screen recording of wget; I refuse to believe that the exact round values are a coincidence caused by latency effects as explained by @thoughtpolice :-)

https://user-images.githubusercontent.com/202798/197495487-f8c0dcb8-d753-4ea4-b7e9-faaf02ae9585.mp4
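For anyone trying to reproduce this, one low-tech way to watch the transfer rate over time (rather than just the average) is to pipe the download through pv; a rate pinned at a suspiciously round value like exactly 1 MB/s on a cold path would match the symptom above. This assumes pv is installed, and the URL is just the example from the first comment:

```bash
# Watch instantaneous throughput while downloading a (cold) NAR.
url="https://cache.nixos.org/nar/1bd1ji8wghy39rnk5xsb56vw81nbq961rcm42lxid5j4iga9pivc.nar.xz"
curl -s "$url" | pv > /dev/null
```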

nixinator commented 1 year ago

> the global internet backbone is hostile, fickle, and poorly understood

@thoughtpolice, that was an amazing explanation. I'm trying to fix the internet, because reading your comment reminds me how complex and broken it has become.

One day stuff will be distributed on the Nix network... ;-)

If LoL can do it, so can we.

https://technology.riotgames.com/news/fixing-internet-real-time-applications-part-ii

@thoughtpolice, join me, I'm going to need your skillzzzzz... do you know Haskell?

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/the-nixos-foundations-call-to-action-s3-costs-require-community-support/28672/59

nixos-discourse commented 10 months ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-11-14-re-long-term-s3-cache-solutions-meeting-minutes-4/35632/1

delroth commented 8 months ago

Reopening this because it's definitely not fixed, even though we don't really have a good way forward to address this problem.

nixinator commented 8 months ago

> it's very visible in e.g. wget output. It took me 10min to download the install ISO on a connection that should be able to do so in like 10s.

What is your connection, and who is your ISP? There is a ton of infra between you and the server you're hitting... in cases like this, the surprising thing is not that it's slow, it's that it WORKS AT ALL.

Reproducing these problems is almost impossible because of the complexity of a traditional global CDN.

I think mirroring can go a long way towards solving these problems: with good mirrors in every major IXP location on earth, you can simply have an HTTP server with very little 'middleware' in between.

I've been trying to tell programmers that networks are more reliable when they are simple, have few moving parts, and have automation, but not too much automation. The answer from @thoughtpolice is amazing, but you can see the complexity of running a global CDN; it's not easy. The more end-to-end you can make a network, the better it is. Too much middleware, with 'clever' boxes and code sitting between you and the data, is never a good thing...

However, what's simpler than a single server running nginx, connected to a 400G fibre line? Put one next to every major IXP on earth. Job done!

The internet is now at a stage where scaling up and reducing complexity are both possible. The whole networking industry is stagnant, but that is another topic for another day.

But alas, the world we live in wants complexity... so we're stuck with unreproducible networks...

But maybe one day I'll get my way.

nixinator commented 8 months ago

@delroth, I'm going to raise your moderation of this with the community team. This is quite clearly bullying.

and I HAVE HAD ENOUGH of you.

delroth commented 5 months ago

409 fixed a fairly obvious issue that we had with the 2-layer caching setup specifically for releases.nixos.org. If someone still manages to reproduce the throttling problem (see https://github.com/NixOS/infra/issues/212#issuecomment-1288732617 above for some very clear symptoms to try and match against) it would be useful to report back here!