canonical / maas.io

Site for maas.io
https://maas.io/
GNU Lesser General Public License v3.0

Docs pages take seconds to render #559

Closed sparkiegeek closed 2 years ago

sparkiegeek commented 3 years ago

Summary

The docs pages take too long to render, ~4s for the index

Process

Visit https://maas.io/docs/ with devtools open or otherwise time how long it takes to load the page

Current and expected result

It takes 3-4s, and should be less than a second

Screenshot

https://github.com/canonical-web-and-design/upptime/commits/HEAD/history/maas-io-docs.yml

anthonydillon commented 3 years ago

Do you get this on every hit? Content-cache should be serving this without a round trip. Unless you are the first visitor in a long time.

@nottrobin any ideas here?

nottrobin commented 3 years ago

Yeah we should look into the traffic on that page. We need to bear in mind:

Maybe we should just play with extending the cache time?

sparkiegeek commented 3 years ago

Given we can be precise about when the content changes (Discourse edit time), can't we cache for days?

sparkiegeek commented 3 years ago

Do you get this on every hit? Content-cache should be serving this without a round trip. Unless you are the first visitor in a long time.

https://canonical-web-and-design.github.io/upptime/history/maas-io-docs seems clear...

nottrobin commented 3 years ago

@sparkiegeek Yes, using the modified time to generate an ETag might help us, but it might not. For us to send a 304 Not Modified response we still have to make a call to Discourse to find out whether anything has changed, which is likely to have similar latency to requesting and returning the page itself. The only thing you really save is the time spent downloading the actual content, which isn't that significant.
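To make the trade-off concrete, here is a minimal sketch of that conditional-request flow in a plain Flask view. The `fetch_topic` helper and the route are hypothetical stand-ins, not the actual maas.io code; the point is that validating the ETag still needs the Discourse round trip.

```python
import hashlib

from flask import Flask, make_response, request

app = Flask(__name__)


def fetch_topic(path):
    """Hypothetical helper: ask Discourse for this page.

    Returns (rendered_html, last_edited_iso_string); the real helper differs.
    """
    return f"<h1>{path}</h1>", "2021-02-01T12:00:00Z"


@app.route("/docs/<path:path>")
def docs_page(path):
    html, edited_at = fetch_topic(path)  # the Discourse call happens regardless
    etag = hashlib.sha1(edited_at.encode()).hexdigest()

    if request.if_none_match.contains(etag):
        # Saves transferring the body, but not the Discourse latency above
        return "", 304

    response = make_response(html)
    response.set_etag(etag)
    return response
```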

So the best way to make things really snappy is to tell the browser & cache that they don't even have to check; they can just return the cached version.

stale-while-revalidate is a great help here - at the moment our max-age is 5 minutes and our stale-while-revalidate is 6 minutes. So any visitor in that 6th minute will get the cached copy quickly while at the same time triggering a cache refresh in the background. I had thought that most of our pages would have enough traffic to always receive a visitor in that 6th minute (for each cache variation), but it looks like that was naive.

I'm now thinking the best strategy might be to make max-age much shorter and stale-while-revalidate much longer. My thinking is that we might as well always be using the opportunity of a visitor arriving to refresh the cache, so max-age really just needs to be long enough to protect the application from being overloaded - maybe 10 seconds? stale-while-revalidate can probably safely be quite long, as hopefully anyone seeing stale content would at least try refreshing, at which point they'd get fresh content.

I think we should look into traffic on docs pages and try to work out how many varieties of accept-encoding we get to arrive at a workable value for stale-while-revalidate.

@tbille you might find this interesting.

nottrobin commented 3 years ago

Discussing with @pmahnke, it looks like maas.io/docs probably sometimes goes a few hours without a visit. So if we wanted to provide the best experience on this page stale-while-revalidate should be at least that long. Other docs pages will go far longer, but there's presumably a limit to how long we can make the cache.

We're thinking maybe make stale-while-revalidate 1 day and max-age 10 seconds, in flask-base so it becomes the default everywhere.
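For reference, a minimal sketch of what that default could look like in a plain Flask app; the real change would land in flask-base, and the hook shown here is only illustrative.

```python
from flask import Flask

app = Flask(__name__)


@app.after_request
def set_default_cache_headers(response):
    # Proposed defaults: a short max-age acts as micro-caching to protect the
    # app, while a long stale-while-revalidate (1 day) lets visitors get the
    # cached copy quickly while the cache refreshes in the background.
    if "Cache-Control" not in response.headers:
        response.headers["Cache-Control"] = "max-age=10, stale-while-revalidate=86400"
    return response
```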

Why make max-age so short? Well, given that we have stale-while-revalidate for responsiveness, max-age's purpose is effectively micro-caching to protect the application from too much load. Making it too long could cost us; consider the following scenario:

Someone is going to make a change to a specific rarely-visited document (let's say, visited once an hour) in MAAS docs, so they:

  1. First visit the document page to see what it looks like (refreshes the cache at this point)
  2. Update the content in Discourse
  3. Reload the page to see their changes
  4. When they don't see their changes (because of stale-while-revalidate) they reload to see if their content appears

If max-age is just a few seconds then the time between 1. and 3. (maybe as short as 30 seconds?) will be long enough for it to have expired. This means that the actor above will cause the cache to refresh, so any future visitors in the next day will see the fresh content. However, if max-age were, say, 5 minutes, then the above actor would presumably give up waiting and accept that the cache will update at some point. This would leave the cache stale for the next day, so the first real visitor who comes to it might see stale content.

Any thoughts on this proposal?

nottrobin commented 3 years ago

Maybe 10 seconds is too short? If we consider that the number of requests we then get to the application will be:

( 1 visitor * 4 cache nodes * 10? "encoding" variations * 500 pages on the site ) / 10 second window

Then that's 2000 hits a second. That's a lot. Maybe max-age needs to be 1 minute?

nottrobin commented 3 years ago

I just checked, and both Firefox 85 and Chrome 88 request pages with exactly the same Accept-Encoding header, with the values in the same order:

Accept-Encoding: gzip, deflate, br

So hopefully the vast majority of our traffic gets to use a single cache per page per node.

nottrobin commented 3 years ago

We gathered some pretty nice rich stats on the various accept-encoding strings on snapcraft.io, which is presumably gonna be somewhat representative of traffic on most of our sites.

[chart: counts of the various Accept-Encoding strings seen in snapcraft.io traffic]

None (presumably from curl and similar) is the most common. Next is, as expected, "gzip, deflate, br" at about half as much traffic as None. Presumably this represents almost all browsers. This is what's really interesting to us.

Everything else is basically negligible, but there's a long tail of lots of other accept-encoding strings that come in occasionally, with some having tiny spikes every now and then. Of these, "gzip" and "br,gzip" (not to be confused with "br, gzip") were the most common. At its peak, br,gzip rose to around 10% of the browser traffic, but this was short-lived, so can probably be largely disregarded.

(Soon @tbille will help me match browser names to the accept-encoding strings, so we might be able to see who's responsible for gzip and br,gzip)

Long story short, we don't really need to worry about accept-encoding variations. Almost all browser traffic will be using the same cache variation.

Which means we can update our formula above:

( 1 visitor * 4 cache nodes * 1 "encoding" variation * 500 pages on the site ) / 10 second window

Giving us only 200 requests per second, which sounds much more manageable. If we extended this window to 60 seconds, that would be a maximum of around 33 requests per second, which is probably fine, especially when split between the 5 app units we normally have (7-ish per unit). Bear in mind this is a worst-case scenario where a) we have 500 pages on a site and b) traffic is distributed well enough for all 500 pages to receive requests within the cache window.
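As a sanity check, the estimates above can be reproduced with some quick arithmetic (the inputs are the assumptions from this thread, not measurements):

```python
def peak_requests_per_second(cache_nodes, encoding_variations, pages, window_seconds):
    # Worst case: every page, on every cache node, for every encoding
    # variation, expires and is re-requested within the same window.
    return cache_nodes * encoding_variations * pages / window_seconds


print(peak_requests_per_second(4, 10, 500, 10))  # 2000.0 - original estimate
print(peak_requests_per_second(4, 1, 500, 10))   # 200.0  - one encoding variation
print(peak_requests_per_second(4, 1, 500, 60))   # ~33.3  - with a 60s max-age
```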

However, as Joel Sing pointed out in our IS meeting, we still risk a thundering herd problem: once max-age has passed, it takes the cache some time (e.g. 3 seconds) to refresh its copy, and during that window the cache could pass every incoming request through to our app at once. If this happened, we could see significant load during the cache refresh. This can be mitigated with the proxy_cache_lock directive, so IS are looking into the cache-lock setting for us. We'll try changing the headers once we hear back.
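For illustration only, this is the general idea behind a cache lock - a toy in-process sketch, not nginx's proxy_cache_lock or the content-cache implementation: a single request refreshes an expired entry while the rest keep serving the stale copy.

```python
import threading

cache = {}           # path -> {"body": ..., "fresh": bool}
refresh_locks = {}   # path -> lock guarding the refresh
registry_guard = threading.Lock()


def get_page(path, fetch_upstream):
    entry = cache.get(path)
    if entry and entry["fresh"]:
        return entry["body"]

    with registry_guard:
        lock = refresh_locks.setdefault(path, threading.Lock())

    if lock.acquire(blocking=False):
        try:
            # Only one request per path hits the application...
            body = fetch_upstream(path)
            cache[path] = {"body": body, "fresh": True}
            return body
        finally:
            lock.release()

    # ...while concurrent requests keep serving the stale copy if there is one.
    return entry["body"] if entry else "please retry shortly"
```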

cristinadresch commented 3 years ago

@nottrobin what is the latest on this one, is it still being worked on?

nottrobin commented 3 years ago

Oh yes. There's been a lot of work on this, back and forth with IS etc. I'll be scheduling a task to update the caching rules on flask-base when I get a chance.

cristinadresch commented 3 years ago

Great, thanks for the update @nottrobin

nottrobin commented 3 years ago

The released solution is not as effective as it should be because the content-cache appears to delete cache entries after 10 minutes of inactivity. A 10-minute cache (as opposed to the 1 day that we're asking for) might be long enough to help maas.io/docs, but probably isn't enough for child pages.

I've filed https://portal.admin.canonical.com/C130433/ to get this limit increased.

nottrobin commented 3 years ago

RT resolved, and docs pages seem significantly faster from a little browse around. I think the fix is working, so closing this. Please feel free to reopen or file a new issue if you still think this is a problem.

antongisli commented 3 years ago

Hi, the docs section is still performing very poorly. I get up to 20 seconds for some pages to load or change content in my browser. Subsequent visits are much faster. Assuming my experience is typical, this will give new people exploring MAAS and reading the docs a very poor first impression. I think we need to re-open this ticket.

E.g. check this: https://developers.google.com/speed/pagespeed/insights/?url=https%3A%2F%2Fmaas.io%2Fdocs%2Fsnap%2F3.0%2Fui%2Fpower-management (not sure if caching helps or not, or if it is cached).

[screenshot: PageSpeed Insights report for the maas.io docs page linked above]

pmahnke commented 3 years ago

One obvious thing to note is that images on Discourse are not cached at all; perhaps IS can fix that fairly easily. Or could we prefix the image URLs with the Cloudinary URL? Also, could we add loading="lazy" to images?

There are many other issues with the page itself (it's long, has big images and lots of DOM elements) that the web team can't easily fix.

On Fri, 1 Oct 2021 at 17:22, Jeff Pihach wrote:

Reopened #559 https://github.com/canonical-web-and-design/maas.io/issues/559.

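As a rough sketch of those two suggestions, the Discourse HTML could be rewritten before rendering. This is only illustrative: the Cloudinary fetch prefix and the post-processing hook below are assumptions, not the site's actual pipeline.

```python
from bs4 import BeautifulSoup

# Illustrative prefix for a fetch-style CDN; the real account/URL would differ.
CLOUDINARY_PREFIX = "https://res.cloudinary.com/EXAMPLE/image/fetch/"


def rewrite_images(html):
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        img["loading"] = "lazy"  # defer offscreen images
        src = img.get("src", "")
        if src.startswith("http"):
            # Serve Discourse-hosted images through a caching CDN instead
            img["src"] = CLOUDINARY_PREFIX + src
    return str(soup)
```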

antongisli commented 3 years ago

Hi @pmahnke - I'm not certain what we should try to do about this, but I think 20-second load times are bad even for a poorly optimised website. Could we try to get the IS changes you mentioned implemented? How do we go about that?

nottrobin commented 3 years ago

I don't think PageSpeed Insights is very relevant here. It might complain about things, but most of what it complains about and suggests is at the margins. The core problem we're talking about is the initial page response sometimes taking a long time. This is purely about the cache, and we really should ignore any other concerns about images or whatever - they're nothing compared to the basic page load. Cached pages actually feel very snappy.

Most of the pages in the primary nav load very fast, because they're cached. Some of them don't. The pages that don't are clearly the less popular ones, which is why they get purged from the cache. In theory this should mean they haven't been visited in the last day on that particular cache node, which is some pretty low traffic. It's probably worth looking into whether this is definitely true and things aren't being expired from the cache early for some reason.

This problem is also exacerbated by the fact that docs are split by UI/CLI and version. Basically all the pages for 2.9 aren't cached, because no-one visits them.

I don't know whether to explore extending the cache timeout even longer - e.g. to 1 week or so. That would probably help.

But as regards speeding up the uncached page loads themselves - at one point we had a ticket requesting that IS add a cache in front of our API requests to Discourse. I'll try to find out what happened with that and update here. It's also worth dissecting what API calls we're actually making and why it all takes quite so long. These delays seem a fair bit longer than I would expect based on what these pages are actually doing.
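One quick way to start that dissection is to time the Discourse API calls directly. The endpoints below are only examples of the kind of calls involved; the real list would need to come from the app's own requests.

```python
import time

import requests

# Hypothetical examples; topic IDs and endpoints the app actually hits may differ.
urls = [
    "https://discourse.maas.io/t/25.json",
    "https://discourse.maas.io/site.json",
]

for url in urls:
    start = time.time()
    response = requests.get(url, timeout=30)
    print(f"{response.status_code} {url} - {time.time() - start:.2f}s")
```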

hatched commented 3 years ago

These pages are still taking just under 20s to load. The fact that others are seeing the exact same timing leads me to believe that there is a timeout that's triggered just before the 20s mark which 'releases' the webpage so I don't believe that this is a caching issue.
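A small, hedged way to test that hypothesis: time the same docs page a few times and see whether the slow responses cluster just under 20 seconds. The URL is the page reported as slow above; after the first hit the content cache will probably have warmed, so only the early attempts should be slow.

```python
import time

import requests

URL = "https://maas.io/docs/snap/3.0/ui/power-management"

for attempt in range(1, 6):
    start = time.time()
    response = requests.get(URL, timeout=60)
    print(f"attempt {attempt}: {response.status_code} in {time.time() - start:.1f}s")
```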

nottrobin commented 3 years ago

These pages are still taking just under 20s to load. The fact that others are seeing the exact same timing leads me to believe that there is a timeout that's triggered just before the 20s mark which 'releases' the webpage

Yeah, this seems quite plausible, interesting to investigate.

so I don't believe that this is a caching issue.

I don't understand what you mean by this. Pages that are in the cache do not take 20s. The 20s pages are pages that aren't in the cache. We can try to investigate why these pages take 20s, that's definitely valuable. But the cache speeds up the pages it can, and so we can also probably improve things by understanding why these pages aren't in the cache and tweaking the caching settings.

hatched commented 3 years ago

Sorry, what I meant was that the original 20s issue doesn't appear to be a cache issue: even with a cache miss, the page should still only take a short amount of time to pull.

spads-spads commented 3 years ago

It's been confirmed that this causes the link checker to DoS the web team's k8s installation once per day, which is pretty serious.

sparkiegeek commented 3 years ago

https://github.com/canonical-web-and-design/maas.io/pull/637 should address this (albeit temporarily)