
server side caching on ubuntu.com #8687

Open tbille opened 3 years ago

tbille commented 3 years ago

original comment https://github.com/canonical-web-and-design/ubuntu.com/issues/8686#issuecomment-722954230

Hello,

This is a consequence of another issue, which is more of a "design" one: some pages on www.ubuntu.com hit discourse.ubuntu.com to download JSON files.

A fix was committed yesterday (#8675) to avoid hitting the rate limit, but the issue remains that if discourse.ubuntu.com goes down, several pages will reply with a 500 to the end user. Content-cache won't help here: when it tries to revalidate the cache it will see a 500 (because Flask will have gotten a 500 from Discourse), and the end user will end up with an error.

I think we need some sort of cache for every request made to discourse.ubuntu.com, falling back to the last known working copy on the Flask side.
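
For illustration, a minimal sketch of that fallback on the Flask side (the in-memory store and function name are hypothetical; a real version would need expiry and a store shared across workers):

```python
import requests

_last_good = {}  # url -> last successfully fetched JSON (hypothetical in-memory store)

def get_discourse_json(url, timeout=5):
    """Fetch JSON from Discourse, falling back to the last known working copy."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        data = response.json()
        _last_good[url] = data  # remember the last known working copy
        return data
    except requests.RequestException:
        if url in _last_good:
            return _last_good[url]  # stale, but better than a 500
        raise  # nothing cached yet; let the caller decide what to render
```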

What do you think?

nottrobin commented 3 years ago

(copied over from https://github.com/canonical-web-and-design/ubuntu.com/issues/8686#issuecomment-722966598)

Content-cache won't help here, as it will try to validate the cache, sees a 500

I believe the content cache can help through careful use of the stale-if-error Cache-Control directive (I'm 99% sure content cache supports it), which we can look into.
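
For reference, stale-if-error is a standard Cache-Control extension (RFC 5861). A hedged sketch of how a Flask view could opt in (the route and values are illustrative, not our actual configuration):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/engage/example")  # hypothetical route
def engage_example():
    response = jsonify({"title": "Example"})
    # Serve fresh for 5 minutes; if revalidation later fails with a 5xx,
    # allow the cache to keep serving the stale copy for up to a day.
    response.headers["Cache-Control"] = "max-age=300, stale-if-error=86400"
    return response
```

With a header like this, the cache keeps serving the last good copy instead of propagating the 500 to the end user.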

I also believe this is the right way to solve it. We've tried caching at the Flask layer before and it was a mess. Aside from the general problem of multiple caches compounding invalidation issues, which always leads to a bunch of problems, there's the issue that there are many replicas of the Flask app, which compounds things further (memory usage, etc.).

Basically, since we have a cache layer already, we should leverage it as much as we possibly can rather than adding more, and I believe in this case we absolutely can.

SirSamTumless commented 3 years ago

TBD in IS catch-up

4a6f656c commented 3 years ago

(copied over from #8686 (comment))

content-cache is designed as a frontend caching layer - its purpose is to reduce the number of requests to the web application and serve requests from a more performant location. It can additionally assist with failures at the web application by serving last known good content.

However, this is not intended as a mechanism to allow the web application to perform excessive requests against another service, effectively DoSing it. The web application needs to either cache this kind of data itself, or use a separate service to provide caching of internal/backend API requests on its behalf - the frontend content-cache is not the solution here.

hloeung commented 3 years ago

Just found this issue; I wish GitHub had some way of subscribing the IS team to issues of interest to us (much like with bugs filed on Launchpad). But that's another story.

The fix committed the other day (#8675) doesn't really avoid hitting the rate limit, but rather changes which rate limits apply. Specifying an API user and key gets us:

DISCOURSE_MAX_USER_API_REQS_PER_MINUTE: default 20
DISCOURSE_MAX_USER_API_REQS_PER_DAY: default 2880

Without them, per #8675, the applicable limits change to:

DISCOURSE_MAX_REQS_PER_IP_PER_MINUTE: number of requests per IP per minute (default is 200)
DISCOURSE_MAX_REQS_PER_IP_PER_10_SECONDS: number of requests per IP per 10 seconds (default is 50)

That's quite a jump, and we're probably safe for now, but just to be clear: rate limiting still applies. We may hit limits again when more posts/content from Discourse are used, when other sites are changed to rely on Discourse content, or when other environments sharing the same OpenStack neutron router are deployed against this same Discourse instance.
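
To make that concrete, a hedged sketch of which limit applies depending on whether the webapp sends API credentials (the topic URL and credentials are placeholders; Api-Key/Api-Username are the standard Discourse API headers):

```python
import requests

TOPIC_URL = "https://discourse.ubuntu.com/t/example/17229.json"  # illustrative

# With credentials, requests count against the much lower API-user limits
# (20/minute, 2880/day by default).
with_key = requests.get(TOPIC_URL, headers={
    "Api-Key": "<api-key>",        # placeholder
    "Api-Username": "<username>",  # placeholder
})

# Without credentials (the #8675 change), requests count against the
# per-IP limits instead (200/minute, 50 per 10 seconds by default).
anonymous = requests.get(TOPIC_URL)
```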

The current outstanding issues in Sentry show the following:

Sentry ubuntu-com/issues/12037/ 42k events, Discourse 17229.json
Sentry ubuntu-com/issues/11923/ 41k events, Discourse 11322.json
Sentry ubuntu-com/issues/12913/ 2.3k events, Discourse 17229.json
Sentry ubuntu-com/issues/12136/ 97 events, Discourse 17229.json
Sentry ubuntu-com/issues/12258/ 84 events, Discourse 17229.json
Sentry ubuntu-com/issues/12255/ 34 events, Discourse 17384.json

Having a look at a couple of them, they look to be mapping URLs to Discourse posts, and also serving content from Discourse? Is this set to replace the insights blog? How often is the content for these posts updated or changed? Perhaps we could consider baking them, or at least the mapping posts, into the webapp itself, fetching an updated mapping post only if a URL isn't found in the local/baked copy? We would have to deal with modifications to existing mappings though.
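
A hedged sketch of that baking idea (the file path, topic URL, and parser are all hypothetical):

```python
import json
import requests

MAPPING_TOPIC = "https://discourse.ubuntu.com/t/example-mapping/12345.json"  # hypothetical

def parse_mapping(topic_json):
    # Hypothetical parser: extract {url_path: topic_id} pairs from the
    # mapping post's body; the real structure depends on the post format.
    return {}

# Baked-in copy, generated at build time (path assumed for illustration).
with open("data/discourse-url-map.json") as f:
    url_map = json.load(f)

def resolve(path):
    """Check the baked mapping first; only hit Discourse on a miss."""
    if path in url_map:
        return url_map[path]
    # Miss: the mapping post may have gained entries since the bake.
    topic = requests.get(MAPPING_TOPIC, timeout=5).json()
    url_map.update(parse_mapping(topic))
    return url_map.get(path)
```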

Also, can we make it so that fetch failures show something else for that section of the page, rather than rendering a full 500 Internal Server Error to the client?
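
As a sketch of what that could look like (the route and template are hypothetical), the view would catch the fetch failure and let the template degrade that section:

```python
from flask import Flask, render_template
import requests

app = Flask(__name__)

@app.route("/blog")  # hypothetical route
def blog():
    try:
        posts = requests.get(
            "https://discourse.ubuntu.com/t/example/17229.json", timeout=5
        ).json()
    except requests.RequestException:
        posts = None  # degrade gracefully instead of bubbling up a 500
    # blog.html (hypothetical) would hide or placeholder the section
    # when posts is None, so the rest of the page still renders.
    return render_template("blog.html", posts=posts)
```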

If the plans are indeed to host content on Discourse and serve it, we'll likely want to look into either scaling Discourse or introducing some kind of proxy/caching between the webapp in K8s and Discourse (shameless plug: perhaps content-cache-k8s?). Or a webhook on Discourse post edits/changes that pulls content in and populates a DB.

nottrobin commented 3 years ago

As I mentioned in the IS catch-up, I am starting a document to explore this topic and highlight improvements we could make. Please feel free to contribute to it. We will discuss it again in our next catch-up this Friday.

I'll respond inline here, but I imagine that most of this discussion should move to the document from here on.

@4a6f656c: However, this is not intended as a mechanism to allow the web application to perform excessive requests against another service, effectively DoSing it.

I think you mean that we should not be relying on the content cache to routinely mask errors in our application; we should instead ensure our application doesn't have errors. Yes, that's a good point: while we might want to use stale-if-error to avoid showing users errors when things break, we should nonetheless do our best to ensure that nothing ever breaks.

With the content-cache in front of our websites, however, the number of requests made to Discourse should be relatively predictable (a function of the number of pages that make calls and the cache length on those pages). If we look into this, we should be able to determine whether this load is manageable or not.
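
For illustration, with assumed numbers: if roughly 40 pages each make one Discourse call and are cached for five minutes, the steady-state worst case is 40 / 5 = 8 requests per minute, comfortably below the 200-per-minute per-IP default mentioned above.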

As things stand, we believe the current default public limit of 200 requests per minute to the API is sufficient (the problems we had were because we were using the admin account limit of 60). So I'm only suggesting we consider using stale-if-error to avoid users seeing errors if we inadvertently break that limit. If we do, we should still ensure that we put in the necessary fixes.

@4a6f656c: The web application needs to either cache this kind of data itself, or use a separate service to provide caching of internal/backend API requests on its behalf

As I said, caching within the application proves problematic because, fundamentally, a Flask application unit is not built for caching at scale.

Putting a layer between our application and the API it consumes didn't work when we tried it before, because none of the available caching-layer technologies at the time (Squid, Varnish) could proxy-cache from HTTPS backends. I believe Nginx may be able to do it, and that's what you're using in content-cache, isn't it?

Because of this, I think the best solution would be to make use of content-cache in front of the Discourse API so that it can handle more calls. This is effectively how it was solved in snapcraft.io, with the Snap Store adding a Squid cache in front of its API to remain responsive. So if we genuinely need more calls than Discourse can naturally handle, I would suggest that we first ensure Discourse is behind the content cache (I don't know whether it is right now), and then either:

Also, can we make it so that fetch failures show something else for that section of the page, rather than rendering a full 500 Internal Server Error to the client?

Yes, this we absolutely should do. I've filed https://github.com/canonical-web-and-design/ubuntu.com/issues/8700 to track it.

stale[bot] commented 3 years ago

🧹 This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

hloeung commented 3 years ago

Bump, it's still an issue and shouldn't be closed due to inactivity.

pmahnke commented 3 years ago

@nottrobin please let's decide the best thing to do on this.

petesfrench commented 3 weeks ago

I have reached out to the Infrastructure guild to see what we should do about this: https://chat.canonical.com/canonical/channels/web--design-infrastructure-guild

samhotep commented 3 weeks ago

Hello all,

I imagine that the other APIs we access (WordPress, GitHub, etc.) would benefit from having a cache in front of them, and we could also benefit from having a dedicated service where all these APIs are defined and managed, separate from the cache that serves the websites, considering that the content-cache is deployed as a CDN.

For example, using the content-cache could be problematic if we need to perform actions like busting the cache frequently. Having a dedicated service means we can quickly experiment with different cache values and algorithms, set up refresh scripts and alerts, and collect metrics specifically about our API usage, possibly eventually forwarding them to Prometheus.

I think it's at least worth looking into, and I've created a spec here.