blakearchive / archive

GNU General Public License v2.0

flask cache headers #521

Open ba001 opened 7 years ago

ba001 commented 7 years ago

@nathan-rice i asked David Romani to implement caching until modification, but he noticed flask is caching at a max-age of 12 hours, and apache can't override it. i'm new to flask caching, so i don't know how to disable it and allow for apache caching, or change it to cache until modification. would you mind looking into this or letting me know what to do?

ba001 commented 7 years ago

hmm, maybe this isn't the best way to go since most of the stuff coming out of flask is dynamically generated. is there a way to cache all the dynamically generated stuff until modification? it rarely changes. the only big changes we make are new publications.

nathan-rice commented 7 years ago

While the content generated by flask is dynamic, the same content will always be served from the same URL. Thus caching is a perfectly valid strategy.

The caching David is talking about isn't on our side, it's mainly active on the client side. It is basically a variable set in the HTTP response that tells a client when it should re-use previously downloaded data. David can override whatever variables flask sets in apache, or I can change the value set by flask in the first place.

For more information, see cache control.

ba001 commented 7 years ago

@nathan-rice ok, apparently he tried to override it but couldn't. could you change the value to cache until modification?

ba001 commented 7 years ago

@nathan-rice here's an email he wrote to me today:

Mike

Apache will not override the cache control headers when they are coming from a CGI engine.

At the moment I am seeing that 12 hour expiration.

Date: Thu, 10 Aug 2017 18:03:00 GMT
Expires: Fri, 11 Aug 2017 06:03:00 GMT

In places where we set this we use mod_expires, and I see headers like

If-Modified-Since: Thu, 31 Jul 2014 13:56:41 GMT

which is the mod date on the file
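As a quick check of the 12 hour figure, the two header values quoted in the email can be subtracted with the standard library. This is only a sketch; the function name `freshness_window` is made up for illustration.

```python
from email.utils import parsedate_to_datetime

# Header values copied from the email above.
date_header = "Thu, 10 Aug 2017 18:03:00 GMT"
expires_header = "Fri, 11 Aug 2017 06:03:00 GMT"


def freshness_window(date_value, expires_value):
    """Return how long a client may reuse the response, as a timedelta."""
    return parsedate_to_datetime(expires_value) - parsedate_to_datetime(date_value)


window = freshness_window(date_header, expires_header)
print(window.total_seconds() / 3600)  # 12.0 hours, i.e. 43200 seconds
```

43200 seconds is the same value a `max-age` directive would carry for a 12 hour lifetime.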

nathan-rice commented 7 years ago

Yeah, mod_expires does look pretty gimped. In nginx it's easy.

If I set the cache to validate, that isn't really going to speed up the site. It'll only save a little bandwidth when the data hasn't changed, because the server still has to look at the data to determine whether a change has taken place. If I leave validation off, the site will be much faster whenever responses are cached. The downside is that the longer the cache time, the more likely it is that we'll update the server code and a client will keep using a stale cached response.
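To make the trade-off concrete, here is a minimal, framework-independent sketch of validation using an ETag. It is not the archive's code; `make_etag` and `respond` are hypothetical helpers, and hashing the body is just one common way to build a validator. Note that the server still has to produce the body to compute the hash, so the saving is bandwidth, not server work.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Build a strong validator from the response body (quoted, as ETags are)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a conditional GET.

    The body has already been generated by the time we can compare ETags,
    which is why validation alone does not make the server faster.
    """
    etag = make_etag(body)
    # "no-cache" lets the client store the response but forces revalidation.
    headers = {"ETag": etag, "Cache-Control": "no-cache"}
    if if_none_match == etag:
        return 304, headers, b""  # unchanged: send headers only
    return 200, headers, body     # changed or first request: send everything


status, headers, payload = respond(b'{"title": "example"}')
status_again, _, payload_again = respond(b'{"title": "example"}', headers["ETag"])
```

Without validation the client never makes the second request at all until the expiration time passes, which is where the speedup comes from.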

My advice is to leave off validation and extend the expiration time to 2-3 days. If you'd like me to do that, it's fairly easy.

ba001 commented 7 years ago

ok, but does that mean it will take 2 or 3 days for a client's browser to pick up an edit we make?

nathan-rice commented 7 years ago

Yes, if apache is caching the responses. If apache isn't caching the responses, then only if the client has cached the resource recently, and clients can manually purge their cache so it isn't a big problem.

Of course, you could just update the deploy scripts to flush the apache cache, side-stepping the problem.

ba001 commented 7 years ago

interesting. let's do that, extend the expiration time to a long duration and use the deploy scripts to flush the apache cache. would it be possible to flush only what we've modified, or would we have to flush the whole cache?

nathan-rice commented 7 years ago

I don't know if conditional flushing is possible. Take a look at the docs and see.

ba001 commented 7 years ago

this looks promising:

-i Be intelligent and run only when there was a modification of the disk cache. This option is only possible together with the -d option.

ba001 commented 7 years ago

oh wait, that's not quite what we want--nevermind

nathan-rice commented 7 years ago

I've modified flask to put a year lifetime on everything. You'll need to add the htcacheclean command to the deploy scripts.
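For reference, one way a change like this can be made in Flask is an `after_request` hook. This is a sketch under assumptions, not the actual commit: the route `/api/ping` and the name `add_cache_headers` are made up, and limiting the header to 200 responses is a choice made here so that error pages are not cached for a year.

```python
from flask import Flask

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31536000

app = Flask(__name__)


@app.route("/api/ping")  # hypothetical route, only here so the sketch runs
def ping():
    return "pong"


@app.after_request
def add_cache_headers(response):
    """Attach a one-year lifetime to every successful response."""
    if response.status_code == 200:
        response.cache_control.public = True
        response.cache_control.max_age = ONE_YEAR_SECONDS
    return response
```

With a lifetime this long, anything already cached stays cached until it is explicitly flushed, which is why the deploy scripts need the htcacheclean step.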

ba001 commented 7 years ago

wait, don't do that yet. could you revert that change? let's set that when @queryluke can update the deploy scripts--i'm not sure how to do that. but if you do, go for it.

also, here's what i think we'd need for targeted cleaning:

Deleting a specific URL

If htcacheclean is passed one or more URLs, each URL will be deleted from the cache. If multiple variants of an URL exists, all variants would be deleted.

When a reverse proxied URL is to be deleted, the effective URL is constructed from the Host header, the port, the path and the query. Note the '?' in the URL must always be specified explicitly, whether a query string is present or not. For example, an attempt to delete the path / from the server localhost, the URL to delete would be http://localhost:80/?.
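A deploy script could build those URLs along these lines. This is a rough sketch based only on the passage quoted above: `cache_key_url` and `htcacheclean_command` are hypothetical helpers, the cache path is a placeholder, and defaulting https to port 443 is an assumption that should be checked against how the cache actually keys entries.

```python
from urllib.parse import urlsplit


def cache_key_url(url: str) -> str:
    """Rebuild a URL as scheme://host:port/path?query, always keeping the '?'."""
    parts = urlsplit(url)
    # Assumption: fall back to the scheme's usual port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    return f"{parts.scheme}://{parts.hostname}:{port}{path}?{parts.query}"


def htcacheclean_command(cache_root: str, urls):
    """Return the argv list for deleting specific URLs from the disk cache."""
    return ["htcacheclean", f"-p{cache_root}", *[cache_key_url(u) for u in urls]]


# Placeholder cache directory; the real one depends on the Apache config.
command = htcacheclean_command("/path/to/cache", ["http://localhost/"])
print(command)  # ['htcacheclean', '-p/path/to/cache', 'http://localhost:80/?']
```

The list can be handed to `subprocess.run(command, check=True)` from the deploy script once the list of modified URLs is known.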