A Brief Look at HTTP Caching in Practice

fi3ework commented 6 years ago

Some open questions

cache-control in the request vs. cache-control in the response

cache-control is a general header, so it can appear in both requests and responses. Suppose, then, that we write the following in the meta of an HTML page:

<meta name="Cache-Control" content="no-cache">

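but it disagrees with the cache-control that the server sets on the response for that HTML, as in the demo.

Which cache-control does the next request follow? In this example, on repeated visits the second load of the HTML is served from disk, which means the cache took effect.

MDN is vague on this point and only says that "a given directive in a request is not implying that the same directive is to be given in the response". I could not find a corresponding rule in the HTTP specification either.

After some digging, this appears to be a rule controlled by the server: the server can choose whether to ignore the cache-control in the request. If it does not ignore it, the request's cache-control rules apply; if it ignores it, the response's cache-control rules apply.

For example, Apache

For example, Nginx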
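
To make the "honor or ignore" rule above concrete, here is a minimal Python sketch of a cache deciding whether a stored copy can be reused. It is not how Apache or Nginx implement this; the function names and the honor_request switch are hypothetical, and only max-age, no-cache, and no-store are handled.

```python
def parse_cache_control(value):
    """Parse a Cache-Control header value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def can_reuse(stored_age, response_cc, request_cc="", honor_request=True):
    """Return True if a stored response may be served without contacting the origin.

    stored_age is the age of the stored copy in seconds.
    honor_request is a hypothetical switch modelling a server or cache that is
    configured to ignore Cache-Control sent by the client.
    """
    resp = parse_cache_control(response_cc)
    if "no-store" in resp or "no-cache" in resp:
        return False
    max_age = int(resp.get("max-age") or 0)

    if honor_request:
        req = parse_cache_control(request_cc)
        if "no-store" in req or "no-cache" in req:
            return False
        if req.get("max-age") is not None:
            # The client accepts only responses at most this old.
            max_age = min(max_age, int(req["max-age"]))

    return stored_age < max_age
```

With honor_request=False the request header has no effect, and only the response's cache-control decides whether the copy is reused.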

Strategies

No caching

expires: 0
pragma: no-cache
cache-control: no-store, no-cache, must-revalidate, proxy-revalidate

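no-store: tells browsers and caching servers not to store a copy; every request must fetch a fresh resource from the origin server.
no-cache: tells browsers and caching servers that, whether or not the local copy has expired, it must be validated with the origin server before it is used.
must-revalidate: tells browsers and caching servers that the local copy may be used until it expires; once it has expired, it must be validated with the origin server.
proxy-revalidate: the same as must-revalidate, but it applies only to shared caches (caching servers), not to the browser's private cache.

Together with expires and pragma for HTTP/1.0 compatibility, this disables all caching, so every request goes to the origin server.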
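
As a small illustration of this strategy, the sketch below overlays the three headers listed above on an existing header set. The function names are made up for this example, and the header dictionary stands in for whatever response object a real server framework provides.

```python
def no_cache_headers():
    """Headers for the 'no caching' strategy described above."""
    return {
        "Expires": "0",          # HTTP/1.0 compatibility
        "Pragma": "no-cache",    # HTTP/1.0 compatibility
        "Cache-Control": "no-store, no-cache, must-revalidate, proxy-revalidate",
    }


def apply_no_cache(headers):
    """Return a copy of `headers` with the no-cache headers overriding any existing ones."""
    merged = dict(headers)
    merged.update(no_cache_headers())
    return merged
```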

Long max-age + fingerprint

Cache-Control: max-age=31536000

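Setting max-age in Cache-Control to a very long time means that, as long as it has not expired, the resource is taken directly from the browser cache (from memory cache or from disk cache). On its own, however, this also rules out any chance of updating the resource.

The fix is to add a "fingerprint" to the resource URL, which can be a version number, a hash, an MD5 digest, a date, and so on.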

<script src="/script-f93bca2c.js"></script>
<link rel="stylesheet" href="/styles-a837cb1e.css">
<img src="/cats-0e9a2ef4.jpg" alt="…">

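Resource updates are then controlled by updating the HTML. The benefit is that while the HTML is unchanged, resources come straight from the browser cache, which avoids even 304 round trips and further reduces load on the server. When the HTML is updated, the resource file names change as well, and because the URI is different the browser naturally requests the new resource from the origin server.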
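
One common way to produce such fingerprints is to derive them from the file content, so the name changes exactly when the content does. The sketch below assumes a SHA-256 digest truncated to 8 hex characters; the fingerprint_name helper is hypothetical, and build tools usually do this step for you.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprint_name(path, content, length=8):
    """Insert a content hash before the extension: /script.js -> /script-<hash>.js"""
    digest = hashlib.sha256(content).hexdigest()[:length]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}-{digest}{p.suffix}"))


# The name is stable while the content is unchanged and differs after any edit.
old_name = fingerprint_name("/script.js", b"console.log(1)")
new_name = fingerprint_name("/script.js", b"console.log(2)")
```

The HTML then references the fingerprinted name, so a file served with max-age=31536000 never needs to be invalidated in place.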

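Zhihu & Juejin & GitHub

Let's analyze the HTTP caching practices of Zhihu, Juejin, and GitHub to see how these sites cache:

Zhihu

*.html

Zhihu's home page is generated dynamically on the server, so it uses the no-caching strategy.

.js, .css

.js and .css both use the long max-age + fingerprint strategy, with the HTML above controlling whether they are updated.

Static resources

Images and the like also use the long max-age + fingerprint strategy.

Ajax

For responses that involve a user's personal information, cache-control explicitly specifies private to keep caching servers from storing them, and then all local caching is disabled as well.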

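Juejin

*.html

Juejin's HTML only sets private in Cache-Control, which forbids caching servers from storing it, but it does not specify max-age either, so each visit still goes to the origin server.

According to the description in MSDN, though, the default value of Cache-Control is private, so leaving it out should also be fine.

.js, .css

For fingerprinted files the strategy is similar to Zhihu's, with an additional public. What is it for?

Static resources

Resource files are fingerprinted, so the long max-age + fingerprint strategy is used, again with an extra public?

Ajax

Uses the long max-age + fingerprint strategy and does not allow caching servers to store the response.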

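GitHub

*.html

Uses no-cache. This still makes use of caching servers: before sending back its copy, a caching server first checks with the origin server whether the cached copy is still valid. If it is, the copy is returned to the browser; otherwise a fresh copy has to be fetched from the origin server.

.js, .css

Same as Juejin.

Static resources

Unlike the previous two sites, GitHub's resource URIs take the form {id}?s={size}&v={version}. Without a fingerprint, the resource has to be refreshed promptly whenever it changes.

GitHub caches these for five minutes by default. Within five minutes the resource is taken directly from the local browser cache. After five minutes the browser revalidates it using ETag and Last-Modified: if the resource has changed, the origin server returns 200 with a new copy; if it has not, the server returns 304.

Ajax

Same as html.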
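
The revalidation step described above can be sketched from the server side as follows. This is a simplified illustration, not GitHub's implementation: the respond function is hypothetical, it handles only a single strong ETag in If-None-Match, and it ignores Last-Modified.

```python
def respond(request_headers, current_etag, body):
    """Return (status, headers, body) for a possibly conditional GET."""
    headers = {
        "ETag": current_etag,
        "Cache-Control": "max-age=300",  # five minutes of freshness
    }
    if request_headers.get("If-None-Match") == current_etag:
        # The client's copy is still current: no body is sent.
        return 304, headers, b""
    # First visit, or the resource changed: send the full representation.
    return 200, headers, body
```

A browser whose copy is older than five minutes sends If-None-Match with the ETag it stored; a 304 lets it keep using that copy for another five minutes.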

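Summary

There is no silver bullet for HTTP caching; the only option is to look for the configuration that best fits the characteristics of the current business and the way the back-end resources are set up.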

yifei-zhan commented 6 years ago

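On that open question, my understanding is that both the request cache-control and the response cache-control are simply a way of passing messages between server <-> browser/client.

The request cache-control is a bit like telling the server "please set your cache-control to mine". Once the server receives it, it can follow the request or not (setting its own value or using a default; under express the default is public, max-age=0).

In the end the browser gets the response, and the browser's caching policy is determined by that response's cache-control, so it is still the response cache-control that has the final say.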

fi3ework commented 6 years ago

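@Kiiiwiii I agree with you, and I have just added

In this example, on repeated visits the second load of the HTML is served from disk, which means the cache took effect.

to back up your view 😄.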

LiuL0703 commented 5 years ago

Here is what RFC-2616 says about that open question; the gist is about the same as what you two discussed

Cache directives are unidirectional in that the presence of a directive in a request does not imply that the same directive is to be given in the response.

For a cache-request-directive, taking max-age as an example:

max-age Indicates that the client is willing to accept a response whose age is no greater than the specified time in seconds. Unless max-stale directive is also included, the client is not willing to accept a stale response
