alibaba / tengine

A distribution of Nginx with some advanced features
https://tengine.taobao.org
BSD 2-Clause "Simplified" License
12.83k stars 2.52k forks

proxy_cache: "could not allocate node in cache keys zone" error #1035

Open yourchanges opened 6 years ago

yourchanges commented 6 years ago

After upgrading to the latest 2.2.2, our servers frequently log the error "could not allocate node in cache keys zone", and users see a 500 error directly.

Relevant configuration:

proxy_temp_path   /home/mem_cache/temp; 
proxy_cache_path  /home/mem_cache/path levels=1:2 keys_zone=cache_one:4g inactive=7d max_size=20g; 

Relevant error log:

[root@linux6 ~]# cat /home/logs/error.log | grep "could not allo" | tail
2018/04/09 18:01:51 [alert] 2036#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 21518#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 2036#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 21518#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 21518#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 2035#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 21518#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 21518#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 2036#0: could not allocate node in cache keys zone "cache_one"
2018/04/09 18:01:51 [alert] 2030#0: could not allocate node in cache keys zone "cache_one"

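nginx has to be restarted before it can serve requests again.

Traffic did not change significantly before and after the upgrade, and keys_zone=cache_one:4g is already 4GB. The system is 64-bit CentOS 6.9 with 64GB of memory. Memory usage at the time of the error: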

[root@linux6 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         64375      47377      16997      27311        311      42650
-/+ buffers/cache:       4416      59959
Swap:         5999         22       5977
[root@linux6 ~]# free -g
             total       used       free     shared    buffers     cached
Mem:            62         46         16         26          0         41
-/+ buffers/cache:          4         58
Swap:            5          0          5

We have 4 nginx machines, and since the upgrade they report this error sporadically.

yourchanges commented 6 years ago

Even a reload does not restore normal operation; nginx must be restarted. Tengine build details (jemalloc is not used):

[root@linux6 ~]# nginx -V
Tengine version: Tengine/2.2.2 (nginx/1.8.1)
built by gcc 4.4.7 20120313 (Red Hat 4.4.7-18) (GCC) 
TLS SNI support enabled
configure arguments: --prefix=/usr/local/nginx --with-http_stub_status_module --with-http_gzip_static_module --with-http_concat_module --with-http_ssl_module --with-http_v2_module --with-openssl=../openssl-1.0.2n --with-http_lua_module --with-luajit-lib=/usr/local/lib/ --with-luajit-inc=/usr/local/include/luajit-2.0/ --with-lua-inc=/usr/local/include/luajit-2.0/ --with-lua-lib=/usr/local/lib/ --with-ld-opt=-Wl,-rpath,
nginx: loaded modules:
nginx:     ngx_core_module (static)
nginx:     ngx_errlog_module (static)
nginx:     ngx_conf_module (static)
nginx:     ngx_dso_module (static)
nginx:     ngx_events_module (static)
nginx:     ngx_event_core_module (static)
nginx:     ngx_epoll_module (static)
nginx:     ngx_procs_module (static)
nginx:     ngx_proc_core_module (static)
nginx:     ngx_openssl_module (static)
nginx:     ngx_regex_module (static)
nginx:     ngx_http_module (static)
nginx:     ngx_http_core_module (static)
nginx:     ngx_http_log_module (static)
nginx:     ngx_http_upstream_module (static)
nginx:     ngx_http_v2_module (static)
nginx:     ngx_http_static_module (static)
nginx:     ngx_http_gzip_static_module (static)
nginx:     ngx_http_autoindex_module (static)
nginx:     ngx_http_index_module (static)
nginx:     ngx_http_concat_module (static)
nginx:     ngx_http_auth_request_module (static)
nginx:     ngx_http_auth_basic_module (static)
nginx:     ngx_http_access_module (static)
nginx:     ngx_http_limit_conn_module (static)
nginx:     ngx_http_limit_req_module (static)
nginx:     ngx_http_geo_module (static)
nginx:     ngx_http_map_module (static)
nginx:     ngx_http_split_clients_module (static)
nginx:     ngx_http_referer_module (static)
nginx:     ngx_http_rewrite_module (static)
nginx:     ngx_http_ssl_module (static)
nginx:     ngx_http_proxy_module (static)
nginx:     ngx_http_fastcgi_module (static)
nginx:     ngx_http_uwsgi_module (static)
nginx:     ngx_http_scgi_module (static)
nginx:     ngx_http_memcached_module (static)
nginx:     ngx_http_empty_gif_module (static)
nginx:     ngx_http_browser_module (static)
nginx:     ngx_http_user_agent_module (static)
nginx:     ngx_http_upstream_hash_module (static)
nginx:     ngx_http_upstream_ip_hash_module (static)
nginx:     ngx_http_upstream_consistent_hash_module (static)
nginx:     ngx_http_upstream_check_module (static)
nginx:     ngx_http_upstream_least_conn_module (static)
nginx:     ngx_http_upstream_keepalive_module (static)
nginx:     ngx_http_upstream_dynamic_module (static)
nginx:     ngx_http_stub_status_module (static)
nginx:     ngx_http_write_filter_module (static)
nginx:     ngx_http_header_filter_module (static)
nginx:     ngx_http_chunked_filter_module (static)
nginx:     ngx_http_v2_filter_module (static)
nginx:     ngx_http_range_header_filter_module (static)
nginx:     ngx_http_gzip_filter_module (static)
nginx:     ngx_http_postpone_filter_module (static)
nginx:     ngx_http_ssi_filter_module (static)
nginx:     ngx_http_charset_filter_module (static)
nginx:     ngx_http_userid_filter_module (static)
nginx:     ngx_http_footer_filter_module (static)
nginx:     ngx_http_trim_filter_module (static)
nginx:     ngx_http_headers_filter_module (static)
nginx:     ngx_http_upstream_session_sticky_module (static)
nginx:     ngx_http_reqstat_module (static)
nginx:     ngx_http_lua_module (static)
nginx:     ngx_http_copy_filter_module (static)
nginx:     ngx_http_range_body_filter_module (static)
nginx:     ngx_http_not_modified_filter_module (static)
[root@linux6 ~]# 

yourchanges commented 6 years ago

There is one more problem. For the same cache key:

www.9ji.com/static/style/vipcss.css?v=9

there are 4 files:

[root@linux1 cachekey]# cat /home/mem_cache/path/1/29/ac8053ca8e6a99ff96543069a12b9291 | head -16
\Z��ZI�h��0"029d32ca68dd31:0"Accept-Encoding��Sʎj���T0i�+��
KEY: www.9ji.com/static/style/vipcss.css?v=9
HTTP/1.1 200 OK
Cache-Control: max-age=86400
Content-Type: text/css
Content-Encoding: gzip
Last-Modified: Mon, 15 Jan 2018 02:11:38 GMT
Accept-Ranges: bytes
ETag: "029d32ca68dd31:0"
Vary: Accept-Encoding
Server: Microsoft-IIS/8.0
VaryTypeServer: web2
VaryType: Main
Access-Control-Allow-Origin: *
Date: Fri, 30 Mar 2018 01:33:58 GMT
Content-Length: 8648
[root@linux1 cachekey]# cat /home/mem_cache/path/6/6a/a0fc297b86c9d2d60f3db94907e216a6 | head -16
\Z�d�ZI�h��0"029d32ca68dd31:0"Accept-Encoding�1w��R�'zyʁ@
KEY: www.9ji.com/static/style/vipcss.css?v=9
HTTP/1.1 200 OK
Cache-Control: max-age=86400
Content-Type: text/css
Content-Encoding: gzip
Last-Modified: Mon, 15 Jan 2018 02:11:38 GMT
Accept-Ranges: bytes
ETag: "029d32ca68dd31:0"
Vary: Accept-Encoding
Server: Microsoft-IIS/8.0
VaryTypeServer: web2
VaryType: Main
Access-Control-Allow-Origin: *
Date: Mon, 26 Mar 2018 03:10:46 GMT
Content-Length: 8648
[root@linux1 cachekey]# cat /home/mem_cache/path/b/3b/fb2acaee2e841e4daa669d74e05d73bb | head -16
\Z0��ZI�h��0"029d32ca68dd31:0"Accept-Encoding�*��.�M�f�t�]s�
KEY: www.9ji.com/static/style/vipcss.css?v=9
HTTP/1.1 200 OK
Cache-Control: max-age=86400
Content-Type: text/css
Content-Encoding: gzip
Last-Modified: Mon, 15 Jan 2018 02:11:38 GMT
Accept-Ranges: bytes
ETag: "029d32ca68dd31:0"
Vary: Accept-Encoding
Server: Microsoft-IIS/8.0
VaryTypeServer: web2
VaryType: Main
Access-Control-Allow-Origin: *
Date: Fri, 30 Mar 2018 00:56:15 GMT
Content-Length: 8648
[root@linux1 cachekey]# cat /home/mem_cache/path/b/36/63bf5eb713ef8013f42b371546ed336b | head -16
\Zr��ZI�h��0"029d32ca68dd31:0"Accept-Encodingc�^����+7F�3k
KEY: www.9ji.com/static/style/vipcss.css?v=9
HTTP/1.1 200 OK
Cache-Control: max-age=86400
Content-Type: text/css
Content-Encoding: gzip
Last-Modified: Mon, 15 Jan 2018 02:11:38 GMT
Accept-Ranges: bytes
ETag: "029d32ca68dd31:0"
Vary: Accept-Encoding
Server: Microsoft-IIS/8.0
VaryTypeServer: web2
VaryType: Main
Access-Control-Allow-Origin: *
Date: Sat, 24 Mar 2018 02:09:53 GMT
Content-Length: 8648
[root@linux1 cachekey]# 

[root@linux1 ~]# 

I don't know the cause; the size and ETag are identical in all of them.
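
The four paths listed above are consistent with the nginx directory layout for levels=1:2: the file name is a 32-character MD5 hex digest, the first directory level is its last character, and the second level is the two characters before that. A minimal Python sketch that rebuilds the paths from the file names (the function name cache_file_path is hypothetical, chosen only for this illustration):

```python
def cache_file_path(base: str, digest: str, levels=(1, 2)) -> str:
    """Build the on-disk path of a cache file under nginx-style levels.

    Directory names are cut from the end of the hex digest, moving left.
    """
    parts = []
    end = len(digest)
    for width in levels:
        parts.append(digest[end - width:end])
        end -= width
    return "/".join([base.rstrip("/"), *parts, digest])


# File names copied from the listing above.
for name in (
    "ac8053ca8e6a99ff96543069a12b9291",
    "a0fc297b86c9d2d60f3db94907e216a6",
    "fb2acaee2e841e4daa669d74e05d73bb",
    "63bf5eb713ef8013f42b371546ed336b",
):
    print(cache_file_path("/home/mem_cache/path", name))
```

Four different digests mean that nginx stored these as four separate cache entries, even though the KEY line inside each file is the same.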

yourchanges commented 6 years ago

+1

yourchanges commented 6 years ago

+1

yourchanges commented 6 years ago

?

u-kyou commented 6 years ago

Has this problem been solved yet? I am running into the same issue.

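chobits commented 6 years ago

  1. You see this error because the new version added an error-log line specifically for this case. Earlier versions did not log it, but the logic is unchanged.
  2. The cause is that tengine/nginx failed to allocate a key node from shared memory. Note that shared memory becomes fragmented in real use. There is no particularly good fix: the community has optimized this but has not solved it at the root (the optimization reduces fragmentation, mainly by trying to merge adjacent free fragments when memory is freed, but it cannot eliminate fragmentation entirely).
  3. The best option for now is to enlarge the zone first. However, if keys are updated very frequently (or the service runs long enough for updates to accumulate), fragmentation still builds up, so this is not a complete fix either.
  4. The diff between 2.1.2 and 2.2.2 is as follows: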
@@ -687,9 +849,11 @@ ngx_http_file_cache_exists(ngx_http_file

         ngx_shmtx_lock(&cache->shpool->mutex);

-        fcn = ngx_slab_alloc_locked(cache->shpool,
-                                    sizeof(ngx_http_file_cache_node_t));
+        fcn = ngx_slab_calloc_locked(cache->shpool,
+                                     sizeof(ngx_http_file_cache_node_t));
         if (fcn == NULL) {
+            ngx_log_error(NGX_LOG_ALERT, ngx_cycle->log, 0,
+                          "could not allocate node%s", cache->shpool->log_ctx);
             rc = NGX_ERROR;
             goto failed;
         }
  5. Explanation of the nginx shared memory fragmentation problem and the optimization patch: http://mailman.nginx.org/pipermail/nginx-devel/2014-May/005406.html
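
To illustrate the fragmentation point, here is a toy first-fit allocator in Python. It is not nginx's slab allocator; the class, names, and sizes are invented for illustration. It only shows how a pool can have plenty of free space in total and still fail an allocation because no contiguous run is large enough.

```python
class ToyPool:
    """Fixed-size pool of unit cells with first-fit allocation (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.cells = [None] * size  # None marks a free cell

    def alloc(self, tag: str, length: int) -> bool:
        """Claim `length` contiguous cells for `tag`; return False if no run fits."""
        run = 0
        for i, owner in enumerate(self.cells):
            run = run + 1 if owner is None else 0
            if run == length:
                start = i - length + 1
                self.cells[start:i + 1] = [tag] * length
                return True
        return False

    def free(self, tag: str) -> None:
        """Release every cell owned by `tag`."""
        self.cells = [None if c == tag else c for c in self.cells]

    def free_cells(self) -> int:
        return self.cells.count(None)


pool = ToyPool(8)
for n in range(8):
    pool.alloc(f"k{n}", 1)      # fill the pool with one-cell entries
for n in range(0, 8, 2):
    pool.free(f"k{n}")          # free every other entry

print(pool.free_cells())        # 4 cells are free in total
print(pool.alloc("big", 2))     # False: no two adjacent cells are free
```

Half of the pool is free, yet the two-cell request fails. Enlarging the pool postpones this state but does not prevent it, which matches point 3.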
chobits commented 6 years ago

for https://github.com/alibaba/tengine/issues/1035#issuecomment-380045091

Answer: these are multiple expired files that were never actually deleted from disk.

cc @dengqian, please help confirm this detail.

yourchanges commented 6 years ago

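We currently work around it in two ways:

  1. Increase keys_zone=cache_one:4g to keys_zone=cache_one:8g
  2. We separately monitor the disk usage of the cache files and restart when it looks abnormal. This relieves both the keys zone memory fragmentation and the growth of the cache files (configured at 21G, actual usage can exceed 40G)

All we can say is that it works.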
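
For a rough sense of what the two zone sizes mean: the nginx documentation for proxy_cache_path states that a one-megabyte zone can store about 8 thousand keys. Assuming that ratio also holds for this Tengine build (an assumption, and fragmentation lowers the usable amount), a back-of-the-envelope estimate in Python looks like this. The function name approx_keys is hypothetical.

```python
KEYS_PER_MB = 8000  # approximate figure from the nginx proxy_cache_path documentation


def approx_keys(zone_mb: int) -> int:
    """Estimate how many cache keys fit in a keys zone of `zone_mb` megabytes."""
    return zone_mb * KEYS_PER_MB


for label, mb in (("4g", 4 * 1024), ("8g", 8 * 1024)):
    print(label, approx_keys(mb))
```

By this estimate the 4g zone holds roughly 32.8 million keys and the 8g zone roughly 65.5 million, so allocation failures at these sizes are more likely caused by fragmentation or slow cleanup than by the raw key count alone.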

Also, our cache directory /home/mem_path is mounted from the system shm. Thanks in advance to whoever investigates and confirms this.

yourchanges commented 6 years ago

It failed to hold up again on Double 11 (Singles' Day); it blew up within a short time.

dengqian commented 6 years ago

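1. On why one key has multiple cache files: the MD5 values computed for this key differ, as the different file names show. The requests most likely came from different clients that advertise different encodings, because the response carries Vary: Accept-Encoding. If the origin does not serve multiple variants, consider adding proxy_ignore_headers Vary; to the configuration. Otherwise the same file occupies several keys.

2. On the cache total exceeding the configured size: proxy_cache has a cache manager process that periodically removes expired files. The relevant configuration is:

proxy_cache_path path [levels=levels] [use_temp_path=on|off] keys_zone=name:size [inactive=time] [max_size=size] [manager_files=number] [manager_sleep=time] [manager_threshold=time] [loader_files=number] [loader_sleep=time] [loader_threshold=time] [purger=on|off] [purger_files=number] [purger_sleep=time] [purger_threshold=time];

Here manager_threshold (default 200ms) limits the duration of one cleanup iteration, manager_sleep (default 50ms) is the pause between two iterations, and manager_files (default 100) is the maximum number of files removed in one iteration. No files are removed during the sleep, so newly created files can push the cache above the configured size. If files are updated frequently, consider raising manager_files, or adjust the other settings as needed.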

yourchanges commented 6 years ago

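We have confirmed that the old servers send Vary: Accept-Encoding and nothing else.

By the usual logic of a cache server, shouldn't it only distinguish compressed from uncompressed, giving at most two copies? Yet we have 4 copies for the same key. Do the different algorithms listed in Accept-Encoding also take part in the MD5 hash of the key?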

yourchanges commented 6 years ago

Also, if I raise manager_files to, say, 10000, what happens when the work cannot be finished within one manager_threshold period?

yourchanges commented 6 years ago

Is there a recommended best-practice configuration?

dengqian commented 6 years ago

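We have confirmed that the old servers send Vary: Accept-Encoding and nothing else.

By the usual logic of a cache server, shouldn't it only distinguish compressed from uncompressed, giving at most two copies? Yet we have 4 copies for the same key. Do the different algorithms listed in Accept-Encoding also take part in the MD5 hash of the key?

The file name is computed from the request headers and the URI. Different browsers may send different encodings, so the MD5 that proxy_cache computes from the received request differs, and that is why there are multiple keys. The requests may carry more encoding variants than you actually serve, while you return only two forms of content. If everything is gzip anyway, ignoring this header should carry little risk.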
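
A simplified Python model of this effect follows. It is not nginx's actual hashing code: the function variant_name, the way the inputs are concatenated, and the sample header values are all assumptions made for illustration. It only shows that when the raw Accept-Encoding value takes part in the hash, every distinct header string yields a distinct file name, even if the origin returns the same gzip body for all of them.

```python
import hashlib


def variant_name(cache_key: str, vary_values: dict) -> str:
    """Hash the cache key together with the request headers named in Vary."""
    h = hashlib.md5(cache_key.encode())
    for header in sorted(vary_values):
        h.update(header.lower().encode())
        h.update(b":")
        h.update(vary_values[header].encode())
        h.update(b"\n")
    return h.hexdigest()


key = "www.9ji.com/static/style/vipcss.css?v=9"
# Hypothetical Accept-Encoding values sent by different clients.
client_headers = ["gzip", "gzip, deflate", "gzip, deflate, br", "gzip,deflate"]

names = {variant_name(key, {"Accept-Encoding": v}) for v in client_headers}
print(len(names))  # one file name per distinct header string
```

In this model, ignoring Vary corresponds to passing an empty dict, which collapses all clients onto a single file name.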

dengqian commented 6 years ago

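Also, if I raise manager_files to, say, 10000, what happens when the work cannot be finished within one manager_threshold period?

That value only sets an upper limit. Once the iteration's time is up, no more files are processed in that iteration. We lack hands-on experience with tuning proxy_cache ourselves, so try adjusting it and watch the effect.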
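
A small Python simulation of that answer follows, based on the semantics in the nginx documentation: an iteration ends when either manager_files files have been removed or manager_threshold has elapsed. The per-file deletion cost is a made-up number and the function run_iteration is hypothetical, not Tengine code. The parameters themselves exist only in nginx 1.11.5 and later, as the documentation quoted later in this thread notes.

```python
def run_iteration(pending: int, manager_files: int,
                  manager_threshold_ms: float, cost_per_file_ms: float) -> int:
    """Return how many files one cache manager iteration removes."""
    removed = 0
    elapsed = 0.0
    while removed < min(pending, manager_files):
        if elapsed >= manager_threshold_ms:
            break  # time budget used up; the rest waits for the next iteration
        elapsed += cost_per_file_ms
        removed += 1
    return removed


# 50,000 files over the limit, an assumed 1 ms per deletion, 200 ms threshold.
print(run_iteration(50_000, manager_files=100,
                    manager_threshold_ms=200, cost_per_file_ms=1))     # 100
print(run_iteration(50_000, manager_files=10_000,
                    manager_threshold_ms=200, cost_per_file_ms=1))     # 200
```

Under these assumed numbers, raising manager_files from 100 to 10000 only doubles the work done per iteration, because the time threshold becomes the binding limit. Raising manager_threshold or lowering manager_sleep would be the next knobs to try.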

yourchanges commented 6 years ago

Also, I added the setting, but it does not even pass the configuration test. It reports:

nginx: [emerg] invalid parameter "manager_files=500" in /usr/local/nginx/conf/proxy.conf:150

The version is the latest, 2.2.3.

The official documentation says:

The data is removed in iterations configured by manager_files, manager_threshold, and manager_sleep parameters (1.11.5)

So Tengine probably has not merged this feature yet, since it is currently based on nginx 1.8.

I'll wait patiently for you to upgrade.

yourchanges commented 6 years ago

I looked through the code and could not find this control logic either:

https://github.com/alibaba/tengine/blob/7c042373038ccada2c786c42bf4562f25366d7f2/src/http/ngx_http_file_cache.c#L1873

dengqian commented 6 years ago

I looked through the code and could not find this control logic either:

tengine/src/http/ngx_http_file_cache.c

Line 1873 in 7c04237

ngx_http_file_cache_manager(void *data)

The current version indeed does not support it yet.

yourchanges commented 6 years ago

When do you plan to upgrade, then?

dengqian commented 6 years ago

When do you plan to upgrade, then?

You will have to ask @chobits about that.

bukebuhao commented 5 years ago

Same problem here. The only option is to increase the memory.