Closed: suninuni closed this issue 4 months ago
Related: https://github.com/apache/apisix/issues/3917#issuecomment-921736339
@tzssangglass FYI
It keeps growing, FYI.
First of all, we need to check whether the error log is caused by Prometheus using memory without bound, or just by the default shared dict size being too small.
The configuration of the lua shared dict size is not available until 2.8, so you may need to modify the ngx_tpl.lua: https://github.com/apache/apisix/pull/4524
If the Prometheus memory usage grows without limit, it will eventually consume all of the configured memory. Otherwise, the memory usage plateaus at a certain level.
We also need to compare the Prometheus metrics before/after the HTTP requests. The Prometheus client is a well-known memory consumer. How many metrics/labels are there in the Prometheus metrics?
You can also use X-Ray to diagnose the memory issue: https://openresty.com.cn/cn/xray/. Note that I am not the developer of X-Ray (it is a commercial product developed by others).
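To tell those two cases apart in practice, one option is a small diagnostic snippet like the following (a sketch only: it assumes the dict is named prometheus-metrics and an OpenResty new enough to expose capacity()/free_space(); drop it into any Lua phase handler or a debug route):

-- Inspect how full the Prometheus shared dict is, to distinguish
-- "unbounded growth" from "the dict is just too small".
local dict = ngx.shared["prometheus-metrics"]
if dict then
    ngx.log(ngx.WARN, "prometheus-metrics dict: capacity=", dict:capacity(),
            " free_space=", dict:free_space(),
            -- get_keys(0) returns every key; fine for debugging, but it
            -- locks the dict, so do not leave it in production hot paths
            " keys=", #dict:get_keys(0))
end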
The configuration of the lua shared dict size is not available until 2.8, so you may need to modify the ngx_tpl.lua
We are now deploying apisix in a test environment, so I will upgrade apisix to 2.9 and retest next week.
How many metrics/labels are there in the Prometheus metrics?
I counted the number of metrics rows in two pods with different memory consumption. Is this magnitude acceptable?
memory 763m: 18869
memory 596m: 17924
IMO, this is not normal, but it still depends on the size of the lua shared dict assigned to prometheus in your runtime nginx.conf and on the number of concurrent requests. I suggest that you try the X-Ray tool that @spacewander mentioned.
@suninuni the latest version of APISIX is 2.11 now, can you give it a try with APISIX 2.11? We are not sure whether we have fixed this issue.
If you still have this problem, please let us know.
The same error log as before.
2021/12/30 02:36:43 [error] 50#50: *20414229 [lua] prometheus.lua:860: log_error(): Unexpected error adding a key: no memory while logging request
And now I have tried setting lua_shared_dict.prometheus-metrics to 100m, FYI.
Although I did not mention it before, I had also configured the dict before upgrading and still ran into the memory leak problem, FYI.
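For reference, a minimal sketch of what that override can look like in conf/config.yaml, assuming APISIX 2.8+ where the lua shared dict sizes became configurable (the exact nesting follows config-default.yaml, so please double-check it against your version):

nginx_config:
  http:
    lua_shared_dict:
      # size of the dict used by the prometheus plugin; raise it if the dict is simply too small
      prometheus-metrics: 100m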
And now I have tried setting lua_shared_dict.prometheus-metrics to 100m
It seems that the memory stops growing after it has increased to a certain value, and I have not seen the "no memory while logging request" log any more.
So it seems that this problem no longer exists with APISIX 2.11. I will continue to observe. Thanks for your support!
Unexpected error adding a key
So the memory increase is normal, and the "no memory" issue was just because the pre-defined lua shared dict was too small?
After a period of stability, the memory has continued increasing until OOM...
But there are no nginx metric errors like "Unexpected error adding a key".
@tokers @membphis FYI
So it seems that this problem no longer exists with APISIX 2.11. I will continue to observe. Thanks for your support!
Did you update your apisix to 2.11?
yes
Some monitoring data which may be helpful.
Finally, I solved this problem by removing the node label (balancer_ip) from the metrics in exporter.lua.
The root cause is that the key saved in the shared dict contains all of the labels, such as idx=__ngx_prom__key_35279, key=http_status{code="499",route="xxxx",matched_uri="/*",matched_host="xxxxx ",service="",consumer="",node="10.32.47.129"}.
In a k8s cluster, especially one where deployments are frequently updated, the node information keeps changing, so the dict keeps growing.
Maybe we could turn the node label into an extra_labels option and let the user decide whether this label is needed.
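To make the cardinality problem concrete, here is a small illustration using the underlying nginx-lua-prometheus API directly (a sketch, not the APISIX exporter code; the dict name and label values are made up):

-- Every distinct combination of label values becomes its own key in the
-- shared dict, so a label whose values never repeat, such as a pod IP in a
-- frequently redeployed cluster, makes the dict grow without bound.
-- Assumes a "lua_shared_dict prometheus-metrics <size>;" declaration and is
-- normally run from init_worker.
local prometheus = require("prometheus").init("prometheus-metrics")
local http_status = prometheus:counter("http_status", "HTTP status codes",
                                       {"code", "route", "node"})

-- Same code and route, but a new upstream node IP after every rollout:
http_status:inc(1, {"200", "my-route", "10.32.47.129"})  -- key #1 in the dict
http_status:inc(1, {"200", "my-route", "10.32.48.201"})  -- key #2, and so on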
Indeed, that's a real problem when users deploy Apache APISIX on Kubernetes.
@suninuni How did you remove the node label? Are you using a custom build of apisix?
I fixed this by adding the following module_hook:
local apisix = require("apisix")

-- Wrap http_init just to log that the hook file was actually loaded.
local old_http_init = apisix.http_init
apisix.http_init = function (...)
    ngx.log(ngx.EMERG, "Module hooks loaded")
    old_http_init(...)
end

-- Wrap the prometheus exporter's log phase and overwrite the balancer IP,
-- so the node label always carries a single constant value.
local exporter = require("apisix.plugins.prometheus.exporter")
local old_http_log = exporter.http_log
exporter.http_log = function (conf, ctx)
    ctx.balancer_ip = "_overwritten_"
    old_http_log(conf, ctx)
end
I put that in a ConfigMap and loaded it into the apisix config using the following settings in my helm chart values.yml file.
luaModuleHook:
  enabled: true
  luaPath: "/usr/local/apisix/apisix/module_hooks/?.lua"
  hookPoint: "module_hook"
  configMapRef:
    name: "apisix-module-hooks"
    mounts:
      - key: "module_hook.lua"
        path: "/usr/local/apisix/apisix/module_hooks/module_hook.lua"
I observed that it is actually the Apisix HTTP Latency metric whose bucket series are too numerous and keep increasing, causing memory to continue to rise; Apisix HTTP Status is not particularly large.
The Apisix Nginx Metric Errors total indicator also shows many errors, and the log reports errors like:
2023/07/06 15:41:08 [error] 48#48: *972511102 [lua] init.lua:187: http_ssl_phase(): failed to fetch ssl config: failed to find SNI: please check if the client requests via IP or uses an outdated protocol. If you need to report an issue, provide a packet capture file of the TLS handshake., context: ssl_certificate_by_lua*, client: 109.237.98.226, server: 0.0.0.0:9443
How should I optimize and solve this problem?
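One knob that may help with the latency series, if the prometheus plugin in your release supports the default_buckets attribute (please check the plugin documentation for your exact version; the bucket values below are only an example), is to reduce the number of histogram buckets in conf/config.yaml:

plugin_attr:
  prometheus:
    # fewer buckets mean fewer apisix_http_latency series per label combination
    default_buckets:
      - 100
      - 500
      - 1000
      - 5000

Reducing label cardinality (for example the node label discussed above) shrinks the series count in the same way, since every bucket is multiplied by every distinct label combination.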
This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Issue description
When the prometheus global plugin is enabled, the memory of apisix continues to grow abnormally.
Environment
apisix version: 2.7
Steps to reproduce
ab -n 2000000 -c 2000 http://xxxxx
Actual result
memory keeps growing
Error log
*45077733 [lua] prometheus.lua:860: log_error(): Unexpected error adding a key: no memory while logging request,
Expected result