Closed: suninuni closed this issue 4 months ago
Related: https://github.com/apache/apisix/issues/3917#issuecomment-921736339
@tzssangglass FYI
It keeps growing, FYI.
First of all, we need to check whether the error log is caused by Prometheus using memory without bound, or just by the default shared dict size being too small.
The configuration of the lua shared dict size is not available until 2.8, so you may need to modify the ngx_tpl.lua: https://github.com/apache/apisix/pull/4524
If the Prometheus memory usage grows without limit, it will eventually consume all of the configured memory. Otherwise, the memory usage plateaus at a certain level.
We also need to compare the Prometheus metrics before/after the HTTP requests. The Prometheus client is a well-known memory consumer. How many metrics/labels are there in the Prometheus metrics?
You can also use X-Ray to diagnose the memory issue: https://openresty.com.cn/cn/xray/. Note that I am not the developer of X-Ray (it is a commercial product developed by others).
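To tell those two cases apart in practice, one option is a small diagnostic snippet like the following (a sketch only: it assumes the dict is named prometheus-metrics and an OpenResty new enough to expose capacity()/free_space(); drop it into any Lua phase handler or a debug route):

-- Inspect how full the Prometheus shared dict is, to distinguish
-- "unbounded growth" from "the dict is just too small".
local dict = ngx.shared["prometheus-metrics"]
if dict then
    ngx.log(ngx.WARN, "prometheus-metrics dict: capacity=", dict:capacity(),
            " free_space=", dict:free_space(),
            -- get_keys(0) returns every key; fine for debugging, but it
            -- locks the dict, so do not leave it in production hot paths
            " keys=", #dict:get_keys(0))
end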
The configuration of the lua shared dict size is not available until 2.8, so you may need to modify the ngx_tpl.lua
We are now deploying apisix in a test environment, so I will upgrade apisix to 2.9 and retest next week.
How many metrics/labels are there in the Prometheus metrics?
I counted the number of metrics rows in two pods with different memory consumption. Is this magnitude acceptable?
memory 763m: 18869
memory 596m: 17924
IMO, this is not normal, but it still depends on the size of the lua shared dict assigned to prometheus in your runtime nginx.conf and on the number of concurrent requests. I suggest that you try the X-Ray tool that @spacewander mentioned.
@suninuni the latest version of APISIX is 2.11 now, can you give it a try with APISIX 2.11? We are not sure whether we have fixed this issue.
If you still have this problem, please let us know.
The same error log as before.
2021/12/30 02:36:43 [error] 50#50: *20414229 [lua] prometheus.lua:860: log_error(): Unexpected error adding a key: no memory while logging request
And now I have tried setting lua_shared_dict.prometheus-metrics to 100m, FYI.
Although I did not mention it before, I had also configured the dict before upgrading and still ran into the memory leak problem, FYI.
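For reference, a minimal sketch of what that override can look like in conf/config.yaml, assuming APISIX 2.8+ where the lua shared dict sizes became configurable (the exact nesting follows config-default.yaml, so please double-check it against your version):

nginx_config:
  http:
    lua_shared_dict:
      # size of the dict used by the prometheus plugin; raise it if the dict is simply too small
      prometheus-metrics: 100m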
And now I have tried setting lua_shared_dict.prometheus-metrics to 100m
It seems that the memory stops growing after it has increased to a certain value, and I have not seen the "no memory while logging request" log any more.
So it seems that this problem no longer exists with APISIX 2.11. I will continue to observe. Thanks for your support!
Unexpected error adding a key
So the memory increase is normal, and the "no memory" issue was just because the pre-defined lua shared dict was too small?
After a period of stability, the memory has continued increasing until OOM...
But there are no nginx metric errors like "Unexpected error adding a key".
@tokers @membphis FYI
So it seems that this problem no longer exists with APISIX 2.11. I will continue to observe. Thanks for your support!
Did you update your apisix to 2.11?
yes
Some monitoring data which may be helpful.
Finally, I solved this problem by removing the node label (balancer_ip) from the metrics in exporter.lua.
The root cause is that the key saved in the shared dict contains all of the labels, such as idx=__ngx_prom__key_35279, key=http_status{code="499",route="xxxx",matched_uri="/*",matched_host="xxxxx ",service="",consumer="",node="10.32.47.129"}.
In a k8s cluster, especially one where deployments are frequently updated, the node information keeps changing, so the dict keeps growing.
Maybe we could turn the node label into an extra_labels option and let the user decide whether this label is needed.
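To make the cardinality problem concrete, here is a small illustration using the underlying nginx-lua-prometheus API directly (a sketch, not the APISIX exporter code; the dict name and label values are made up):

-- Every distinct combination of label values becomes its own key in the
-- shared dict, so a label whose values never repeat, such as a pod IP in a
-- frequently redeployed cluster, makes the dict grow without bound.
-- Assumes a "lua_shared_dict prometheus-metrics <size>;" declaration and is
-- normally run from init_worker.
local prometheus = require("prometheus").init("prometheus-metrics")
local http_status = prometheus:counter("http_status", "HTTP status codes",
                                       {"code", "route", "node"})

-- Same code and route, but a new upstream node IP after every rollout:
http_status:inc(1, {"200", "my-route", "10.32.47.129"})  -- key #1 in the dict
http_status:inc(1, {"200", "my-route", "10.32.48.201"})  -- key #2, and so on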
Indeed, that's a real problem when users deploy Apache APISIX on Kubernetes.
@suninuni How did you remove the node label? Are you using a custom build of apisix?
I fixed this by adding the following module_hook:
local apisix = require("apisix")

-- Wrap http_init just to log that the hook file was actually loaded.
local old_http_init = apisix.http_init
apisix.http_init = function (...)
    ngx.log(ngx.EMERG, "Module hooks loaded")
    old_http_init(...)
end

-- Wrap the prometheus exporter's log phase and overwrite the balancer IP,
-- so the node label always carries a single constant value.
local exporter = require("apisix.plugins.prometheus.exporter")
local old_http_log = exporter.http_log
exporter.http_log = function (conf, ctx)
    ctx.balancer_ip = "_overwritten_"
    old_http_log(conf, ctx)
end
I put that in a ConfigMap and loaded it into the apisix config using the following settings in my helm chart values.yml file.
luaModuleHook:
  enabled: true
  luaPath: "/usr/local/apisix/apisix/module_hooks/?.lua"
  hookPoint: "module_hook"
  configMapRef:
    name: "apisix-module-hooks"
    mounts:
      - key: "module_hook.lua"
        path: "/usr/local/apisix/apisix/module_hooks/module_hook.lua"
I observed that it is actually the Apisix HTTP Latency metric whose bucket series are too numerous and keep increasing, causing memory to continue to rise; Apisix HTTP Status is not particularly large.
The Apisix Nginx Metric Errors total indicator also shows many errors, and the log reports errors like:
2023/07/06 15:41:08 [error] 48#48: *972511102 [lua] init.lua:187: http_ssl_phase(): failed to fetch ssl config: failed to find SNI: please check if the client requests via IP or uses an outdated protocol. If you need to report an issue, provide a packet capture file of the TLS handshake., context: ssl_certificate_by_lua*, client: 109.237.98.226, server: 0.0.0.0:9443
How should I optimize and solve this problem?
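One knob that may help with the latency series, if the prometheus plugin in your release supports the default_buckets attribute (please check the plugin documentation for your exact version; the bucket values below are only an example), is to reduce the number of histogram buckets in conf/config.yaml:

plugin_attr:
  prometheus:
    # fewer buckets mean fewer apisix_http_latency series per label combination
    default_buckets:
      - 100
      - 500
      - 1000
      - 5000

Reducing label cardinality (for example the node label discussed above) shrinks the series count in the same way, since every bucket is multiplied by every distinct label combination.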
This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Issue description
When the prometheus global plugin is enabled, the memory of apisix continues to grow abnormally.
Environment
apisix version: 2.7
Steps to reproduce
ab -n 2000000 -c 2000 http://xxxxx
Actual result
memory keeps growing
Error log
*45077733 [lua] prometheus.lua:860: log_error(): Unexpected error adding a key: no memory while logging request,
Expected result