apache / apisix

The Cloud-Native API Gateway

help request: 3.2.1 memory leak #10618

Open wklken opened 9 months ago

wklken commented 9 months ago

Description

After being deployed online for 2 weeks, we rescheduled the pods and got the chart below: memory grew from 3.7 GB to 6 GB.

We have no ext-plugins.

About 45,000 routes.

[chart: pod memory usage rising from 3.7 GB to 6 GB]


I suspect it is caused by the prometheus plugin, but once all routes have received traffic, shouldn't the set of keys in the prometheus metrics be stable?

Is there any tool to analyze this? We don't have XRay.

Environment

wklken commented 9 months ago

The /metrics response data after being online for about two weeks (number of series per metric name):

16098 api_request_duration_milliseconds_bucket
3674 api_request_duration_milliseconds_count
3674 api_request_duration_milliseconds_sum
4990 api_requests_total
7948 bapp_requests_total
12846 bandwidth
11 etcd_modify_indexes
1 etcd_reachable 1
1 http_requests_total 460517663
6 nginx_http_current_connections
1 nginx_metric_errors_total 0
1 node_info
24 shared_dict_capacity_bytes
24 shared_dict_free_space_bytes
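
For reference, a per-metric series count like the one above can be produced with something along these lines; this is only a sketch, assuming the default prometheus export address, which may differ in a patched deployment:

# Count exported time series per metric family, which is what the list above shows.
curl -s http://127.0.0.1:9091/apisix/prometheus/metrics \
  | grep -v '^#' \
  | sed 's/[{ ].*//' \
  | sort | uniq -c | sort -rn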
boekkooi-lengoo commented 9 months ago

Hey @wklken

I have noticed a similar issue and was able to resolve it by forcing the consumer name in the metrics to always be empty.

I use the following in my Dockerfile to patch the issue:

# Patch https://github.com/apache/apisix/blob/3.7.0/apisix/plugins/prometheus/exporter.lua#L228 to avoid metrics per consumer.
RUN sed -i \
    -e 's/ctx.consumer_name or ""/""/g' \
    /usr/local/apisix/apisix/plugins/prometheus/exporter.lua

Hope this helps.
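
A hedged way to verify the patch actually landed in a rebuilt image; the image name below is a placeholder:

# Hypothetical sanity check: no output means the patched pattern is gone.
docker run --rm my-patched-apisix \
  grep -n 'ctx.consumer_name or ""' /usr/local/apisix/apisix/plugins/prometheus/exporter.lua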

wklken commented 9 months ago

Thanks @boekkooi-lengoo

We have patched some settings to disable the official metrics, which would cause 100% CPU usage when too many records are present.

https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/001_change_prometheus_default_buckets.patch

Currently only the bandwidth metric is left.


I’m not certain whether the increasing memory usage is caused by the Prometheus plugin or not, nor do I understand why it is consuming so much memory.

monkeyDluffy6017 commented 8 months ago

@wklken have you solved your problem?

wklken commented 8 months ago
[chart: memory usage over time]

Not yet; we are waiting for the line (memory usage) to stabilize (about 4 weeks). If it does not show an increase, then perhaps the Prometheus plugin is the cause. Otherwise, we will need to investigate other plugins.

Any advice or tools for measuring the memory usage of each part of APISIX?

@monkeyDluffy6017

monkeyDluffy6017 commented 8 months ago

Please check whether the memory leak happens in Lua or in C:

curl http://127.0.0.1:9180/apisix/admin/routes/test \
  -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
    "uri": "/lua_memory_stat",
    "plugins": {
        "serverless-pre-function": {
            "phase": "rewrite",
            "functions" : ["return function() local mem = collectgarbage(\"count\") ngx.say(\"the memory allocated by lua is \", mem, \" kb\"); end"]
        }
    }
}'
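
A possible way to use that diagnostic route once it is created, assuming the default data-plane port 9080; the ps comparison is a generic supplement, not an APISIX-specific command:

# Hit the route; the serverless function prints the Lua VM heap size.
curl http://127.0.0.1:9080/lua_memory_stat
# e.g. "the memory allocated by lua is 2048.33 kb"

# Compare with the nginx worker resident memory over time: if RSS keeps growing
# while the Lua number stays flat, the growth is outside the Lua heap
# (shared dicts or C-level allocations).
ps -o pid,rss,cmd -C nginx
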
wklken commented 8 months ago

@monkeyDluffy6017

Vacant2333 commented 8 months ago

You can assign the issue to me; I will follow up on it.

Vacant2333 commented 8 months ago

Do you think it's possible that this is the issue? Would you try this method to resolve it? @wklken

theweakgod commented 8 months ago

@wklken I think this will help you: #9545 (nginx-lua-prometheus memory leak fix). You can solve this problem by upgrading the version.

wklken commented 8 months ago

Do you think it's possible that this is the issue? Would you try this method to resolve it? @wklken

@Vacant2333 we don't use service discovery in production.

wklken commented 8 months ago

@wklken I think this will help you: https://github.com/apache/apisix/pull/9545 (nginx-lua-prometheus memory leak fix, https://github.com/knyar/nginx-lua-prometheus/pull/151). You can solve this problem by upgrading the version.

Thanks @theweakgod, I will check that. (APISIX 3.2.1 uses nginx-lua-prometheus = 0.20220527.)
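
A hedged way to confirm which copy of nginx-lua-prometheus a given installation actually bundles; the paths below are assumptions and differ between source builds and docker images:

# Locate the vendored library and the dependency pin, if those paths exist.
find /usr/local/apisix -name 'prometheus.lua' 2>/dev/null
grep -rn 'nginx-lua-prometheus' /usr/local/apisix/rockspec/ 2>/dev/null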

Vacant2333 commented 8 months ago

@wklken I think this will help you: https://github.com/apache/apisix/pull/9545 (nginx-lua-prometheus memory leak fix, https://github.com/knyar/nginx-lua-prometheus/pull/151). You can solve this problem by upgrading the version.

Thanks @theweakgod, I will check that. (APISIX 3.2.1 uses nginx-lua-prometheus = 0.20220527.)

It does seem to be one of the reasons. Is it possible to test this possibility (upgrade nginx-lua-prometheus) and see whether the memory continues to grow? How long will this take?

theweakgod commented 8 months ago

@wklken I think this will help you: https://github.com/apache/apisix/pull/9545 (nginx-lua-prometheus memory leak fix, https://github.com/knyar/nginx-lua-prometheus/pull/151). You can solve this problem by upgrading the version.

Thanks @theweakgod, I will check that. (APISIX 3.2.1 uses nginx-lua-prometheus = 0.20220527.)

Has the problem been solved?

wenj91 commented 7 months ago

@wklken Any progress on this issue?

wklken commented 7 months ago

@wklken Any progress on this issue?

We cannot verify this in production, and we don't have an equivalent environment to test in for now. We need to find a way to reproduce production-level traffic and keep it under load for a while (I may not have time to work on this in the near term; I will update this issue once it has been verified).

theweakgod commented 6 months ago

@wklken Has the problem been solved?

wenj91 commented 6 months ago

@wklken Any progress on this issue?

We cannot verify this in production, and we don't have an equivalent environment to test in for now. We need to find a way to reproduce production-level traffic and keep it under load for a while (I may not have time to work on this in the near term; I will update this issue once it has been verified).

A clue: this phenomenon is especially likely to appear on the image-upload and file-upload endpoints.

wklken commented 6 months ago
[chart: memory usage after the rolling update]

We rolled out another release, and the memory didn't increase after about 1 week.


@theweakgod I still can't reproduce the memory increase on my own cluster yet; will try again later.

theweakgod commented 6 months ago

@theweakgod I still can't reproduce the memory increase on my own cluster yet; will try again later.

👌

theweakgod commented 6 months ago

@theweakgod I still can't reproduce the memory increase.

You would need a huge number of metrics.
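
One way to build up that kind of cardinality in a test cluster might be a loop like the sketch below; route ids, admin key, and upstream address are placeholders, and each route also needs real traffic before its series show up in /metrics:

# Hypothetical sketch: create many routes with the prometheus plugin enabled so the
# exporter has to track many label combinations (route id is one of the labels).
for i in $(seq 1 45000); do
  curl -s -o /dev/null "http://127.0.0.1:9180/apisix/admin/routes/bench-$i" \
    -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '{
      "uri": "/bench-'"$i"'",
      "plugins": { "prometheus": {} },
      "upstream": { "type": "roundrobin", "nodes": { "127.0.0.1:1980": 1 } }
    }'
done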

wklken commented 6 months ago

We rolled out another release, and the memory didn't increase after about 1 week.

@theweakgod I still can't reproduce the memory increase on my own cluster yet; will try again later.

But from the provided chart:

[chart: memory usage after the rolling update]

If the pull request "bugfix: limit lookup table size" is effective, the memory usage should not exceed 5.59 GB and should only keep increasing for no more than 7 days.