Closed wood-zhang closed 6 months ago
The metrics will be flushed (eventually) to shared memory, and the shared memory is a sized LRU cache (i.e. eviction happens when the cache is full), which is not counted toward nginx's memory, so there is no worry about OOM. The only point of concern is the lookup cache for each metric instance, but that is also size-limited in the latest version.
Which version of APISIX are you using? Please show your configuration.
version: 3.3.0

```yaml
apisix:                              # universal configurations
  node_listen:                       # APISIX listening ports
    - port: 9080
      enable_http2: false
    - port: 9081
      enable_http2: true
  enable_heartbeat: true
  enable_admin: true
  enable_admin_cors: true
  enable_debug: false
  enable_dev_mode: false             # Sets nginx worker_processes to 1 if set to true
  enable_reuseport: true             # Enable nginx SO_REUSEPORT switch if set to true.
  enable_ipv6: true                  # Enable nginx IPv6 resolver
  enable_server_tokens: false        # Whether the APISIX version number should be shown in the Server header
  # proxy_protocol:                  # Proxy Protocol configuration
  #   listen_http_port: 9181         # The port with proxy protocol for http; it differs from node_listen and admin_listen.
  #                                  # This port can only receive http requests with proxy protocol, while node_listen
  #                                  # and admin_listen can only receive plain http requests. If you enable proxy
  #                                  # protocol, you must use this port to receive http requests with proxy protocol.
  #   listen_https_port: 9182        # The port with proxy protocol for https
  #   enable_tcp_pp: true            # Enable the proxy protocol for tcp proxy; it works for the stream_proxy.tcp option
  #   enable_tcp_pp_to_upstream: true  # Enables the proxy protocol to the upstream server
  proxy_cache:                       # Proxy Caching configuration
    cache_ttl: 10s                   # The default caching time if the upstream does not specify the cache time
    zones:                           # The parameters of a cache
      - name: disk_cache_one         # The name of the cache; the administrator can specify
                                     # which cache to use by name in the Admin API
        memory_size: 50m             # The size of shared memory; it is used to store the cache index
        disk_size: 1G                # The size of disk; it is used to store the cache data
        disk_path: "/tmp/disk_cache_one"  # The path to store the cache data
        cache_levels: "1:2"          # The hierarchy levels of a cache
      # - name: disk_cache_two
      #   memory_size: 50m
      #   disk_size: 1G
      #   disk_path: "/tmp/disk_cache_two"
      #   cache_levels: "1:2"
  router:
    http: radixtree_uri              # radixtree_uri: match route by uri (based on radixtree)
                                     # radixtree_host_uri: match route by host + uri (based on radixtree)
                                     # radixtree_uri_with_parameter: match route by uri with parameters
    ssl: 'radixtree_sni'             # radixtree_sni: match route by SNI (based on radixtree)
  stream_proxy:                      # TCP/UDP proxy
    only: false
    tcp:                             # TCP proxy port list
      - 8001
  # dns_resolver:
  #   - 127.0.0.1
  #   - 172.20.0.10
  #   - 114.114.114.114
  #   - 223.5.5.5
  #   - 1.1.1.1
  #   - 8.8.8.8
  dns_resolver_valid: 30
  resolver_timeout: 5
  ssl:
    enable: true
    listen:
      - port: 9443
        enable_http2: true
    ssl_protocols: "TLSv1.2 TLSv1.3"
    ssl_ciphers: "xxxxx"
    ssl_trusted_certificate: "/etcd-ssl/ca.pem"

nginx_config:                        # config for rendering the template to generate nginx.conf
  http_server_configuration_snippet: |
    proxy_ignore_client_abort on;
  error_log: "/dev/stderr"
  error_log_level: "error"           # warn, error
  worker_processes: "8"
  enable_cpu_affinity: true
  worker_rlimit_nofile: 102400       # the number of files a worker process can open; should be larger than worker_connections
  event:
    worker_connections: 65535
  http:
    enable_access_log: true
    access_log: "/dev/stdout"
    access_log_format: '{\"timestamp\":\"$time_iso8601\",\"server_addr\":\"$server_addr\",\"remote_addr\":\"$remote_addr\",\"remote_port\":\"$realip_remote_port\",\"all_cookie\":\"$http_cookie\",\"http_host\":\"$http_host\",\"query_string\":\"$query_string\",\"request_method\":\"$request_method\",\"uri\":\"$uri\",\"service\":\"apisix_backend\",\"request_uri\":\"$request_uri\",\"status\":\"$status\",\"body_bytes_sent\":\"$body_bytes_sent\",\"request_time\":\"$request_time\",\"upstream_response_time\":\"$upstream_response_time\",\"upstream_addr\":\"$upstream_addr\",\"upstream_status\":\"$upstream_status\",\"http_referer\":\"$http_referer\",\"http_user_agent\":\"$http_user_agent\",\"http_x_forwarded_for\":\"$http_x_forwarded_for\",\"spanId\":\"$http_X_B3_SpanId\",\"http_token\":\"$http_token\",\"http_authorizationv2\":\"$http_authorizationv2\",\"content-type\":\"$content_type\",\"content-length\":\"$content_length\",\"traceId\":\"$http_X_B3_TraceId\"}'
    access_log_format_escape: json
    lua_shared_dict:
      prometheus-metrics: 800m
      discovery: 300m
      kubernetes: 200m
    keepalive_timeout: 60s           # timeout during which a keep-alive client connection stays open on the server side
    client_header_timeout: 60s       # timeout for reading the client request header, after which 408 (Request Time-out) is returned
    client_body_timeout: 60s         # timeout for reading the client request body, after which 408 (Request Time-out) is returned
    send_timeout: 10s                # timeout for transmitting a response to the client, after which the connection is closed
    underscores_in_headers: "on"     # enables the use of underscores in client request header fields
    real_ip_header: "X-Forwarded-For"  # http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_header
    real_ip_recursive: on            # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
    # real_ip_from:                  # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
    #   - 127.0.0.1
    #   - 'unix:'
    real_ip_from:
      - 127.0.0.1/24
      - 'unix:'
      - 10.28.0.0/14
      - 10.32.0.0/17

discovery:
  kubernetes:
    client:
      token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    service:
      host: ${KUBERNETES_SERVICE_HOST}
      port: ${KUBERNETES_SERVICE_PORT}
      schema: https

plugins:                             # plugin list
  - api-breaker
  - authz-keycloak
  - basic-auth
  - batch-requests
  - consumer-restriction
  - cors
  - client-control
  - echo
  - fault-injection
  - file-logger
  - grpc-transcode
  - grpc-web
  - hmac-auth
  - http-logger
  - ip-restriction
  - ua-restriction
  - jwt-auth
  - kafka-logger
  - key-auth
  - limit-conn
  - limit-count
  - limit-req
  - node-status
  - openid-connect
  - authz-casbin
  - prometheus
  - proxy-cache
  - proxy-mirror
  - proxy-rewrite
  - redirect
  - referer-restriction
  - request-id
  - request-validation
  - response-rewrite
  - serverless-post-function
  - serverless-pre-function
  - sls-logger
  - syslog
  - tcp-logger
  - udp-logger
  - uri-blocker
  - wolf-rbac
  - zipkin
  - traffic-split
  - gzip
  - real-ip
  - ext-plugin-pre-req
  - ext-plugin-post-req

stream_plugins:
  - mqtt-proxy
  - ip-restriction
  - limit-conn

plugin_attr:
  prometheus:
    enable_export_server: true
    export_addr:
      ip: 0.0.0.0
      port: 9091
    export_uri: /apisix/prometheus/metrics
    metric_prefix: apisix_

deployment:
  role: traditional
  role_traditional:
    config_provider: etcd
  admin:
    allow_admin:                     # http://nginx.org/en/docs/http/ngx_http_access_module.html#allow
      - 127.0.0.1/24
      - 172.16.174.0/24
      # - "::/64"
    admin_listen:
      ip: 0.0.0.0
      port: 9180
    # Default token used when calling the Admin API.
    # *NOTE*: It is highly recommended to modify this value to protect APISIX's Admin API.
    # Disabling this configuration item means that the Admin API does not
    # require any authentication.
    admin_key:
      - name: "admin"                # admin: can do everything with configuration data
        key: xxxxx
        role: admin
      - name: "viewer"               # viewer: can only view configuration data
        key: xxxxx
        role: viewer
    https_admin: false
    admin_api_mtls:
      admin_ssl_ca_cert: "/etcd-ssl/ca.pem"
      admin_ssl_cert: "/etcd-ssl/etcd.pem"
      admin_ssl_cert_key: "/etcd-ssl/etcd-key.pem"
  etcd:
    host:                            # it is possible to define multiple etcd host addresses of the same etcd cluster
      - "https://xx.xx:2379"
    prefix: "/apisix"                # configuration prefix in etcd
    timeout: 30                      # 30 seconds
    tls:
      ssl_trusted_certificate: "/etcd-ssl/ca.pem"
      cert: "/etcd-ssl/etcd.pem"
      key: "/etcd-ssl/etcd-key.pem"
      verify: true
      sni: "xxx.com"
```
It is obvious that there are problems with the mechanism of this Prometheus exporter, which can be seen from four aspects:

- When there are more routes and upstreams, the metrics data grows exponentially.
- Because APISIX metrics only increase and never decrease, historical data keeps accumulating.
- Although there is an LRU mechanism that keeps the Prometheus Lua shared memory within the configured size, it is not a fundamental solution. Once LRU eviction is triggered, `Metrics Error` keeps increasing, whereas we would like `Metrics Error` to help us identify real issues.
- Although the new version of APISIX has moved the Prometheus metrics exporter server into a privileged process, reducing P100 latency issues, the ever-growing metrics still put significant pressure on the privileged process. Taking our production environment as an example, we have 150k metric data points; each time Prometheus scrapes, one nginx worker's CPU usage reaches 100% for about 5-10 seconds.

In fact, nginx-lua-prometheus provides `counter:del()` and `gauge:del()` methods to delete labels, so the APISIX Prometheus plugin may need to delete Prometheus metric data at certain times.

Our current approach is similar, but more aggressive: we retain only type-level and route-level data and remove everything else.

before:

```lua
metrics.latency = prometheus:histogram("http_latency",
    "HTTP request latency in milliseconds per service in APISIX",
    {"type", "route", "service", "consumer", "node", unpack(extra_labels("http_latency"))},
    buckets)
```

after:

```lua
metrics.latency = prometheus:histogram("http_latency",
    "HTTP request latency in milliseconds per service in APISIX",
    {"type", "route", unpack(extra_labels("http_latency"))},
    buckets)
```
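For reference, the `del()` API mentioned above is part of nginx-lua-prometheus; a minimal sketch of how a stale series could be dropped (the shared dict name, metric name, and label values here are illustrative, not APISIX's actual ones):

```lua
-- Sketch: removing a stale label combination with nginx-lua-prometheus.
-- Runs in an OpenResty init context; "prometheus_metrics" is an assumed
-- lua_shared_dict name.
local prometheus = require("prometheus").init("prometheus_metrics")

local requests = prometheus:counter("http_requests_total",
    "Number of HTTP requests", {"route", "node"})

requests:inc(1, {"route-1", "10.0.0.1:80"})

-- Later, when node 10.0.0.1:80 disappears from the upstream, its series
-- can be removed instead of accumulating forever:
requests:del({"route-1", "10.0.0.1:80"})
```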
@hansedong well said 👍

Retaining only `type`- and `route`-level data is not a universal implementation, and other users may not accept it. We are trying to find a general proposal, for example: set the TTL of these Prometheus metrics in the LRU cache to 10 minutes (adjustable, of course; this is just an example), and then this memory issue can be solved. What do you think?
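A rough sketch of that TTL idea using `resty.lrucache`, whose `set()` already accepts a per-entry TTL (the cache capacity and the 10-minute TTL are just the example values from above, not a proposed default):

```lua
-- Sketch: a lookup cache whose entries expire after a TTL, so metric
-- entries that stop being updated are eventually dropped.
local lrucache = require("resty.lrucache")

local METRICS_TTL = 600                      -- 10 minutes, adjustable
local cache = assert(lrucache.new(10000))    -- illustrative capacity

local function touch_metric(key, value)
    -- refreshing on every update keeps live metrics alive,
    -- while idle entries expire and get evicted
    cache:set(key, value, METRICS_TTL)
end

local function get_metric(key)
    return cache:get(key)                    -- nil once the TTL has passed
end
```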
This is a good idea. As I understand it, the TTL mechanism can preserve the data of metrics that are updated regularly while allowing expired metrics to be deleted. If this feature is released, I am willing to do some testing.
Hi, this case has been reproduced with test cases; please take a look. At the moment, test case 1 does not pass, while test case 2 does.
I think this TTL solution is a bit troublesome, because the upstream Prometheus library does not accept a TTL parameter when setting a value. If you want the TTL solution, you need to change the upstream library.

More importantly, latency is a histogram data type, and a TTL cannot be used to automatically reclaim its resources. For example, with a TTL of 1h, if an upstream has not been accessed within that hour, the latency series for that node would be cleared, and the summary data corresponding to the latency would be inaccurate the next time the node is accessed.
I have inspected this and found that Kong appears to have the same problem.
@Sn0rt
- I feel that if we ignore this problem, the issue of metrics data growing without bound remains unsolved. If the metrics grow too large, the privileged worker still spends heavy CPU time computing them. Before APISIX moved the metrics computation into a dedicated privileged worker, our online environment had to restart APISIX every once in a while.
- If it is not easy to introduce the TTL mechanism into current APISIX, would it be possible to dynamically delete the relevant metrics when an upstream or route is deleted?
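The second suggestion could look roughly like this, assuming a hook is invoked when a route is removed (the hook and the `seen_labels` bookkeeping are hypothetical, not an existing APISIX API):

```lua
-- Sketch: event-driven cleanup of a route's series.
-- `metrics` holds the collectors created via prometheus:counter()/histogram(),
-- and seen_labels[route_id] is assumed bookkeeping of the label
-- combinations actually used for that route.
local function on_route_deleted(route_id)
    for _, label_values in ipairs(seen_labels[route_id] or {}) do
        -- del() takes the same label values used with inc()/observe()
        metrics.status:del(label_values)
    end
    seen_labels[route_id] = nil
end
```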
Thank you for your continued attention. After discussing with @membphis, I found that my previous understanding of the metrics was wrong.

We use the TTL scheme to recycle metrics that have not been updated for a long time, which has no impact on Grafana's display.
Do we have a plan for TTL?
APISIX uses the knyar/nginx-lua-prometheus library to set the metrics. The TTL solution would be better if it were supported by the underlying library.

This is currently being discussed with the maintainer of knyar/nginx-lua-prometheus (https://github.com/knyar/nginx-lua-prometheus/issues/164); in any case, this issue is already being advanced.
I tried it and did some simple testing; it looks good.
Judging from your implementation, you did not implement the TTL in the underlying knyar/nginx-lua-prometheus library, right? Instead, it traverses the shared memory used by Prometheus and sets `exptime` to achieve it?
I'll re-evaluate; the fault report from my earlier investigation was wrong.

First, it is true that repeated changes to the upstream will produce new metrics, but these metrics are all placed in shared memory. If the shared dict is full, entries are automatically evicted according to LRU.

Second, there is no evidence that the OOM is caused by Prometheus.

From the ngx.shared.DICT.set documentation:

```
syntax: success, err, forcible = ngx.shared.DICT:set(key, value, exptime?, flags?)
context: init_by_lua*, set_by_lua*, rewrite_by_lua*, access_by_lua*, content_by_lua*,
         header_filter_by_lua*, body_filter_by_lua*, log_by_lua*, ngx.timer.*,
         balancer_by_lua*, ssl_certificate_by_lua*, ssl_session_fetch_by_lua*,
         ssl_session_store_by_lua*, ssl_client_hello_by_lua*
```

Unconditionally sets a key-value pair into the shm-based dictionary ngx.shared.DICT. Returns three values:

- success: boolean value indicating whether the key-value pair was stored.
- err: textual error message, can be "no memory".
- forcible: boolean value indicating whether other valid items were forcibly removed when the shared memory zone ran out of storage.
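Expiry via `exptime` would look like this against a shared dict (the dict name matches the `lua_shared_dict` in the config above; the key and the 600-second TTL are illustrative):

```lua
-- Sketch: store a metric entry with a 600-second expiry so idle entries
-- vanish from the shared dict without an explicit delete.
local dict = ngx.shared["prometheus-metrics"]

local ok, err, forcible = dict:set("http_status{code=200,route=r1}", 1, 600)
if not ok then
    ngx.log(ngx.ERR, "failed to set metric: ", err)
end
if forcible then
    -- valid items were evicted to make room: the zone is under memory pressure
    ngx.log(ngx.WARN, "prometheus-metrics shared dict is running out of memory")
end
```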
@liugang594 @hansedong
@hansedong The TTL feature is merged; would you like to do some testing?
Yes, I'd love to. I plan to upgrade one APISIX gateway in a microservice scenario to test the effect of the new feature.
@moonming
@moonming Hello, this problem has been fixed in version 3.9.0. When will it be updated in version 3.2.2?
No, we will keep new features and bug fixes in the master branch.
If it is only fixed in the new version, then for a long-term support version, how should we solve this kind of problem that affects production stability? Should a backport be considered?
APISIX 3.9 still has the issue of the `http_status` / `upstream_status` keys not being updated. When can this be resolved?
Current Behavior
If the prometheus plugin is enabled, and the upstream uses Kubernetes service discovery or the upstream IPs change with each release, APISIX generates too many monitoring keys, so memory keeps growing; without restarting APISIX, it eventually OOMs.
Expected Behavior
I expect an automatic detection mechanism that, when upstream IPs change, releases the keys of node entries in the in-memory metrics that no longer exist.
Error Logs
No response
Steps to Reproduce
I worked around the problem of excessive keys caused by upstream IP changes during releases by disabling the node dimension in the metrics.
Environment
- APISIX version (run `apisix version`):
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`):
- LuaRocks version, for installation issues (run `luarocks --version`):