Closed wood-zhang closed 6 months ago
The metrics will be flushed (eventually) to shared memory, and the shared memory is a sized LRU cache (i.e. eviction happens when the cache is full), which is not counted toward nginx's memory, so there is no worry about OOM. The only point of concern is the lookup cache for each metric instance, but that is also size-limited in the latest version.
Which version of APISIX are you using? Please show your configuration.
version: 3.3.0

```yaml
apisix:                              # universal configurations
  node_listen:                       # APISIX listening ports
    - port: 9080
      enable_http2: false
    - port: 9081
      enable_http2: true
  enable_heartbeat: true
  enable_admin: true
  enable_admin_cors: true
  enable_debug: false
  enable_dev_mode: false             # Sets nginx worker_processes to 1 if set to true
  enable_reuseport: true             # Enable nginx SO_REUSEPORT switch if set to true.
  enable_ipv6: true                  # Enable nginx IPv6 resolver
  enable_server_tokens: false        # Whether the APISIX version number should be shown in the Server header
  # proxy_protocol:                  # Proxy Protocol configuration
  #   listen_http_port: 9181         # The port with proxy protocol for http; it differs from node_listen and admin_listen.
  #                                  # This port can only receive http requests with proxy protocol, while node_listen
  #                                  # and admin_listen can only receive plain http requests. If you enable proxy
  #                                  # protocol, you must use this port to receive http requests with proxy protocol.
  #   listen_https_port: 9182        # The port with proxy protocol for https
  #   enable_tcp_pp: true            # Enable the proxy protocol for tcp proxy; it works for the stream_proxy.tcp option
  #   enable_tcp_pp_to_upstream: true  # Enables the proxy protocol to the upstream server
  proxy_cache:                       # Proxy Caching configuration
    cache_ttl: 10s                   # The default caching time if the upstream does not specify the cache time
    zones:                           # The parameters of a cache
      - name: disk_cache_one         # The name of the cache; the administrator can specify
                                     # which cache to use by name in the Admin API
        memory_size: 50m             # The size of shared memory; it is used to store the cache index
        disk_size: 1G                # The size of disk; it is used to store the cache data
        disk_path: "/tmp/disk_cache_one"  # The path to store the cache data
        cache_levels: "1:2"          # The hierarchy levels of a cache
      # - name: disk_cache_two
      #   memory_size: 50m
      #   disk_size: 1G
      #   disk_path: "/tmp/disk_cache_two"
      #   cache_levels: "1:2"
  router:
    http: radixtree_uri              # radixtree_uri: match route by uri (based on radixtree)
                                     # radixtree_host_uri: match route by host + uri (based on radixtree)
                                     # radixtree_uri_with_parameter: match route by uri with parameters
    ssl: 'radixtree_sni'             # radixtree_sni: match route by SNI (based on radixtree)
  stream_proxy:                      # TCP/UDP proxy
    only: false
    tcp:                             # TCP proxy port list
      - 8001
  # dns_resolver:
  #   - 127.0.0.1
  #   - 172.20.0.10
  #   - 114.114.114.114
  #   - 223.5.5.5
  #   - 1.1.1.1
  #   - 8.8.8.8
  dns_resolver_valid: 30
  resolver_timeout: 5
  ssl:
    enable: true
    listen:
      - port: 9443
        enable_http2: true
    ssl_protocols: "TLSv1.2 TLSv1.3"
    ssl_ciphers: "xxxxx"
    ssl_trusted_certificate: "/etcd-ssl/ca.pem"

nginx_config:                        # config for rendering the template to generate nginx.conf
  http_server_configuration_snippet: |
    proxy_ignore_client_abort on;
  error_log: "/dev/stderr"
  error_log_level: "error"           # warn, error
  worker_processes: "8"
  enable_cpu_affinity: true
  worker_rlimit_nofile: 102400       # the number of files a worker process can open; should be larger than worker_connections
  event:
    worker_connections: 65535
  http:
    enable_access_log: true
    access_log: "/dev/stdout"
    access_log_format: '{\"timestamp\":\"$time_iso8601\",\"server_addr\":\"$server_addr\",\"remote_addr\":\"$remote_addr\",\"remote_port\":\"$realip_remote_port\",\"all_cookie\":\"$http_cookie\",\"http_host\":\"$http_host\",\"query_string\":\"$query_string\",\"request_method\":\"$request_method\",\"uri\":\"$uri\",\"service\":\"apisix_backend\",\"request_uri\":\"$request_uri\",\"status\":\"$status\",\"body_bytes_sent\":\"$body_bytes_sent\",\"request_time\":\"$request_time\",\"upstream_response_time\":\"$upstream_response_time\",\"upstream_addr\":\"$upstream_addr\",\"upstream_status\":\"$upstream_status\",\"http_referer\":\"$http_referer\",\"http_user_agent\":\"$http_user_agent\",\"http_x_forwarded_for\":\"$http_x_forwarded_for\",\"spanId\":\"$http_X_B3_SpanId\",\"http_token\":\"$http_token\",\"http_authorizationv2\":\"$http_authorizationv2\",\"content-type\":\"$content_type\",\"content-length\":\"$content_length\",\"traceId\":\"$http_X_B3_TraceId\"}'
    access_log_format_escape: json
    lua_shared_dict:
      prometheus-metrics: 800m
      discovery: 300m
      kubernetes: 200m
    keepalive_timeout: 60s           # timeout during which a keep-alive client connection stays open on the server side
    client_header_timeout: 60s       # timeout for reading the client request header, after which 408 (Request Time-out) is returned
    client_body_timeout: 60s         # timeout for reading the client request body, after which 408 (Request Time-out) is returned
    send_timeout: 10s                # timeout for transmitting a response to the client, after which the connection is closed
    underscores_in_headers: "on"     # enables the use of underscores in client request header fields
    real_ip_header: "X-Forwarded-For"  # http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_header
    real_ip_recursive: on            # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
    # real_ip_from:                  # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
    #   - 127.0.0.1
    #   - 'unix:'
    real_ip_from:
      - 127.0.0.1/24
      - 'unix:'
      - 10.28.0.0/14
      - 10.32.0.0/17

discovery:
  kubernetes:
    client:
      token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    service:
      host: ${KUBERNETES_SERVICE_HOST}
      port: ${KUBERNETES_SERVICE_PORT}
      schema: https

plugins:                             # plugin list
  - api-breaker
  - authz-keycloak
  - basic-auth
  - batch-requests
  - consumer-restriction
  - cors
  - client-control
  - echo
  - fault-injection
  - file-logger
  - grpc-transcode
  - grpc-web
  - hmac-auth
  - http-logger
  - ip-restriction
  - ua-restriction
  - jwt-auth
  - kafka-logger
  - key-auth
  - limit-conn
  - limit-count
  - limit-req
  - node-status
  - openid-connect
  - authz-casbin
  - prometheus
  - proxy-cache
  - proxy-mirror
  - proxy-rewrite
  - redirect
  - referer-restriction
  - request-id
  - request-validation
  - response-rewrite
  - serverless-post-function
  - serverless-pre-function
  - sls-logger
  - syslog
  - tcp-logger
  - udp-logger
  - uri-blocker
  - wolf-rbac
  - zipkin
  - traffic-split
  - gzip
  - real-ip
  - ext-plugin-pre-req
  - ext-plugin-post-req

stream_plugins:
  - mqtt-proxy
  - ip-restriction
  - limit-conn

plugin_attr:
  prometheus:
    enable_export_server: true
    export_addr:
      ip: 0.0.0.0
      port: 9091
    export_uri: /apisix/prometheus/metrics
    metric_prefix: apisix_

deployment:
  role: traditional
  role_traditional:
    config_provider: etcd
  admin:
    allow_admin:                     # http://nginx.org/en/docs/http/ngx_http_access_module.html#allow
      - 127.0.0.1/24
      - 172.16.174.0/24
      # - "::/64"
    admin_listen:
      ip: 0.0.0.0
      port: 9180
    # Default token used when calling the Admin API.
    # *NOTE*: It is highly recommended to modify this value to protect APISIX's Admin API.
    # Disabling this configuration item means that the Admin API does not
    # require any authentication.
    admin_key:
      - name: "admin"                # admin: can do everything with configuration data
        key: xxxxx
        role: admin
      - name: "viewer"               # viewer: can only view configuration data
        key: xxxxx
        role: viewer
    https_admin: false
    admin_api_mtls:
      admin_ssl_ca_cert: "/etcd-ssl/ca.pem"
      admin_ssl_cert: "/etcd-ssl/etcd.pem"
      admin_ssl_cert_key: "/etcd-ssl/etcd-key.pem"
  etcd:
    host:                            # it is possible to define multiple etcd host addresses of the same etcd cluster
      - "https://xx.xx:2379"
    prefix: "/apisix"                # configuration prefix in etcd
    timeout: 30                      # 30 seconds
    tls:
      ssl_trusted_certificate: "/etcd-ssl/ca.pem"
      cert: "/etcd-ssl/etcd.pem"
      key: "/etcd-ssl/etcd-key.pem"
      verify: true
      sni: "xxx.com"
```
It is obvious that there are problems with the mechanism of this Prometheus exporter, which can be seen from four aspects:

- When there are more routes and upstreams, the metrics data grows exponentially.
- Because APISIX metrics only increase and never decrease, historical data keeps accumulating.
- Although there is an LRU mechanism that keeps the Prometheus Lua shared memory within the configured size, it is not a fundamental solution. Once LRU eviction is triggered, `Metrics Error` keeps increasing, whereas we would like `Metrics Error` to help us identify real issues.
- Although the new version of APISIX has moved the Prometheus metrics exporter server into a privileged process, reducing P100 latency issues, the ever-growing metrics still put significant pressure on the privileged process. Taking our production environment as an example, we have 150k metric data points; each time Prometheus scrapes, one nginx worker's CPU usage reaches 100% for about 5-10 seconds.

In fact, nginx-lua-prometheus provides `counter:del()` and `gauge:del()` methods to delete labels, so the APISIX Prometheus plugin may need to delete Prometheus metric data at certain times.

Our current approach is similar, but more aggressive: we retain only type-level and route-level data and remove everything else.

before:

```lua
metrics.latency = prometheus:histogram("http_latency",
    "HTTP request latency in milliseconds per service in APISIX",
    {"type", "route", "service", "consumer", "node", unpack(extra_labels("http_latency"))},
    buckets)
```

after:

```lua
metrics.latency = prometheus:histogram("http_latency",
    "HTTP request latency in milliseconds per service in APISIX",
    {"type", "route", unpack(extra_labels("http_latency"))},
    buckets)
```
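For reference, the `del()` API mentioned above is part of nginx-lua-prometheus; a minimal sketch of how a stale series could be dropped (the shared dict name, metric name, and label values here are illustrative, not APISIX's actual ones):

```lua
-- Sketch: removing a stale label combination with nginx-lua-prometheus.
-- Runs in an OpenResty init context; "prometheus_metrics" is an assumed
-- lua_shared_dict name.
local prometheus = require("prometheus").init("prometheus_metrics")

local requests = prometheus:counter("http_requests_total",
    "Number of HTTP requests", {"route", "node"})

requests:inc(1, {"route-1", "10.0.0.1:80"})

-- Later, when node 10.0.0.1:80 disappears from the upstream, its series
-- can be removed instead of accumulating forever:
requests:del({"route-1", "10.0.0.1:80"})
```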
@hansedong well said 👍

Retaining only `type`- and `route`-level data is not a universal implementation, and other users may not accept it. We are trying to find a general proposal, for example: set the TTL of these Prometheus metrics in the LRU cache to 10 minutes (adjustable, of course; this is just an example), and then this memory issue can be solved. What do you think?
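A rough sketch of that TTL idea using `resty.lrucache`, whose `set()` already accepts a per-entry TTL (the cache capacity and the 10-minute TTL are just the example values from above, not a proposed default):

```lua
-- Sketch: a lookup cache whose entries expire after a TTL, so metric
-- entries that stop being updated are eventually dropped.
local lrucache = require("resty.lrucache")

local METRICS_TTL = 600                      -- 10 minutes, adjustable
local cache = assert(lrucache.new(10000))    -- illustrative capacity

local function touch_metric(key, value)
    -- refreshing on every update keeps live metrics alive,
    -- while idle entries expire and get evicted
    cache:set(key, value, METRICS_TTL)
end

local function get_metric(key)
    return cache:get(key)                    -- nil once the TTL has passed
end
```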
This is a good idea. As I understand it, the TTL mechanism can preserve the data of metrics that are updated regularly while allowing expired metrics to be deleted. If this feature is released, I am willing to do some testing.
Hi, this case has been reproduced with test cases; please take a look. At the moment, test case 1 does not pass, while test case 2 does.
I think this TTL solution is a bit troublesome, because the upstream Prometheus library does not accept a TTL parameter when setting a value. If you want the TTL solution, you need to change the upstream library.

More importantly, latency is a histogram data type, and a TTL cannot be used to automatically reclaim its resources. For example, with a TTL of 1h, if an upstream has not been accessed within that hour, the latency series for that node would be cleared, and the summary data corresponding to the latency would be inaccurate the next time the node is accessed.
I have inspected this and found that Kong appears to have the same problem.
@Sn0rt
- I feel that if we ignore this problem, the issue of metrics data growing without bound remains unsolved. If the metrics grow too large, the privileged worker still spends heavy CPU time computing them. Before APISIX moved the metrics computation into a dedicated privileged worker, our online environment had to restart APISIX every once in a while.
- If it is not easy to introduce the TTL mechanism into current APISIX, would it be possible to dynamically delete the relevant metrics when an upstream or route is deleted?
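The second suggestion could look roughly like this, assuming a hook is invoked when a route is removed (the hook and the `seen_labels` bookkeeping are hypothetical, not an existing APISIX API):

```lua
-- Sketch: event-driven cleanup of a route's series.
-- `metrics` holds the collectors created via prometheus:counter()/histogram(),
-- and seen_labels[route_id] is assumed bookkeeping of the label
-- combinations actually used for that route.
local function on_route_deleted(route_id)
    for _, label_values in ipairs(seen_labels[route_id] or {}) do
        -- del() takes the same label values used with inc()/observe()
        metrics.status:del(label_values)
    end
    seen_labels[route_id] = nil
end
```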
Thank you for your continued attention. After discussing with @membphis, I found that my previous understanding of the metrics was wrong.

We use the TTL scheme to recycle metrics that have not been updated for a long time, which has no impact on Grafana's display.
Do we have a plan for TTL?
APISIX uses the knyar/nginx-lua-prometheus library to set the metrics. The TTL solution would be better if it were supported by the underlying library.

This is currently being discussed with the maintainer of knyar/nginx-lua-prometheus (https://github.com/knyar/nginx-lua-prometheus/issues/164); in any case, this issue is already being advanced.
I tried it and did some simple testing; it looks good.
Judging from your implementation, you did not implement the TTL in the underlying knyar/nginx-lua-prometheus library, right? Instead, it traverses the shared memory used by Prometheus and sets `exptime` to achieve it?
I'll re-evaluate; the fault report from my earlier investigation was wrong.

First, it is true that repeated changes to the upstream will produce new metrics, but these metrics are all placed in shared memory. If the shared dict is full, entries are automatically evicted according to LRU.

Second, there is no evidence that the OOM is caused by Prometheus.

From the ngx.shared.DICT.set documentation:

```
syntax: success, err, forcible = ngx.shared.DICT:set(key, value, exptime?, flags?)
context: init_by_lua*, set_by_lua*, rewrite_by_lua*, access_by_lua*, content_by_lua*,
         header_filter_by_lua*, body_filter_by_lua*, log_by_lua*, ngx.timer.*,
         balancer_by_lua*, ssl_certificate_by_lua*, ssl_session_fetch_by_lua*,
         ssl_session_store_by_lua*, ssl_client_hello_by_lua*
```

Unconditionally sets a key-value pair into the shm-based dictionary ngx.shared.DICT. Returns three values:

- success: boolean value indicating whether the key-value pair was stored.
- err: textual error message, can be "no memory".
- forcible: boolean value indicating whether other valid items were forcibly removed when the shared memory zone ran out of storage.
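Expiry via `exptime` would look like this against a shared dict (the dict name matches the `lua_shared_dict` in the config above; the key and the 600-second TTL are illustrative):

```lua
-- Sketch: store a metric entry with a 600-second expiry so idle entries
-- vanish from the shared dict without an explicit delete.
local dict = ngx.shared["prometheus-metrics"]

local ok, err, forcible = dict:set("http_status{code=200,route=r1}", 1, 600)
if not ok then
    ngx.log(ngx.ERR, "failed to set metric: ", err)
end
if forcible then
    -- valid items were evicted to make room: the zone is under memory pressure
    ngx.log(ngx.WARN, "prometheus-metrics shared dict is running out of memory")
end
```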
@liugang594 @hansedong
@hansedong The TTL feature is merged; would you like to do some testing?
Yes, I'd love to. I plan to upgrade one APISIX gateway in a microservice scenario to test the effect of the new feature.
@moonming
@moonming Hello, this problem has been fixed in version 3.9.0. When will it be updated in version 3.2.2?
No, we will keep new features and bug fixes in the master branch.
If it is only fixed in the new version, then for a long-term support version, how should we solve this kind of problem that affects production stability? Should a backport be considered?
APISIX 3.9 still has the issue of the `http_status` / `upstream_status` keys not being updated. When can this be resolved?
Current Behavior
If the prometheus plugin is enabled, and the upstream uses Kubernetes service discovery or the upstream IPs change with each release, APISIX generates too many monitoring keys, so memory keeps growing; without restarting APISIX, it eventually OOMs.
Expected Behavior
I expect an automatic detection mechanism that, when upstream IPs change, releases the keys of node entries in the in-memory metrics that no longer exist.
Error Logs
No response
Steps to Reproduce
I worked around the problem of excessive keys caused by upstream IP changes during releases by disabling the node dimension in the metrics.
Environment
- APISIX version (run `apisix version`):
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`):
- LuaRocks version, for installation issues (run `luarocks --version`):