alibaba / Sentinel

A powerful flow control component enabling reliability, resilience and monitoring for microservices. (A high-availability flow control and protection component for cloud-native microservices)
https://sentinelguard.io/
Apache License 2.0
22.34k stars, 8.01k forks

MetricSearcher#findOffset eating 5%~15% CPU #1409

Open starry-eyed-art opened 4 years ago

starry-eyed-art commented 4 years ago

Issue Description

MetricSearcher#findOffset consumes 5%~15% CPU because of DataInputStream#readLong. (Two screenshots were attached: image-20200417170907416, image-20200417170544934.)
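As a rough illustration of why `DataInputStream#readLong` can be expensive: `readLong` pulls 8 bytes from the underlying stream, and when that stream is an unbuffered `FileInputStream`, every call turns into a read against the OS. The sketch below is not Sentinel's code. It assumes a hypothetical index file made of `(second, offset)` pairs of longs, and the class and method names (`IndexScanSketch`, `writeIndex`, `findOffset`) are invented for illustration. It runs the same linear scan with and without a `BufferedInputStream`, so the two can be timed against each other.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch; the file layout and names are assumptions, not Sentinel's implementation. */
class IndexScanSketch {

    /** Writes `count` (second, offset) pairs; the offset is simply second * 100 (made-up data). */
    static File writeIndex(long firstSecond, int count) {
        try {
            File f = File.createTempFile("metrics", ".idx");
            f.deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(f)))) {
                for (int i = 0; i < count; i++) {
                    long second = firstSecond + i;
                    out.writeLong(second);
                    out.writeLong(second * 100L);
                }
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Linear scan for the first entry with second >= beginSecond; returns its offset, or -1 if none. */
    static long findOffset(File idx, long beginSecond, boolean buffered) {
        try (FileInputStream raw = new FileInputStream(idx);
             DataInputStream in = new DataInputStream(
                     buffered ? new BufferedInputStream(raw, 64 * 1024) : raw)) {
            while (true) {
                long second = in.readLong();   // unbuffered: each call reads from the file directly
                long offset = in.readLong();
                if (second >= beginSecond) {
                    return offset;
                }
            }
        } catch (EOFException e) {
            return -1L;                        // ran off the end without a match
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File idx = writeIndex(1_000L, 200_000);
        for (boolean buffered : new boolean[] {false, true}) {
            long start = System.nanoTime();
            long offset = findOffset(idx, 1_000L + 199_999, buffered);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("buffered=%b offset=%d took %d ms%n", buffered, offset, millis);
        }
    }
}
```

Both variants return the same offset; only the number of reads that reach the operating system differs. Whether buffering alone would remove the CPU cost reported here is untested and would need profiling against the real code.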

bug report

Describe what happened

When MetricSearcher is running, it consumes 5% to 15% of the CPU. Low-traffic applications are unaffected, but some higher-traffic applications hit the problem intermittently, for example at around 200,000 Sentinel-guarded requests per minute. Some of our applications exceed 40,000 QPS on a single machine, and on those the problem is persistent.

Describe what you expected to happen

We already use Sentinel at large scale in production and maintain an internal version based on it. Could the maintainers optimize this? A single thread consuming this much CPU is hard to accept.

How to reproduce it (as minimally and precisely as possible)

Analysis tools: top + jstack + IBM Thread and Monitor Dump Analyzer

Tell us your environment

Sentinel V1.6.3

hezhaoye commented 2 years ago

Do the maintainers have a plan to optimize this?

sczyh30 commented 2 years ago

The community is welcome to help analyze and optimize this, ideally backed by some profiler data.

justlau commented 1 year ago

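Do the maintainers have a plan to optimize this?

One option is to collect the monitoring data some other way, for example by writing it to something like Kafka, which would avoid the I/O overhead of persisting metrics to disk.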

linxiaobai commented 3 months ago

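I turned off the server-side collection, and on the machines hosting the client applications CPU usage dropped by 30% and load dropped by nearly 10.

This collection feature makes machine load spike. Once load is high, I/O performance cannot keep up, and the collection threads on the client side end up BLOCKED: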

java.lang.Thread.State: BLOCKED (on object monitor)
        at com.alibaba.csp.sentinel.node.metric.MetricSearcher.findByTimeAndResource(MetricSearcher.java:115)
        - waiting to lock <0x0000000724b9afc0> (a com.alibaba.csp.sentinel.node.metric.MetricSearcher)
        at com.alibaba.csp.sentinel.command.handler.SendMetricCommandHandler.handle(SendMetricCommandHandler.java:80)
        at com.alibaba.csp.sentinel.transport.command.http.HttpEventTask.run(HttpEventTask.java:103)
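
For readers unfamiliar with this thread state: the trace shows a caller waiting for the monitor of a `MetricSearcher` instance, which means another thread is inside a section synchronized on that same object. The sketch below reproduces the pattern with a hypothetical `SlowSearcher` stand-in (not Sentinel code; all names are invented). One thread holds the monitor while simulating slow disk work, and a second caller is then observed in `Thread.State.BLOCKED`.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical sketch of monitor contention; not Sentinel's implementation. */
class BlockedReaderSketch {

    /** Stand-in for a searcher whose lookup is synchronized on the instance. */
    static class SlowSearcher {
        synchronized void find(CountDownLatch entered, CountDownLatch release) {
            entered.countDown();               // signal that the monitor is now held
            try {
                release.await();               // simulate slow I/O while holding the monitor
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Runs two callers against one searcher and returns the state observed for the second. */
    static Thread.State observeSecondCaller() {
        SlowSearcher searcher = new SlowSearcher();
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread first = new Thread(() -> searcher.find(entered, release), "collector-1");
        Thread second = new Thread(
                () -> searcher.find(new CountDownLatch(1), release), "collector-2");
        first.setDaemon(true);
        second.setDaemon(true);
        try {
            first.start();
            entered.await();                   // first caller owns the monitor from here on
            second.start();
            Thread.State state = second.getState();
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(1);               // poll until the second caller reaches the monitor
                state = second.getState();
            }
            release.countDown();               // let both callers finish
            first.join();
            second.join();
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second caller state: " + observeSecondCaller());
    }
}
```

The longer each lookup holds the monitor, the longer every other caller waits, which is consistent with the report that blocked collection threads pile up once disk I/O slows down.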

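Our client SDK now also collects this data through metrics. By comparison, having the server collect it by reading from disk through the endpoint is not a good approach; it uses too many resources.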