apache / hertzbeat

Apache HertzBeat(incubating) is a real-time monitoring system with agentless, performance cluster, prometheus-compatible, custom monitoring and status page building capabilities.
https://hertzbeat.apache.org/
Apache License 2.0
5.74k stars, 994 forks

[Question] <Alarms interrupted, restart service to restore> #2798

Closed ichenyt closed 3 weeks ago

ichenyt commented 3 weeks ago

Question

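Symptom: I am on version 1.5. While monitoring the HTTP services of more than 300 applications, alerting stops after the system has run for a while, and restarting the project restores it.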

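Investigation: In the run method of DispatchTask in the DispatcherAlarm class, any exception other than InterruptedException is not caught, so the thread exits. From the source, three threads in total run the same DispatchTask, so after three such exceptions the entire alerting function stops.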

Question 1: If all three threads exit, does that mean the remaining alerts in dataQueue can no longer be notified?

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Alert alert = dataQueue.pollAlertsData();
                if (alert != null) {
                    // Determine the alarm type and store it
                    alertStoreHandler.store(alert);
                    sendNotify(alert);
                    if (!Objects.isNull(sentryAlert)) {
                        sentryMessageService.sendToSentry(sentryAlert);
                    }
                }
            } catch (InterruptedException e) {
                log.error("An error occurred in DispatcherAlarm DispatchTask", e);
                Thread.currentThread().interrupt();
            }
        }
    }
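
The failure mode asked about in Question 1 can be reproduced in isolation. The following is a minimal sketch, not HertzBeat's code: `WorkerLoopDemo`, `handle`, and `run` are hypothetical names, and a plain `LinkedBlockingQueue<String>` stands in for `dataQueue`. It shows that an unchecked exception escaping the loop ends the worker thread, so items still in the queue are never processed, while a catch-all that logs and continues keeps the worker alive.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class WorkerLoopDemo {

    /** Hypothetical handler: fails on the alert "bad", succeeds otherwise. */
    static void handle(String alert, AtomicInteger handled) {
        if ("bad".equals(alert)) {
            throw new IllegalStateException("simulated failure while notifying");
        }
        handled.incrementAndGet();
    }

    /** Runs one worker over the alerts "a", "bad", "b" and returns how many were handled. */
    static int run(boolean catchAll) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("a");
        queue.add("bad");
        queue.add("b");
        AtomicInteger handled = new AtomicInteger();

        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String alert = queue.poll(200, TimeUnit.MILLISECONDS);
                    if (alert == null) {
                        return; // queue drained: end of the demo
                    }
                    handle(alert, handled);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (RuntimeException e) {
                    if (!catchAll) {
                        throw e; // mimics the missing catch: the thread dies here
                    }
                    // Log and keep looping so later alerts are still dispatched.
                    System.err.println("alert dispatch failed, continuing: " + e);
                }
            }
        });
        // Suppress the default stack trace printed when the thread dies.
        worker.setUncaughtExceptionHandler((t, e) -> { });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handled.get();
    }

    public static void main(String[] args) {
        System.out.println("without catch-all: " + run(false));
        System.out.println("with catch-all:    " + run(true));
    }
}
```

Without the catch-all, the alert queued behind the failing one is never handled. With it, the failure is logged and the loop continues, which is usually preferable to swallowing the exception silently.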

Question 2: Is the thread pool configured incorrectly? Once all threads are busy, it throws an exception and rejects new tasks.
  workerExecutor = new ThreadPoolExecutor(6,
            10,
            10,
            TimeUnit.SECONDS,
            new SynchronousQueue<>(),
            threadFactory,
            new ThreadPoolExecutor.AbortPolicy());
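
The rejection behaviour raised in Question 2 follows from how `ThreadPoolExecutor` combines a `SynchronousQueue` with `AbortPolicy`. A small sketch under stated assumptions: `RejectionDemo` and `countRejected` are hypothetical names, the custom `threadFactory` is omitted, and the tasks simply block on a latch to keep every thread busy. Because a `SynchronousQueue` holds no tasks, each submission beyond the maximum of 10 busy threads is rejected with `RejectedExecutionException`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class RejectionDemo {

    /** Submits blocking tasks to a pool sized like the one in the question; returns the rejected count. */
    static int countRejected(int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(6, 10, 10, TimeUnit.SECONDS,
                new SynchronousQueue<>(), new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // keep this thread busy until released
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // no idle thread and no queue capacity
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected of 12: " + countRejected(12));
    }
}
```

Whether this matters in practice depends on how many tasks are ever submitted to the pool at once; if the count never exceeds the maximum pool size, no rejection occurs.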

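My current workaround is to change the catch in the run method so that it catches every exception type (without handling the exception), and to change the thread pool configuration as follows: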
  workerExecutor = new ThreadPoolExecutor(
            6,
            10,
            10,
            TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(3000),
            threadFactory,
            new ThreadPoolExecutor.CallerRunsPolicy());

I am not sure whether this introduces other hidden problems; I have not read the source code in depth.
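
One side effect of the modified configuration is worth checking. With a bounded `LinkedBlockingQueue`, `ThreadPoolExecutor` creates threads beyond the core size only once the queue is full, so the pool stays at 6 threads until 3000 tasks are waiting. The sketch below illustrates this with hypothetical names (`QueueGrowthDemo`, `poolSizeAfter`) and latch-blocked tasks; it is an illustration of the executor's documented behaviour, not a statement about how HertzBeat submits work. If the submitted tasks are long-running loops such as DispatchTask, a queued task would wait until a core thread frees up, and with `CallerRunsPolicy` an overflow task runs on the submitting thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class QueueGrowthDemo {

    /** Returns the pool size after submitting blocking tasks to a 6/10 pool with a 3000-slot queue. */
    static int poolSizeAfter(int tasks) {
        if (tasks > 3010) {
            // Beyond 10 threads + 3000 queued, CallerRunsPolicy would run the
            // blocking task on this thread and the demo would never return.
            throw new IllegalArgumentException("tasks must be <= 3010 in this demo");
        }
        ThreadPoolExecutor pool = new ThreadPoolExecutor(6, 10, 10, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(3000), new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the thread busy until released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads after 100 tasks:  " + poolSizeAfter(100));
        System.out.println("threads after 3010 tasks: " + poolSizeAfter(3010));
    }
}
```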

ichenyt commented 3 weeks ago

I see that version 1.6 has already fixed this problem. The fix is similar: catch all exceptions without handling them.