Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org
Other
7.43k stars 1.07k forks source link

Flood of "Skipping alert check because of exception" #3123

Closed hc4 closed 4 years ago

hc4 commented 8 years ago

When ES claster is not available, I got tons of errors about skipping alerts check

2016-11-20T03:15:12.692+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onFirstPhaseResult(AbstractSearchAsyncAction.java:206) ~[graylog.jar:?]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.performFirstPhase(AbstractSearchAsyncAction.java:141) ~[graylog.jar:?]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:126) ~[graylog.jar:?]
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:115) ~[graylog.jar:?]
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:47) ~[graylog.jar:?]
    at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:149) ~[graylog.jar:?]
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:137) ~[graylog.jar:?]
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85) ~[graylog.jar:?]
    at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58) ~[graylog.jar:?]
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359) ~[graylog.jar:?]
    at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:52) ~[graylog.jar:?]
    at org.graylog2.indexer.elasticsearch.GlobalTimeoutClient.doExecute(GlobalTimeoutClient.java:58) ~[graylog.jar:?]
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359) ~[graylog.jar:?]
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:348) ~[graylog.jar:?]
    at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:577) ~[graylog.jar:?]
    at org.graylog2.indexer.elasticsearch.GlobalTimeoutClient.search(GlobalTimeoutClient.java:64) ~[graylog.jar:?]
    at org.graylog2.indexer.searches.Searches.count(Searches.java:180) ~[graylog.jar:?]
    at org.graylog2.alerts.types.MessageCountAlertCondition.runCheck(MessageCountAlertCondition.java:102) ~[graylog.jar:?]
    at org.graylog2.alerts.types.MessageCountAlertCondition.runCheck(MessageCountAlertCondition.java:45) ~[graylog.jar:?]
    at org.graylog2.alerts.AlertServiceImpl.triggered(AlertServiceImpl.java:236) ~[graylog.jar:?]
    at org.graylog2.periodical.AlertScannerThread.doRun(AlertScannerThread.java:95) [graylog.jar:?]
    at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:77) [graylog.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_91]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_91]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_91]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_91]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]

This is search for such errors. They happend many times per second.

    Line 1055: 2016-11-20T03:15:12.685+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1086: 2016-11-20T03:15:12.686+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1117: 2016-11-20T03:15:12.687+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1148: 2016-11-20T03:15:12.688+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1179: 2016-11-20T03:15:12.690+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1210: 2016-11-20T03:15:12.690+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1241: 2016-11-20T03:15:12.692+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1272: 2016-11-20T03:15:12.692+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1303: 2016-11-20T03:15:12.694+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1334: 2016-11-20T03:15:12.695+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1365: 2016-11-20T03:15:12.696+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1396: 2016-11-20T03:15:12.697+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1427: 2016-11-20T03:15:12.700+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1458: 2016-11-20T03:15:12.700+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1489: 2016-11-20T03:15:12.702+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1520: 2016-11-20T03:15:12.702+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1551: 2016-11-20T03:15:12.704+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1582: 2016-11-20T03:15:12.705+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1613: 2016-11-20T03:15:12.706+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1644: 2016-11-20T03:15:12.707+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1675: 2016-11-20T03:15:12.709+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1706: 2016-11-20T03:15:12.709+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1737: 2016-11-20T03:15:12.711+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1768: 2016-11-20T03:15:12.711+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1799: 2016-11-20T03:15:12.713+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1830: 2016-11-20T03:15:12.714+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1861: 2016-11-20T03:15:12.715+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1892: 2016-11-20T03:15:12.716+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1923: 2016-11-20T03:15:12.717+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1954: 2016-11-20T03:15:12.718+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 1985: 2016-11-20T03:15:12.719+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2016: 2016-11-20T03:15:12.733+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2047: 2016-11-20T03:15:12.735+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2078: 2016-11-20T03:15:12.735+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2109: 2016-11-20T03:15:12.737+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2140: 2016-11-20T03:15:12.737+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2171: 2016-11-20T03:15:12.739+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2202: 2016-11-20T03:15:12.740+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2233: 2016-11-20T03:15:12.741+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2264: 2016-11-20T03:15:12.742+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2295: 2016-11-20T03:15:12.744+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2326: 2016-11-20T03:15:12.744+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2357: 2016-11-20T03:15:12.746+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2388: 2016-11-20T03:15:12.747+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2419: 2016-11-20T03:15:12.748+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2450: 2016-11-20T03:15:12.749+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2481: 2016-11-20T03:15:12.750+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2512: 2016-11-20T03:15:12.751+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2543: 2016-11-20T03:15:12.753+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2574: 2016-11-20T03:15:12.753+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2605: 2016-11-20T03:15:12.755+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2636: 2016-11-20T03:15:12.755+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2667: 2016-11-20T03:15:12.757+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2698: 2016-11-20T03:15:12.758+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2729: 2016-11-20T03:15:12.759+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2760: 2016-11-20T03:15:12.759+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2791: 2016-11-20T03:15:12.761+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2822: 2016-11-20T03:15:12.762+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2853: 2016-11-20T03:15:12.763+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2884: 2016-11-20T03:15:12.764+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2915: 2016-11-20T03:15:12.766+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2946: 2016-11-20T03:15:12.766+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 2977: 2016-11-20T03:15:12.768+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3008: 2016-11-20T03:15:12.768+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3039: 2016-11-20T03:15:12.770+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3070: 2016-11-20T03:15:12.771+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3101: 2016-11-20T03:15:12.772+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3132: 2016-11-20T03:15:12.772+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3163: 2016-11-20T03:15:12.774+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3194: 2016-11-20T03:15:12.775+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3225: 2016-11-20T03:15:12.776+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3256: 2016-11-20T03:15:12.777+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3287: 2016-11-20T03:15:12.778+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3318: 2016-11-20T03:15:12.779+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3349: 2016-11-20T03:15:12.780+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3380: 2016-11-20T03:15:12.781+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3411: 2016-11-20T03:15:12.783+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3442: 2016-11-20T03:15:12.783+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3473: 2016-11-20T03:15:12.785+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3504: 2016-11-20T03:15:12.785+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3535: 2016-11-20T03:15:12.787+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3566: 2016-11-20T03:15:12.788+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3597: 2016-11-20T03:15:12.789+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3628: 2016-11-20T03:15:12.790+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3659: 2016-11-20T03:15:12.791+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3690: 2016-11-20T03:15:12.792+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3721: 2016-11-20T03:15:12.793+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3752: 2016-11-20T03:15:12.794+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3783: 2016-11-20T03:15:12.796+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3814: 2016-11-20T03:15:12.796+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3845: 2016-11-20T03:15:12.798+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3876: 2016-11-20T03:15:12.798+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3907: 2016-11-20T03:15:12.800+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3938: 2016-11-20T03:15:12.801+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 3969: 2016-11-20T03:15:12.802+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 4000: 2016-11-20T03:15:12.802+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 4031: 2016-11-20T03:15:12.804+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 4062: 2016-11-20T03:15:12.805+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 4093: 2016-11-20T03:15:12.806+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 4124: 2016-11-20T03:15:12.807+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 4155: 2016-11-20T03:15:12.808+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 4186: 2016-11-20T03:15:12.809+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.
    Line 4217: 2016-11-20T03:15:12.810+03:00 ERROR [AlertScannerThread] Skipping alert check that threw an exception.

Expected Behavior

There is must be some grace period, because currently it seems that Graylog DDOS'ing my ES cluster :)

Your Environment

kroepke commented 7 years ago

The message is logged once per alert. When Elasticsearch is not available this is an exceptional circumstance and we log loudly. That's very unlikely to change.

hc4 commented 7 years ago

There is no sense to log it thousand times per second :)

kroepke commented 7 years ago

Fair enough

hc4 commented 7 years ago

Maybe I'm wrong, but it seems that graylog after error tries to send new request without any grace period. And this leads to many errors. Adding pause before resending request after error will solve problem.

ps. why is there still tag "won't fix"?

kroepke commented 7 years ago

Because I forgot to remove it :) The error comes out of the inner loop and should only be logged once per configured alert. We'll look into it during the beta phase.

dennisoelkers commented 7 years ago

What I found out is that the error message is printed for each configured and active (the stream it is created for is not paused) alert condition once per alert scanner run. As we do not know if an alert condition requires ES or anything else (as conditions are pluggable now) there is no easy way at the moment to reduce the amount of logging. A solution might be to introduce a circuit breaker for the indexer, which is beyond the scope of 2.2.

dennisoelkers commented 7 years ago

@hc4: how many alert conditions do you have configured? Are you running the alert checking at a non-default interval?

hc4 commented 7 years ago

I have 1-2 alerts per stream with total count of alerts about 10. Alert checking interval is default

dennisoelkers commented 7 years ago

How many streams do you have in total?

hc4 commented 7 years ago

8, but not all of them have alerts

bernd commented 4 years ago

This should no longer be an issue with the new alerting and the events system. Please open a new issue if this is still a problem with the new system. Thank you!