Yelp / elastalert

Easy & Flexible Alerting With ElasticSearch
https://elastalert.readthedocs.org
Apache License 2.0

Duplicate mails for alerts #1294

Open Inkromind opened 7 years ago

Inkromind commented 7 years ago

We have set up several rules for alerts that should send a mail whenever they are triggered.

However, whenever a rule is triggered, it sends out 2 to 4 identical mails. The number of matches or hits reported does not match the number of mails sent (sometimes it does, but at other times there are e.g. 4 matches/hits with only 2 mails, or 4 mails with only 1 match/hit). Both aggregation and realert (on the log field) are set to 1 minute and work "properly": multiple alerts are combined into a single mail (which is then repeated 2 to 4 times), and for matches with the same log field no new alert is sent within a minute. The number of alerts bundled by the aggregation is also irrelevant (e.g. 1 alert is sent 4 times, while 10 alerts are only sent twice).

All our rules are configured similarly to this:

name: alertRule1
index: logstash-*
type: any
filter:
- and:
    - query:
        query_string:
            default_field: 'log'
            query: '*someQuery* AND NOT some filters'
realert:
    minutes: 1
query_key: 'log'
aggregation:
    minutes: 1
alert:
- "email"
alert_subject: 'ElastAlert: ALERT!'
email: "our@email.com"

The different rules only differ in query and name.

We are running the latest ElastAlert pulled from the master branch (1 instance) with 3 ES instances. run_every is configured to 1 minute and buffer_time is globally configured to 15 minutes.
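For illustration, the relevant part of the global config would look roughly like this (a sketch; es_host and writeback_index are taken from the logs below, the other values are placeholders):

rules_folder: rules
run_every:
    minutes: 1
buffer_time:
    minutes: 15
es_host: elasticsearch-logging.tools.svc.cluster.local
es_port: 9200
writeback_index: elastalert_status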

ElastAlert is, however, throwing a bunch of errors:

August 17th 2017, 15:14:57.000        ProcessController:  ERROR:root:Failed to delete alert AV3wU03_0hgSx0jSdpKp at 2017-08-17T13:13:57.598681Z
August 17th 2017, 15:14:57.000      
August 17th 2017, 15:14:57.000      
August 17th 2017, 15:14:57.000      ProcessController:  WARNING:elasticsearch:DELETE http://elasticsearch-logging.tools.svc.cluster.local:9200/elastalert_status/elastalert/AV3wU03_0hgSx0jSdpKp [status:404 request:0.032s]
August 17th 2017, 13:11:29.000  11:11:29.861Z  INFO elastalert-server: Routes:  Successfully handled GET request for '/rules'.
August 17th 2017, 12:30:47.000  10:30:47.905Z ERROR elastalert-server:
August 17th 2017, 12:30:47.000  10:30:47.988Z ERROR elastalert-server:
August 17th 2017, 12:30:47.000      ProcessController:  WARNING:elasticsearch:DELETE http://elasticsearch-logging.tools.svc.cluster.local:9200/elastalert_status/elastalert/AV3vvQb20hgSx0jSaWe8 [status:404 request:0.008s]
August 17th 2017, 12:30:47.000      
August 17th 2017, 12:30:47.000      ERROR:root:Error fetching aggregated matches: TransportError(404, u'{"found":false,"_index":"elastalert_status","_type":"elastalert","_id":"AV3vve-ZzdiriMCxjrjC","_version":3,"_shards":{"total":2,"successful":2,"failed":0}}')
August 17th 2017, 12:30:47.000      ERROR:root:Failed to delete alert AV3vvQb20hgSx0jSaWe8 at 2017-08-17T10:29:49.006693Z
August 17th 2017, 12:30:47.000      
August 17th 2017, 12:30:47.000      ProcessController:  WARNING:elasticsearch:DELETE http://elasticsearch-logging.tools.svc.cluster.local:9200/elastalert_status/elastalert/AV3vve-ZzdiriMCxjrjC [status:404 request:0.012s]
August 17th 2017, 12:28:48.000          raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
August 17th 2017, 12:28:48.000      ProcessController:  ERROR:root:Error writing alert info to Elasticsearch: TransportError(400, u'illegal_argument_exception', u'[elasticsearch-logging-v1-1][100.127.192.1:9300][indices:data/write/index]')
August 17th 2017, 12:28:48.000        File "elastalert/elastalert.py", line 1377, in writeback
August 17th 2017, 12:28:48.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
August 17th 2017, 12:28:48.000          self._raise_error(response.status_code, raw_data)
August 17th 2017, 12:28:48.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/base.py", line 125, in _raise_error
August 17th 2017, 12:28:48.000      
August 17th 2017, 12:28:48.000          doc_type=doc_type, body=body)
August 17th 2017, 12:28:48.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 312, in perform_request
August 17th 2017, 12:28:48.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 90, in perform_request
August 17th 2017, 12:28:48.000          status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
August 17th 2017, 12:28:48.000      
August 17th 2017, 12:28:48.000  10:28:48.974Z ERROR elastalert-server:
August 17th 2017, 12:28:48.000      Traceback (most recent call last):
August 17th 2017, 12:28:48.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 298, in index
August 17th 2017, 12:28:48.000          _make_path(index, doc_type, id), params=params, body=body)
August 17th 2017, 12:28:48.000          return func(*args, params=params, **kwargs)
August 17th 2017, 12:28:48.000  10:28:48.975Z ERROR elastalert-server:
August 17th 2017, 12:28:48.000      ProcessController:  WARNING:elasticsearch:POST http://elasticsearch-logging.tools.svc.cluster.local:9200/elastalert_status/silence [status:400 request:0.011s]
August 17th 2017, 12:28:48.000      RequestError: TransportError(400, u'illegal_argument_exception', u'[elasticsearch-logging-v1-1][100.127.192.1:9300][indices:data/write/index]')
August 17th 2017, 12:28:47.000      ProcessController:  WARNING:elasticsearch:POST http://elasticsearch-logging.tools.svc.cluster.local:9200/elastalert_status/silence [status:400 request:0.009s]
August 17th 2017, 12:28:47.000          return func(*args, params=params, **kwargs)
August 17th 2017, 12:28:47.000          _make_path(index, doc_type, id), params=params, body=body)
August 17th 2017, 12:28:47.000      Traceback (most recent call last):
August 17th 2017, 12:28:47.000          doc_type=doc_type, body=body)
August 17th 2017, 12:28:47.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 312, in perform_request
August 17th 2017, 12:28:47.000          self._raise_error(response.status_code, raw_data)
August 17th 2017, 12:28:47.000      ProcessController:  ERROR:root:Error writing alert info to Elasticsearch: TransportError(400, u'illegal_argument_exception', u'[elasticsearch-logging-v1-0][100.109.160.7:9300][indices:data/write/index]')
August 17th 2017, 12:28:47.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 298, in index
August 17th 2017, 12:28:47.000  10:28:47.670Z ERROR elastalert-server:
August 17th 2017, 12:28:47.000          raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
August 17th 2017, 12:28:47.000        File "elastalert/elastalert.py", line 1377, in writeback
August 17th 2017, 12:28:47.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
August 17th 2017, 12:28:47.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/base.py", line 125, in _raise_error
August 17th 2017, 12:28:47.000      RequestError: TransportError(400, u'illegal_argument_exception', u'[elasticsearch-logging-v1-0][100.109.160.7:9300][indices:data/write/index]')
August 17th 2017, 12:28:47.000      
August 17th 2017, 12:28:47.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 90, in perform_request
August 17th 2017, 12:28:47.000  10:28:47.669Z ERROR elastalert-server:
August 17th 2017, 12:28:47.000      
August 17th 2017, 12:28:47.000          status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
August 17th 2017, 12:27:48.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/base.py", line 125, in _raise_error
August 17th 2017, 12:27:48.000        File "elastalert/elastalert.py", line 1377, in writeback
August 17th 2017, 12:27:48.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
August 17th 2017, 12:27:48.000          return func(*args, params=params, **kwargs)
August 17th 2017, 12:27:48.000      
August 17th 2017, 12:27:48.000        File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_requests.py", line 90, in perform_request

We also use the Kibana plugin from BitSensor, so it's possible some of the error logs above come from that.

sathishdsgithub commented 7 years ago

@Inkromind

Please set realert to 0 and see if you stop getting duplicate emails.

Also, can you set num_events to some value and see if you get the alert email properly?
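For the realert change, something roughly like this should do (just a sketch, applied on top of your alertRule1; a value of 0 disables the realert suppression):

realert:
    minutes: 0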

Inkromind commented 7 years ago

@sathishdsgithub With realert set to 0 we get multiple alerts for similar matches, which is not something we want. We had that first, and we were also getting duplicate mails then.

Isn't the num_events property only relevant when using the frequency rule type? At least, that's what I gather from the docs. We were using that before, with num_events set to 1. No joy either, as it had the same effect as the any rule type we are using now.
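For reference, roughly what we had before looked like this (a sketch; the timeframe value here is illustrative):

type: frequency
num_events: 1
timeframe:
    minutes: 1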

That said, we have been tweaking the query over the past couple of days to filter out more false positives, and somehow (we don't know why) the duplicate mails stopped... it seems that whether duplicate mails are triggered depends on which/how many results you get.

sathishdsgithub commented 7 years ago

@Inkromind

Can you share the rule here?

Inkromind commented 7 years ago

One of the rules started sending out multiple mails again. I did not change anything in the rule. This is the rule:

name: Exception
index: logstash-*
type: any
filter:
- and:
    - query:
        query_string:
            default_field: 'log'
            query: 'container_name:"app" AND namespace_name:"test" AND *Exception* AND NOT "Invalid session ID" AND NOT Session AND NOT "token invalid" AND "java.net.UnknownHostException: local" AND NOT com.atomikos.icatch.RollbackException AND NOT "WARN ForbiddenException" AND NOT org.apache.camel.processor.validation.SchemaValidationException AND NOT com.atomikos.icatch.jta.TransactionImp.rethrowAsJtaRollbackException AND NOT javax.transaction.RollbackException AND NOT com.atomikos.icatch.RollbackException AND NOT "java.lang.IllegalArgumentException: No enum constant" AND NOT org.springframework.orm.jpa.JpaOptimisticLockingFailureException AND NOT org.eclipse.persistence.exceptions.OptimisticLockException AND NOT "One or more objects cannot be updated because it has changed" AND NOT "org.quartz.SchedulerException: Job threw an unhandled exception"'
realert:
    minutes: 1
query_key: 'log'
aggregation:
    minutes: 1
alert:
- "email"
alert_subject: 'ElastAlert: Exception'
email: "<obfuscated>"

tsboris commented 3 years ago

Hi,

@Inkromind

Was this ever resolved? If it was, what was the solution?

Thanks :-)