Yelp / elastalert

Easy & Flexible Alerting With ElasticSearch
https://elastalert.readthedocs.org
Apache License 2.0

Metric Aggregation Not matching #1164

Open · monitoringit opened 7 years ago

monitoringit commented 7 years ago

Hi, I have a Telegraf agent reporting metrics data to ELK 5.2.1. A doc in ES/Kibana looks like this (Kibana copy/paste):

 @timestamp     June 12th 2017, 14:13:00.000
t _id       AVydgkkTy_bElSPacHQt
t _index        telegraf-2017.06.12
# _score         - 
t _type     metrics
# disk.free     6,186,602,496
# disk.inodes_free      453,290
# disk.inodes_total     524,288
# disk.inodes_used      70,998
# disk.total        8,318,783,488
# disk.used     2,029,527,040
# disk.used_percent     24.702
t measurement_name      disk
t tag.AppId     DGRF
? tag.InstanceId          i-04706844b65d3c2d7
t tag.StackName     grafana-mon-GrafanaApp-15FIOD3L34PR
t tag.VPC       tools1tst
t tag.device        xvda1
t tag.fstype        ext4
t tag.host      ip-10-93-73-40
t tag.path      /

I created a metric_aggregation rule, but it doesn't match. The query hit count shows the right number of docs, so it finds the messages but fails to match:

type: metric_aggregation
index: telegraf-%Y.%m.%d
buffer_time:
  minutes: 2
metric_agg_key: disk.used_percent
metric_agg_type: avg
doc_type: metrics
max_threshold: 20
filter:
- term:
    measurement_name: "disk"

alert:
- sns
alert_subject: 'ELK Alert: "{1},{2},MINOR" on {0}'
alert_subject_args:
- tag.host
- tag.AppId
- tag.StackName

query_key:
- tag.host
- tag.VPC
- tag.StackName
- measurement_name
- tag.path
realert:
  minutes: 1
sns_topic_arn: arn:aws:sns:us-east-1:xxxxxxxxx:xxxxxxxxxxxxxxxxxxx
use_strftime_index: true

Running elastalert-test-rule gives this:

Got 1440 hits from the last 1 day

Available terms in first hit:
        @timestamp
        disk.used_percent
        disk.inodes_total
        disk.inodes_used
        disk.free
        disk.inodes_free
        disk.used
        disk.total
        tag.InstanceId
        tag.fstype
        tag.StackName
        tag.host
        tag.VPC
        tag.AppId
        tag.device
        tag.path
        measurement_name
Included term tag.host,tag.VPC,tag.StackName,measurement_name,tag.path may be missing or null

INFO:elastalert:Note: In debug mode, alerts will be logged to console but NOT actually sent. To send them, use --verbose.

Would have written the following documents to writeback index (default is elastalert_status):

elastalert_status - {'hits': 1440, 'matches': 0, '@timestamp': datetime.datetime(2017, 6, 12, 18, 27, 24, 492228, tzinfo=tzutc()), 'rule_name': 'Grafana Service is not running', 'starttime': datetime.datetime(2017, 6, 11, 18, 27, 20, 735265, tzinfo=tzutc()), 'endtime': datetime.datetime(2017, 6, 12, 18, 27, 20, 735265, tzinfo=tzutc()), 'time_taken': 3.696161985397339}
Qmando commented 7 years ago

Hmm.. I think the issue here is that you cannot use a compound query_key with aggregations.

Try setting query_key: tag.host rather than host/path/vpc etc.
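A minimal sketch of the adjusted rule, assuming the rest of the original rule stays the same and only query_key changes:

```yaml
type: metric_aggregation
index: telegraf-%Y.%m.%d
use_strftime_index: true
buffer_time:
  minutes: 2
metric_agg_key: disk.used_percent
metric_agg_type: avg
doc_type: metrics
max_threshold: 20
filter:
- term:
    measurement_name: "disk"

# Single field instead of a compound key: metric_aggregation
# buckets on one query_key only.
query_key: tag.host
```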

monitoringit commented 7 years ago

It worked when I used query_key: tag.host. But I also see that it fails to populate the subject properly. My rule:

alert_subject: 'ELK Alert: "{1},{2},MINOR" on {0}'
alert_subject_args:
- tag.host
- tag.AppId
- tag.StackName

Message posted to SNS ==> Subject = ELK Alert: ",,MINOR" on ip-10-93-73-40. Any ideas why the values are not populated?

Qmando commented 7 years ago

Because when you are making aggregations, you don't get the values of those fields. It's taking an average across all StackNames and across all AppIDs. You only know the value of query_key.
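A sketch of a subject that uses only fields actually present in an aggregation match: the query_key value and the computed metric. ElastAlert exposes the metric under a field named metric_<metric_agg_key>_<metric_agg_type> (here assumed to be metric_disk.used_percent_avg; verify the exact name against your ElastAlert version):

```yaml
# tag.AppId and tag.StackName are averaged away by the aggregation,
# so they are not available as subject args; drop them.
alert_subject: 'ELK Alert: "MINOR" on {0} (avg disk.used_percent = {1})'
alert_subject_args:
- tag.host
- metric_disk.used_percent_avg   # assumed field name; check your version
```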

monitoringit commented 7 years ago

OK, makes sense! Thanks for sharing the info. Is there a plan to support a compound query_key for the aggregation rule type? Also, can you point me to a doc link covering all the different aspects of the aggregation rule? Thanks for being such a good Samaritan! :-) Appreciate your help.

maayankestler commented 7 years ago

see #1328