cityindex-attic / logsearch

[unmaintained] A development environment for ELK
Apache License 2.0

Add log analysis based alarms for encountered error conditions #93

Closed (sopel closed 9 years ago)

sopel commented 11 years ago

This has been triggered by #90 and relates to #88 - @dpb587's comment 21685798 identifies yet another error condition which manifests itself with a specific signature in the logs, so in principle we should start adding a dedicated alarm for any such condition to avoid similar future regressions (similar to a repro case in unit testing).

Now, ironically we would probably prefer to base this on the solution at hand, but this obviously doesn't work ;)

mrdavidlaing commented 11 years ago

I propose we stream log events to SplunkStorm, as a way of keeping tabs on the competition :)

dpb587 commented 11 years ago

To be fair, elasticsearch does report back (visualized through Bigdesk) how many file descriptors it has open along with the max, so they can easily be quantified. In this case, I just didn't catch the discrepancy of the low max.
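The open-versus-max comparison described above could be automated roughly as follows. This is only a sketch: it assumes the node stats response has already been fetched and parsed into a dict, and that each node carries `open_file_descriptors` and `max_file_descriptors` under `process` (as in the node stats API of recent Elasticsearch versions; the version in use here may expose them differently). The function name `fd_usage` and the 0.8 threshold are hypothetical.

```python
from typing import Dict, List, Tuple


def fd_usage(node_stats: Dict, threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Return (node_name, usage_ratio) for nodes at or above the threshold.

    node_stats is assumed to be the parsed JSON of a node stats response;
    the field names below are an assumption, not confirmed for this cluster.
    """
    flagged = []
    for node_id, node in node_stats.get("nodes", {}).items():
        process = node.get("process", {})
        open_fds = process.get("open_file_descriptors")
        max_fds = process.get("max_file_descriptors")
        # Skip nodes that do not report both values (or report a bogus max).
        if open_fds is None or max_fds is None or max_fds <= 0:
            continue
        ratio = open_fds / max_fds
        if ratio >= threshold:
            flagged.append((node.get("name", node_id), ratio))
    return flagged
```

A low max would show up here as a high ratio even when the absolute number of open descriptors looks harmless.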

I do think we should stream the elasticsearch + logstash + redis logs somewhere. It should be very low volume, so we could probably get by with somebody's free tier.

mrdavidlaing commented 11 years ago

SplunkStorm's free tier (< 1GB /month) should suffice?

sopel commented 11 years ago

Sounds like we are in agreement to stream the logs somewhere else, but a short summary of the desired alerting yields some questions:

Let's talk about that in the next hangout.

mrdavidlaing commented 11 years ago

We could also just send logs to all three, in the spirit of evaluation.

sopel commented 11 years ago

The decision is to approach this one by one in the following order:

  1. Loggly
    • I've just noticed another potential :heavy_minus_sign: for Loggly though, insofar as the saved search page claims "This feature is currently in beta so we're limiting you to five saved searches per account. We'll nudge this up soon!" - this is doubly odd, insofar as the feature is fairly old already and is also the basis for alerting via the external Alert Birds app, which renders it pretty much unusable for production scenarios; but let's just look into it and see where we'll end up.
  2. Papertrail
  3. Splunk Storm

dpb587 commented 11 years ago

Theoretical logstash configuration committed; waiting on input key to test/merge/deploy.

dpb587 commented 11 years ago

The logstash loggly output seemed to perform extremely inefficiently, so I switched to rsyslog. The meta: system_app_loggly task is still committed and works if it's useful later. Tomorrow afternoon I'll push these changes to the ppe cluster.

dpb587 commented 11 years ago

This has been a low priority for me, given I'm not quite sure how to write meaningful search queries without just generating noise. For example, a general search for exception in Loggly returns occasional results like:

<134>Aug 16 09:41:29 ip-10-228-35-215 app-logstash_redis: {:timestamp=>"2013-08-16T09:41:21.565000+0000", :message=>"Failed parsing date from field", :field=>"datetime_tz", :value=>"%{datetime}+01:00", :exception=>java.lang.IllegalArgumentException: Invalid format: "%{datetime}+01:00", :level=>:warn}

Which isn't helpful since that error has been noticed elsewhere and I've created an issue for it on here already. Likewise, some of the nightly reboots cause exceptions due to Rakefile usage which aren't really newsworthy.
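One way to cut that noise would be to keep the broad keyword search but drop lines that match already-known patterns. The sketch below is an illustration only, not how Loggly searches work: the names `KNOWN_NOISE` and `newsworthy` are hypothetical, and the single noise pattern is taken from the log line quoted above.

```python
import re
from typing import Iterable, List, Pattern

# Patterns for errors that are already tracked elsewhere (hypothetical list).
KNOWN_NOISE: List[Pattern[str]] = [
    re.compile(r"Failed parsing date from field"),
]


def newsworthy(
    lines: Iterable[str],
    keyword: str = "exception",
    noise: List[Pattern[str]] = KNOWN_NOISE,
) -> List[str]:
    """Return lines containing the keyword that match no known-noise pattern."""
    keyword_lower = keyword.lower()
    return [
        line
        for line in lines
        if keyword_lower in line.lower()
        and not any(pattern.search(line) for pattern in noise)
    ]
```

The reboot-related Rakefile exceptions could be suppressed the same way once their signature is known.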

So, consider the issue updated, but still pending.

sopel commented 11 years ago

@dpb587 - I see the problem with noise regarding more generic error patterns, that's why I've mostly considered this to be a defense in depth approach regarding well known (or to be anticipated) special/rare error patterns like those (Too many open files) in #90 - I agree that this might be a bit too fine grained to be useful and should rather be covered by more general cluster health checks (which hadn't been in place back then).
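The defense in depth idea, one dedicated alarm per well-known signature, can be sketched as a small table of named patterns. This is a minimal illustration under stated assumptions: the alarm name `too_many_open_files`, the `SIGNATURES` table and the function `triggered_alarms` are hypothetical; only the "Too many open files" text comes from the discussion.

```python
import re
from typing import Dict, Iterable, Pattern

# One named alarm per known error signature (hypothetical names).
SIGNATURES: Dict[str, Pattern[str]] = {
    "too_many_open_files": re.compile(r"Too many open files"),
}


def triggered_alarms(
    lines: Iterable[str],
    signatures: Dict[str, Pattern[str]] = SIGNATURES,
) -> Dict[str, int]:
    """Count how many lines match each signature; unmatched alarms are omitted."""
    counts: Dict[str, int] = {}
    for line in lines:
        for name, pattern in signatures.items():
            if pattern.search(line):
                counts[name] = counts.get(name, 0) + 1
    return counts
```

Each newly identified condition would add one entry to the table, much like adding a repro case to a test suite.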

sopel commented 11 years ago

Moved to Icebox due to low priority.

sopel commented 10 years ago

The Loggly-based alarms are available now after upgrading to Loggly Gen2 via https://github.com/cityindex/logsearch/issues/235; see https://github.com/cityindex/logsearch-config/issues/57 for an example notification. Currently only email, PagerDuty and HTTP POST endpoints are available though, so a smooth integration into our current alarm/notification mix would require a corresponding HTTP endpoint.

sopel commented 9 years ago

Closed as Won't Fix due to project being retired to the CityIndex Attic.