Closed by sopel 9 years ago.
I propose we stream log events to SplunkStorm, as a way of keeping tabs on the competition :)
To be fair, elasticsearch does report back (visualized through Bigdesk) how many file descriptors it has open, along with the max, so they can easily be quantified. In this case, I just didn't catch that the max was so low.
I do think we should stream the elasticsearch + logstash + redis logs somewhere. It should be very low volume, so we could probably get by with somebody's free tier.
SplunkStorm's free tier (< 1 GB/month) should suffice?
Sounds like we are in agreement on streaming the logs somewhere else, but a short summary of the desired alerting raises some questions:
Let's talk about that in the next hangout.
We could also just send logs to all three, in the spirit of evaluation.
The decision is to approach this one by one in the following order:
Theoretical logstash configuration committed; waiting on input key to test/merge/deploy.
The logstash loggly output seemed to perform extremely inefficiently, so I switched to rsyslog. The meta: system_app_loggly task is still committed and works if it's useful later. Tomorrow afternoon I'll push these changes to the ppe cluster.
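For reference, here is a minimal sketch of what such a loggly output stanza might look like. This is an illustration, not the committed config: the key is a placeholder for the input key we were waiting on, and the options in the system_app_loggly task may differ.

```
# Minimal sketch of a logstash -> Loggly output using the stock "loggly"
# output plugin. "LOGGLY_INPUT_KEY" is a placeholder, not a real key,
# and the committed system_app_loggly task may use different options.
output {
  loggly {
    key => "LOGGLY_INPUT_KEY"
  }
}
```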
This has been a low priority for me, since I'm not quite sure how to craft meaningful search queries without them just becoming noise. For example, a general search in Loggly for exception occasionally turns up results like:
<134>Aug 16 09:41:29 ip-10-228-35-215 app-logstash_redis: {:timestamp=>"2013-08-16T09:41:21.565000+0000", :message=>"Failed parsing date from field", :field=>"datetime_tz", :value=>"%{datetime}+01:00", :exception=>java.lang.IllegalArgumentException: Invalid format: "%{datetime}+01:00", :level=>:warn}
That isn't helpful, since the error has already been noticed elsewhere and I've created an issue for it here. Likewise, some of the nightly reboots cause exceptions from Rakefile usage which aren't really newsworthy.
So, consider the issue updated, but still pending.
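One way to cut down that noise would be to tag the known-noisy events before they are searched, so a broad query for exception can exclude them. A minimal sketch only, assuming the events pass through a logstash instance with conditional support (1.2+) rather than plain rsyslog; the tag name is made up:

```
filter {
  # Tag the date-parsing warning we already have an issue for, so broad
  # searches (e.g. for "exception") can filter on the tag and skip it.
  if [message] =~ /Failed parsing date from field/ {
    mutate { add_tag => ["known_noise_datetime_parse"] }
  }
}
```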
@dpb587 - I see the problem with noise from the more generic error patterns; that's why I've mostly considered this a defense-in-depth approach for well-known (or anticipated) special/rare error signatures like the "Too many open files" one in #90. I agree that this might be a bit too fine-grained to be useful and should rather be covered by more general cluster health checks (which hadn't been in place back then).
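For a signature like that, a dedicated alarm could be as simple as tagging matching events on the way in and alerting on the tag. Again only a hedged sketch (logstash conditionals assumed, tag name illustrative), not config we actually run:

```
filter {
  # Tag events carrying the "Too many open files" signature from #90 so a
  # saved search / notification can fire whenever the tag shows up.
  if [message] =~ /Too many open files/ {
    mutate { add_tag => ["alert_too_many_open_files"] }
  }
}
```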
The Loggly-based alarms are available now after upgrading to Loggly Gen2 via https://github.com/cityindex/logsearch/issues/235; see https://github.com/cityindex/logsearch-config/issues/57 for an example notification. Currently only email, PagerDuty and HTTP POST endpoints are available though, so a smooth integration into our current alarm/notification mix would require a corresponding HTTP endpoint on our side.
Closed as Won't Fix due to the project being retired to the CityIndex Attic.
This has been triggered by #90 and relates to #88 - @dpb587's comment 21685798 identifies yet another error condition that manifests itself with a specific signature in the logs, so in principle we should start adding a dedicated alarm for each such condition to avoid similar regressions in the future (much like a repro case in unit testing).
Now, ironically, we would probably prefer to base this on the solution at hand, but that obviously doesn't work when the solution itself is the thing that's failing ;)