Closed by sopel 9 years ago.
I propose we stream log events to SplunkStorm, as a way of keeping tabs on the competition :)
To be fair, elasticsearch does report back (visualized through Bigdesk) how many file descriptors it has open, along with the max, so they can easily be quantified. In this case, I just didn't catch that the max was so low.
I do think we should stream the elasticsearch + logstash + redis logs somewhere. It should be very low volume, so we could probably get by with somebody's free tier.
SplunkStorm's free tier (< 1 GB/month) should suffice?
Sounds like we are in agreement on streaming the logs somewhere else, but a short summary of the desired alerting raises some questions:
Let's talk about that in the next hangout.
We could also just send logs to all three, in the spirit of evaluation.
The decision is to approach this one by one in the following order:
Theoretical logstash configuration committed; waiting on input key to test/merge/deploy.
The logstash loggly output seemed to perform extremely inefficiently, so I switched to rsyslog. The meta: system_app_loggly task is still committed and works if it's useful later. Tomorrow afternoon I'll push these changes to the ppe cluster.
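For reference, here is a minimal sketch of what such a loggly output stanza might look like. This is an illustration, not the committed config: the key is a placeholder for the input key we were waiting on, and the options in the system_app_loggly task may differ.

```
# Minimal sketch of a logstash -> Loggly output using the stock "loggly"
# output plugin. "LOGGLY_INPUT_KEY" is a placeholder, not a real key,
# and the committed system_app_loggly task may use different options.
output {
  loggly {
    key => "LOGGLY_INPUT_KEY"
  }
}
```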
This has been a low priority for me, since I'm not quite sure how to craft meaningful search queries without them just becoming noise. For example, a general search in Loggly for exception occasionally turns up results like:
<134>Aug 16 09:41:29 ip-10-228-35-215 app-logstash_redis: {:timestamp=>"2013-08-16T09:41:21.565000+0000", :message=>"Failed parsing date from field", :field=>"datetime_tz", :value=>"%{datetime}+01:00", :exception=>java.lang.IllegalArgumentException: Invalid format: "%{datetime}+01:00", :level=>:warn}
That isn't helpful, since the error has already been noticed elsewhere and I've created an issue for it here. Likewise, some of the nightly reboots cause exceptions from Rakefile usage which aren't really newsworthy.
So, consider the issue updated, but still pending.
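One way to cut down that noise would be to tag the known-noisy events before they are searched, so a broad query for exception can exclude them. A minimal sketch only, assuming the events pass through a logstash instance with conditional support (1.2+) rather than plain rsyslog; the tag name is made up:

```
filter {
  # Tag the date-parsing warning we already have an issue for, so broad
  # searches (e.g. for "exception") can filter on the tag and skip it.
  if [message] =~ /Failed parsing date from field/ {
    mutate { add_tag => ["known_noise_datetime_parse"] }
  }
}
```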
@dpb587 - I see the problem with noise from the more generic error patterns; that's why I've mostly considered this a defense-in-depth approach for well-known (or anticipated) special/rare error signatures like the "Too many open files" one in #90. I agree that this might be a bit too fine-grained to be useful and should rather be covered by more general cluster health checks (which hadn't been in place back then).
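For a signature like that, a dedicated alarm could be as simple as tagging matching events on the way in and alerting on the tag. Again only a hedged sketch (logstash conditionals assumed, tag name illustrative), not config we actually run:

```
filter {
  # Tag events carrying the "Too many open files" signature from #90 so a
  # saved search / notification can fire whenever the tag shows up.
  if [message] =~ /Too many open files/ {
    mutate { add_tag => ["alert_too_many_open_files"] }
  }
}
```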
The Loggly-based alarms are available now after upgrading to Loggly Gen2 via https://github.com/cityindex/logsearch/issues/235; see https://github.com/cityindex/logsearch-config/issues/57 for an example notification. Currently only email, PagerDuty and HTTP POST endpoints are available though, so a smooth integration into our current alarm/notification mix would require a corresponding HTTP endpoint on our side.
Closed as Won't Fix due to the project being retired to the CityIndex Attic.
This has been triggered by #90 and relates to #88 - @dpb587's comment 21685798 identifies yet another error condition that manifests itself with a specific signature in the logs, so in principle we should start adding a dedicated alarm for each such condition to avoid similar regressions in the future (much like a repro case in unit testing).
Now, ironically, we would probably prefer to base this on the solution at hand, but that obviously doesn't work when the solution itself is the thing that's failing ;)