This is an epic aimed at encompassing the multitude of related issues encountered, discussed or proposed by @dpb587, @mrdavidlaing or @sopel during the lifetime of this project.
While we've already explored and integrated quite a few metrics and monitoring components into the stack, we are still falling short of following a metrics-driven approach. See e.g. Metrics-Driven Development for what I mean here (not necessarily the entire article right now, just the perspective; the article is very good and worth a read though, since much of what it describes is very familiar, yet excellently summarized and reasoned about):
Nowadays releases in the IT world are becoming a matter of hours or even minutes. Everything is scaling up and down (vertically), to the right and to the left (horizontally). Therefore having a good monitoring system is a must. [...]
Granted, we all know, want and attempt this, and we also know that #monitoringsucks, which is why this remains a difficult and tedious task of stepwise improvements.
Accordingly, the current state of affairs leaves a lot to be desired, most notably the lack of consistent and reliable alerting, of a performance test setup, and of at least a semantic SCM, which is required for correlating metrics with the deployed software stack.
Put another way, I perceive our approach to improving cluster performance and reliability to be based more on heuristics than on metrics (i.e. educated guesses, post hoc analysis of failures etc.). This obviously still yields noticeable improvements over time, but it doesn't increase my confidence that we are 'in control of the operation'.
I'll stress again that we have touched on all of those aspects already, and @dpb587 in particular has made significant inroads regarding metrics collection and performance testing, which should enable us to improve quickly going forward.
:information_source: I'll link all related issues to this epic as I convert my scribbles into something tangible.