Downtime for SLA-Reporting

icefish-creativ commented 5 years ago

Uptime is a really great feature. I think it is useful for everybody, or rather essential, for SLA reporting. It would be nice if you could set an SLA value, so you could also put a monitor on it and see if the service levels are violated. And then I could finally throw Nagios/Icinga out the window.

A list of downtimes per host/service
An SLA that can be set per host/service, such as 99.9% or 99.5%
SLA reporting where downtimes are calculated automatically (GUI for live view and PDF)
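
As a rough illustration of the points above, here is a minimal Python sketch of how downtime could be summed for one host/service and compared with an SLA target. The `Downtime` type, the function names, and the sample numbers are all hypothetical; nothing here is an existing Uptime or Heartbeat API, and the sketch assumes the outage windows do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Downtime:
    """One outage window for a host/service (hypothetical type)."""
    start: datetime
    end: datetime


def availability_percent(downtimes, period_start, period_end):
    """Return the percentage of the reporting period spent up.

    Assumes period_end > period_start and non-overlapping outages.
    """
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for d in downtimes:
        # Clip each outage to the reporting period before summing.
        start = max(d.start, period_start)
        end = min(d.end, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * (period - down) / period


def sla_violated(downtimes, period_start, period_end, sla_percent):
    """True if measured availability falls below the SLA target."""
    return availability_percent(downtimes, period_start, period_end) < sla_percent


# Sample data (made up): a 30-day period with a single one-hour outage.
start = datetime(2019, 9, 1)
end = datetime(2019, 10, 1)
outages = [Downtime(datetime(2019, 9, 10, 12, 0), datetime(2019, 9, 10, 13, 0))]
availability = availability_percent(outages, start, end)
```

With these sample numbers, one hour of downtime in 720 hours breaks a 99.9 target but not a 99.5 target, which is the kind of check an SLA monitor would run.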

cheers

Tim

andrewvc commented 5 years ago

Thanks for posting this, Tim. This is a great idea, and honestly not a heavy lift either.

I think a next step for this would be to create mocks.

@dov0211 @justinkambic @makwarth what are your thoughts for this feature?

From a priority perspective, this feels lower than central management and alerting, so it's probably a ways down the road for now.

makwarth commented 5 years ago

Agree this would be a great addition. Thanks for posting, @icefish-creativ. I wonder how SLA per service would work with Heartbeat. Grouping of monitors?

dov0211 commented 5 years ago

We had similar feedback from our SA team (SLAs and integration within Observability solutions). I agree with @andrewvc that we should consider those two items straight after central management and alerting. Several vendors provide rich capabilities for SLA calculation (different KPIs, different calendars and working hours, downtimes, and more). I believe we should start with calculating endpoint availability as a first phase, and think of those in terms of reporting and alerting.

icefish-creativ commented 5 years ago

@makwarth my pleasure. For example, I have 4 web servers behind a load balancer, so I need 5 monitors: 4 to check the servers and 1 that goes through the load balancer with an application endpoint check. I give the customer an SLA on the load balancer check. The availability of the individual servers is of secondary importance.

Setting downtime on groups would of course be great :-). Groups should be based on custom fields. For example, I add the custom fields host.environment (like prod, test), host.role (like web server, mysql) and host.setup (foobar1, foobar2) to every message.
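
To make the grouping idea concrete, here is a minimal Python sketch that computes availability per value of a custom field. The record layout (flat dictionaries with a `status` key) and the `group_availability` helper are assumptions for illustration only, not how Heartbeat or Uptime actually store or query check results.

```python
from collections import defaultdict


def group_availability(checks, field):
    """Return the percentage of 'up' checks per value of a custom field."""
    totals = defaultdict(int)
    ups = defaultdict(int)
    for check in checks:
        key = check[field]
        totals[key] += 1
        if check["status"] == "up":
            ups[key] += 1
    return {key: 100.0 * ups[key] / totals[key] for key in totals}


# Made-up check results carrying the custom fields described above.
checks = [
    {"host.environment": "prod", "host.role": "web server", "status": "up"},
    {"host.environment": "prod", "host.role": "web server", "status": "down"},
    {"host.environment": "prod", "host.role": "mysql", "status": "up"},
    {"host.environment": "test", "host.role": "web server", "status": "up"},
]
by_role = group_availability(checks, "host.role")
by_env = group_availability(checks, "host.environment")
```

The same helper works for any of the custom fields, so a group-level SLA could be evaluated per environment, per role, or per setup.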

alogishetty commented 5 years ago

Hey guys, can we get this feature implemented? We are looking for this kind of metric for defining SLOs and SLAs.

dwchurch commented 5 years ago

Yes, uptime would be immensely more valuable with this feature.

TheSecMaven commented 5 years ago

This is a huge blocker for us in our implementation of Uptime. We are also looking for this kind of metric and think this product would be far more valuable if it provided it.

firewallkevin commented 5 years ago

This is a superlatively useful feature that would extend the Elastic Stack's use in IT operations and analytics.

andrewvc commented 5 years ago

To add to this issue, some of the metadata that's a pre-req for calculating this is here: https://github.com/elastic/beats/pull/13672

I'm thinking we can add this along with this improvement in https://github.com/elastic/kibana/issues/44546 since the timeline calculation gives us that info for free more or less.

It would be great to hear from more people in this thread about what a downtime indicator would be used for.

Would you use it for defining breaches of contract? Purely for internal metrics with less strict guidelines? Something else? Would you use it from multiple geo locations?

alogishetty commented 5 years ago

We are mainly looking for internal metrics across multiple geo locations. Defining contract breaches would also be a great addition to this feature.

Thank you, Abhishek

andrewvc commented 5 years ago

@alogishetty it'd be great to hear more details here about contract breaches and geo locations.

Are you looking for individual statistics per geo location?
How would you define contract breaches? It may be hard or impossible for us to support custom formulas for SLA.

alogishetty commented 5 years ago

For geo locations: we often lose connectivity between data centers, and we would like to track how often that happens.
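
One way to quantify "how often" is to count transitions from up to down per data center link. The following is a minimal Python sketch under stated assumptions: each check result is a dictionary with `source`, `target`, and `status` keys (hypothetical field names, not Heartbeat's actual schema), and the results are already sorted by time.

```python
from collections import defaultdict


def count_connectivity_losses(checks):
    """Count up -> down transitions per (source, target) link.

    `checks` must be ordered by time. A link whose first recorded
    status is already "down" is not counted as a new loss.
    """
    last_status = {}
    losses = defaultdict(int)
    for check in checks:
        link = (check["source"], check["target"])
        if check["status"] == "down" and last_status.get(link) == "up":
            losses[link] += 1
        last_status[link] = check["status"]
    return dict(losses)


# Made-up results: dc1 -> dc2 drops twice, dc2 -> dc1 stays up.
checks = [
    {"source": "dc1", "target": "dc2", "status": "up"},
    {"source": "dc1", "target": "dc2", "status": "down"},
    {"source": "dc1", "target": "dc2", "status": "down"},
    {"source": "dc1", "target": "dc2", "status": "up"},
    {"source": "dc1", "target": "dc2", "status": "down"},
    {"source": "dc2", "target": "dc1", "status": "up"},
]
losses = count_connectivity_losses(checks)
```

Consecutive "down" results count as a single loss, so the figure reflects outage episodes rather than failed checks.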

For contract breaches, we hadn't considered this until I saw it in your previous email. It is something we would like to start discussing with our vendors, and Uptime would be a great tool to track breaches.

Regards, Abhishek


andrewvc commented 5 years ago

@alogishetty hmmm, how do you track connectivity between data centers? Do you have a heartbeat job in each DC that does nothing but ping the other? Or do all your monitors run once in each DC? Both?

andrewvc commented 4 years ago

Fixed in https://github.com/elastic/kibana/pull/67790 (targeting 7.9.0). If this doesn't cover the use case of anyone in this thread, feel free to open a new issue.
