Additional "openCircuitHostNames" metric in HystrixMetricsPoller?

Netflix / Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.


Additional "openCircuitHostNames" metric in HystrixMetricsPoller? #271

Closed regunathb closed 9 years ago

regunathb commented 10 years ago

We have a Hystrix dashboard deployment that aggregates metrics from hundreds of servers. While the dashboard shows the number of hosts where a circuit is open, it would also be useful for it to show the host names. Is this a requirement for other deployments? How does Netflix identify hosts where circuits are open?

One approach that we implemented was to extend HystrixMetricsPoller to expose an "openCircuitHostNames" metric. This gets aggregated across hosts and is available in the dashboard for display. Is this additional metric worth adding to HystrixMetricsPoller?

Will be happy to submit a pull request with this code, if the idea is acceptable: https://github.com/Flipkart/phantom/blob/master/runtime/src/main/java/com/flipkart/phantom/runtime/impl/hystrix/HystrixMetricsPoller.java
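As a rough illustration of the aggregation step described above, here is a minimal, self-contained sketch. It is not the Phantom code linked above and does not use the Hystrix API: `CircuitSample` and `OpenCircuitHostNames` are hypothetical names invented for this example, and the sketch assumes each host reports, per command, whether its circuit is open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** One host's view of one command's circuit (hypothetical type, not part of Hystrix). */
final class CircuitSample {
    final String hostName;
    final String commandName;
    final boolean circuitOpen;

    CircuitSample(String hostName, String commandName, boolean circuitOpen) {
        this.hostName = hostName;
        this.commandName = commandName;
        this.circuitOpen = circuitOpen;
    }
}

/** Hypothetical aggregator: groups hosts that report an open circuit by command name. */
final class OpenCircuitHostNames {
    private OpenCircuitHostNames() {}

    static Map<String, List<String>> aggregate(List<CircuitSample> samples) {
        // TreeMap keeps command names sorted so the output is stable between polls.
        Map<String, List<String>> hostsByCommand = new TreeMap<>();
        for (CircuitSample sample : samples) {
            if (!sample.circuitOpen) {
                continue; // hosts with a closed circuit contribute nothing
            }
            hostsByCommand
                .computeIfAbsent(sample.commandName, k -> new ArrayList<>())
                .add(sample.hostName);
        }
        return hostsByCommand;
    }
}
```

Calling `OpenCircuitHostNames.aggregate(samples)` returns only the commands that have at least one open circuit, each mapped to the host names that reported it, in the order the samples arrived.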

benjchristensen commented 9 years ago

> How does Netflix identify hosts where circuits are open?

We have not tried to do this in real time so far. We have used our time-series system for breaking out hosts.

> Is this additional metric worthy of adding to HystrixMetricsPoller?

My concern would be the increase in the size of the data. A cluster of 1000+ machines could easily have 1000+ hostnames being transmitted when an outage happens. That is a non-trivial addition of bytes to be sent every second.
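To put a rough number on that concern, the following back-of-the-envelope sketch estimates the size of a JSON string array of host names. The class name `HostNamePayloadEstimate` and the 30-character host name length used in the note below are assumptions for illustration only; real host names and the actual wire format may differ.

```java
/** Hypothetical helper for estimating the extra payload of a host-name list. */
final class HostNamePayloadEstimate {
    private HostNamePayloadEstimate() {}

    /**
     * Bytes needed to encode hostCount ASCII host names, each hostNameLength
     * characters long, as a compact JSON array such as ["a","b"].
     */
    static long jsonArrayBytes(int hostCount, int hostNameLength) {
        if (hostCount <= 0) {
            return 2; // just "[]"
        }
        long brackets = 2;
        long elements = (long) hostCount * (hostNameLength + 2); // name plus two quotes
        long commas = hostCount - 1;
        return brackets + elements + commas;
    }
}
```

Under these assumptions, `HostNamePayloadEstimate.jsonArrayBytes(1000, 30)` gives 33001 bytes, so roughly 33 KB would be added per command with open circuits on every one-second update during a cluster-wide outage.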

mattrjacobs commented 9 years ago

@regunathb is this still something you're interested in? I'm trying to understand what you would do with this information if it were available. In general, when we see circuits trip in production, we like to know how many requests get short-circuited, but getting down to the instance level doesn't help us solve anything. Can you help me understand?

regunathb commented 9 years ago

@mattrjacobs thanks for asking. Knowing the host names is pretty useful for us when failures are localized to hosts. We run a proxy (called Phantom) that wraps all outgoing calls with Hystrix commands, including those to other local daemons that are smart (shard-aware) clients to distributed data stores. There have been instances when these daemons degrade on the data access path. Having only a cluster-level view of the number of failing calls makes debugging hard.

mattrjacobs commented 9 years ago

This is not something that would be useful to us at Netflix. If there's a way to implement this feature that solves your need and lets others leave it turned off, I'm happy to review it. For now, marking as 'wontfix'.
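One way such an opt-in could look is sketched below. This is purely an illustration, not an existing Hystrix option: `CommandMetricsWriter` and its `includeHostName` flag are hypothetical. The `name` and `isCircuitBreakerOpen` keys are modeled on the fields of the Hystrix metrics stream, while `openCircuitHostNames` is the metric proposed in this thread.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-host writer with an opt-in host-name field (not a Hystrix class). */
final class CommandMetricsWriter {
    private final boolean includeHostName; // assumed flag; off for users who do not want it
    private final String localHostName;

    CommandMetricsWriter(boolean includeHostName, String localHostName) {
        this.includeHostName = includeHostName;
        this.localHostName = localHostName;
    }

    Map<String, Object> write(String commandName, boolean circuitOpen) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("name", commandName);
        fields.put("isCircuitBreakerOpen", circuitOpen);
        // The extra field is emitted only when the feature is enabled AND the
        // circuit is open, so deployments that leave it off pay no extra bytes.
        if (includeHostName && circuitOpen) {
            fields.put("openCircuitHostNames", Collections.singletonList(localHostName));
        }
        return fields;
    }
}
```

With the flag off, the output is identical to what it would be without the feature. With it on, only hosts whose circuit is currently open add their name, which bounds the extra data to the failing hosts.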