Netflix / Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

What would you like to see in the 1.4 wiki documentation? #699

Closed DavidMGross closed 9 years ago

DavidMGross commented 9 years ago

What is new or has changed in the 1.4 release that needs to be explained in the wiki docs?

Are there other changes that have been made along the way that need documentation?

Are there other areas in the wiki documentation that need improvement?

If you think of anything along these lines, please make note of it here and I'll try to address it. I've been working with Netflix's documentation for a few years now (e.g. API.Next, RxJava), but have only recently started to familiarize myself with Hystrix.

mattrjacobs commented 9 years ago

Thanks, @DavidMGross ! I really appreciate it.

Overall, the major change from 1.3 to 1.4 is the addition of HystrixObservableCommand. The API and semantics of HystrixCommand should be unchanged (modulo bugfixes), even though the guts got totally rewritten.

https://github.com/Netflix/Hystrix/wiki/How-it-Works describes the internals of HystrixCommand and what flows are possible. The major change here is that instead of a run() method that returns a single-valued T and a getFallback() method that does the same, that is now abstracted away to an execution observable, and a fallback observable, each of which returns an Observable<T>. A HystrixCommand ends up being a sequence of Rx operators wired together. The diagram should reflect that new implementation, whether it's via an update to the current one, or possibly some sort of marble diagram. Any thoughts on what would work best from a presentation point-of-view, @DavidMGross ?

https://github.com/Netflix/Hystrix/wiki/How-To-Use looks great, and it looks like you've already explained motivation for HystrixObservableCommand and its usage. I think adding some HystrixObservableCommand cases to "Common Patterns" would be helpful, as well.

Some patterns could include:

Any other ideas, @benjchristensen ?

DavidMGross commented 9 years ago

I don't think a marble diagram is going to do us much good for explaining the whole flow, since so much of that has to do with things like doOnEach/Next/Completed/Terminate where marble diagrams aren't very evocative. They might be helpful for illustrating some specific patterns. Here's another take on the flow chart: [image: hystrix command flow chart]

mattrjacobs commented 9 years ago

Here are a couple places to clean up:

1) The API of HystrixObservableCommand is just observe() and toObservable().
2) 'Scheduler rejected' isn't exactly right. It's still semaphore and threadpool rejected, just like in 1.3.x.
3) 'circuit-breaker thrown' should be changed to 'circuit-breaker open'.
4) For me, the graphical layout of the upper-left is somewhat confusing. I think that it is helpful to talk about the interrelationship of execute(), queue(), observe(), and toObservable(), but this layout is hard to follow for me.

Something that might work better is to emphasize that all operations are now implemented in terms of toObservable(). As you've described, observe() is just toObservable().subscribe() into a ReplaySubject, queue() is just toObservable().toBlocking().toFuture() and execute() is just toObservable().toBlocking().toFuture().get(). Those, to me, are somewhat distinct from the Hystrix flow describing the state machine of a command, so I think it might make sense to have an auxiliary table describing the API for HystrixCommand and HystrixObservableCommand and just include toObservable() in the Hystrix flow diagram. I think that would help explain the usage and internals while not mixing the 2 unduly. What do you think?
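To make the layering concrete, here is a minimal JDK-only analogy rather than Hystrix or RxJava code. `LayeredCommandSketch` and `toLazy()` are hypothetical names invented for this sketch: a `Supplier<CompletableFuture<T>>` stands in for the lazy `toObservable()`, and a `CompletableFuture` stands in for the replaying subject behind `observe()`. The only point is that every entry point delegates to one lazy primitive.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Toy analogy only; none of these types come from Hystrix or RxJava.
abstract class LayeredCommandSketch<T> {

    // The user-supplied work, comparable in spirit to run().
    protected abstract T run() throws Exception;

    // Stand-in for toObservable(): building the Supplier starts nothing;
    // work begins only when someone calls get() on it.
    Supplier<CompletableFuture<T>> toLazy() {
        return () -> CompletableFuture.supplyAsync(() -> {
            try {
                return run();
            } catch (Exception e) {
                throw new CompletionException(e);
            }
        });
    }

    // Stand-in for observe(): start immediately; the future keeps the result,
    // so callers that attach later still see it.
    CompletableFuture<T> observe() {
        return toLazy().get();
    }

    // Stand-in for queue(): start immediately and expose a plain Future.
    Future<T> queue() {
        return toLazy().get();
    }

    // Stand-in for execute(): queue() followed by a blocking get().
    T execute() {
        try {
            return queue().get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

One deliberate simplification: this toy lets you call `execute()` repeatedly on the same instance, which is not how a real command behaves, and it omits all of the isolation, timeout, and circuit-breaker logic that the flow chart covers.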

As always, thanks for the help and making this concrete.

DavidMGross commented 9 years ago

Here's a less-cluttered version: [image: hystrix command flow chart]

mattrjacobs commented 9 years ago

Yeah, I feel that reads much better. Thanks for the edits. One thing I now realize that is left out is timeouts. I think that's an important enough concept to merit inclusion. That can likely be represented as a state close to 'Success?' (probably after to make the most sense) that also has an arrow to the fallback() path.

Nice work!

DavidMGross commented 9 years ago

Good idea. A new version of the flowchart with the timeout possibility included is now live at https://github.com/Netflix/Hystrix/wiki/How-it-Works

benjchristensen commented 9 years ago

Here is the original Omnigraffle file for the How It Works diagram: https://www.dropbox.com/s/yu47ssj0mpe9r5q/hystrix-how-it-works.graffle?dl=0

ikolomiets commented 9 years ago

At least javadoc needs to be updated to 1.4. http://netflix.github.io/Hystrix/javadoc/index.html?com/netflix/hystrix/HystrixObservableCommand.html returns 404 :(

mattrjacobs commented 9 years ago

@ikolomiets Sorry about the oversight. We tweaked the build/release config as part of the 1.4 release and I missed this.

I'll take a look at this

mattrjacobs commented 9 years ago

@quidryan / @rspieldenner Is there any config I'm missing around publishing Javadocs to netflix.github.io?

quidryan commented 9 years ago

I don't believe that's inherently in the build anymore. We hadn't heard or seen anyone doing it, so I don't think it made it into the default build. But there's nothing stopping you from using the functionality in gradle-git: https://github.com/ajoberstar/gradle-git/wiki/org.ajoberstar.github-pages

mattrjacobs commented 9 years ago

Thanks @quidryan for the recommendation. Will get that integrated ASAP.

benjchristensen commented 9 years ago

It has never worked for me so I manually do it as I've never succeeded in shaving this yak.

benjchristensen commented 9 years ago

Here is how I've been generating it so we get only the packages we want:

javadoc -windowtitle "Hystrix Javadoc 1.3.9" \
  -sourcepath /Users/bechristensen/development/github/HystrixOrigin/hystrix-core/src/main/java/ \
  -d /Users/bechristensen/development/github/HystrixPagesOrigin/javadoc \
  -stylesheetfile /Users/bechristensen/development/github/javadocStyleSheet.css \
  -top "<a href='https://github.com/Netflix/Hystrix'><img width='92' height='79' border='0' align='left' src='http://netflix.github.com/Hystrix/images/hystrix-logo-small.png'></a><h2 class='title' style='padding-top:40px'>Hystrix: Latency and Fault Tolerance for Distributed Systems</h2>" \
  -doclet org.benjchristensen.doclet.DocletExclude \
  -docletpath /Users/bechristensen/development/github/doclet-exclude.jar \
  -classpath [classpath-here] \
  com.netflix.hystrix com.netflix.hystrix.exception com.netflix.hystrix.strategy \
  com.netflix.hystrix.strategy.concurrency com.netflix.hystrix.strategy.eventnotifier \
  com.netflix.hystrix.strategy.metrics com.netflix.hystrix.strategy.properties \
  com.netflix.hystrix.strategy.executionhook

mattrjacobs commented 9 years ago

@ikolomiets I got this uploaded now. Now that it's there, please let me know if you see any areas you think the documentation can be improved.

ikolomiets commented 9 years ago

@mattrjacobs @benjchristensen - guys, you rock! NetflixOSS documentation is already one of the best out there :)

infomaven commented 9 years ago

I'm working with a team that has an online application (a REST api) and they are currently not using fallbacks. As far as I can tell, they don't have plans to retry requests at a later time either. They just want to "fire and forget" the request, so they are catching and logging exceptions thrown by HystrixCommand objects.

I've been following your improvements to the flowchart on the Hystrix wiki (https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-command-flow-chart.png) and found it very useful for understanding how everything works. However, when I tried to follow it through for this case (fallback disabled), I wasn't quite sure how to proceed.

Could we add this to the wiki somewhere? I really like what you've been doing with your docs.

mattrjacobs commented 9 years ago

I'm not quite clear on what you're asking. Could you be more specific?

infomaven commented 9 years ago

I am looking to understand better how Hystrix works and behaves when fallback is NOT enabled.

DavidMGross commented 9 years ago

I've added a bit more detail here: https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow8

Does that adequately explain things or are there more aspects that need to be covered?

infomaven commented 9 years ago

Hi, apologies for not getting back to you sooner.

Yes, this does help explain things. One more thing that might be useful is to give a few brief examples or use cases where it might be appropriate to not define a fallback.

So, if I am using .execute() for my Hystrix Command, and it throws an exception because there was no fallback defined, would that exception impact my Hystrix stats and circuit breaker status in the normal way, or would it just exit immediately?

mattrjacobs commented 9 years ago

Here are 2 examples of cases where not defining a fallback is appropriate:

1) A write. If you've got a HystrixCommand<Void> that does a write (over HTTP/DB/Cache etc), then the only reasonable fallback, given the type signature, is return null. That relays 0 info, so the caller doesn't know if the write succeeded/failed, which is bad.

Note that a better design in this case would be to use a HystrixCommand<Boolean>. In the run() method, don't catch an Exception. Just return true after the write completes. Return false in the fallback. That way, the caller gets notified about a write failure and can update its knowledge/state accordingly.

2) Batch systems/offline compute: When filling up a cache/generating a report/doing any sort of offline computation, it's usually more appropriate to retry on failure than to accept a silently-degraded response.
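The Boolean-returning write suggested under case 1 can be sketched as follows. This is a JDK-only toy model of the run()/getFallback() contract, not the Hystrix API: `BooleanCommandSketch`, `WriteCommandSketch`, and the `Consumer<String>` write target are hypothetical names chosen for illustration, and the real library does far more around each call.

```java
import java.util.function.Consumer;

// Toy model of the run()/getFallback() contract; NOT the Hystrix API.
abstract class BooleanCommandSketch {
    protected abstract Boolean run() throws Exception;

    protected abstract Boolean getFallback();

    // Any exception escaping run() routes the caller to the fallback value.
    public final Boolean execute() {
        try {
            return run();
        } catch (Exception e) {
            return getFallback();
        }
    }
}

final class WriteCommandSketch extends BooleanCommandSketch {
    private final Consumer<String> store; // hypothetical write target (HTTP/DB/cache)
    private final String value;

    WriteCommandSketch(Consumer<String> store, String value) {
        this.store = store;
        this.value = value;
    }

    @Override
    protected Boolean run() {
        store.accept(value); // do not catch: a failure must reach the fallback path
        return true;         // reached only once the write has completed
    }

    @Override
    protected Boolean getFallback() {
        return false;        // tells the caller the write did not succeed
    }
}
```

A caller can then branch on the result, for example `if (!new WriteCommandSketch(store, "v").execute()) { /* mark the write as pending */ }`, instead of receiving a null that carries no information.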

When you call execute() on a command without fallback, then that manifests to the caller as an Exception. It does update all of the usual Hystrix state, and circuit-breaker state/metrics are updated accordingly.
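As a rough illustration of that last point, the toy model below (JDK only, not Hystrix code) records the failure before it looks for a fallback, so a command without a fallback still feeds the breaker. `NoFallbackSketch`, the fixed `FAILURE_THRESHOLD`, and the simple failure counter are assumptions made for this sketch; they do not reproduce how Hystrix actually computes circuit health.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only; the threshold and counter are assumptions of this sketch.
abstract class NoFallbackSketch<T> {
    static final int FAILURE_THRESHOLD = 3; // assumed value for illustration

    private final AtomicInteger failures = new AtomicInteger();

    protected abstract T run() throws Exception;

    // Default models "no fallback defined".
    protected T getFallback() {
        throw new UnsupportedOperationException("No fallback available.");
    }

    final T execute() {
        if (circuitOpen()) {
            // Short-circuit: run() is skipped entirely.
            return fallbackOrThrow(new IllegalStateException("circuit open"));
        }
        try {
            return run();
        } catch (Exception e) {
            failures.incrementAndGet(); // metrics are updated before the fallback is tried
            return fallbackOrThrow(e);
        }
    }

    private T fallbackOrThrow(Exception cause) {
        try {
            return getFallback();
        } catch (UnsupportedOperationException none) {
            // With no fallback, the failure surfaces to the caller as an exception.
            throw new IllegalStateException("command failed and no fallback is defined", cause);
        }
    }

    int failureCount() {
        return failures.get();
    }

    boolean circuitOpen() {
        return failures.get() >= FAILURE_THRESHOLD;
    }
}
```

In this model, every failing call throws to the caller, the failure count still rises, and once the assumed threshold is reached later calls are rejected without invoking run() at all.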