Comcast / jrugged

A Java libary of robustness design patterns
Apache License 2.0
265 stars 93 forks source link

CircuitBreaker not thread safe and gets permanently stuck in a HALF_CLOSED state #45

Closed marto closed 5 years ago

marto commented 8 years ago

We have been using the library in production and found a few issues with the CircuitBreaker class implementation not being thread safe.

Looking at the source, I believe there are a number of concurrency issues with the class around the manipulation of state, isAttemptLive, isHardTrip, byPass fields. From my experience it looks like it's not thread safe, although hard to describe why. Here is an attempt to describe what we observed in a production scenario:

On occasion when the caller is highly concurrent (i.e. runs on many threads) the CircuitBreaker get's stuck in a HALF_CLOSED state. It's stuck in a HALF_CLOSED state because the isAttemptLive field is not atomic and is flipped in both the close, trip methods (when called from a successful or a unsuccessful invoke method) while at the same time (or out of order) the synchronized canAttempt also manipulates the isAttemptLive state.

It's possible for the close or trip methods to be called out of order to the synchronized canAttempt method. This can cause a situation where sate=HALF_CLOSED, isAttemptLive=true but the actual request has completed. Thus the isAttemptLive=true will never be flipped back as there is not attempt being made. The result is that the CB remains forever in a HALF_CLOSED state. I have verified existence of this situation of CircuitBreaker being in this state once in a debugging situation (jrugged-core version = 3.2.2).

joercampbell commented 8 years ago

Yeah - I think someone else was having a very similar problem with the codebase. What version are you using in your environment?

Also - can you pull the recent master and see if the recent changes made around this solve the problem you are describing? We made some minor changes as a result of someone elses commenting about a similar problem where the isAttemptLive was being manipulated incorrectly when it didn't need to be causing inconsistent behavior.

Joe

On Wed, Mar 30, 2016 at 10:51 AM Martin Petrovsky notifications@github.com wrote:

We have been using the library in production and found a few issues with the CircuitBreaker class implementation not being thread safe.

Looking at the source, I believe there are a number of concurrency issues with the class around the manipulation of state, isAttemptLive, isHardTrip, byPass fields. From my experience it looks like it's not thread safe, although hard to describe why. Here is an attempt to describe what we observed in a production scenario:

On occasion when the caller is highly concurrent (i.e. runs on many threads) the CircuitBreaker get's stuck in a HALF_CLOSED state. It's stuck in a HALF_CLOSED state because the isAttemptLive field is not atomic and is flipped in both the close, trip methods (when called from a successful or a unsuccessful invoke method) while at the same time (or out of order) the synchronized canAttempt also manipulates the isAttemptLive state.

It's possible for the close or trip methods to be called out of order to the synchronized canAttempt method. This can cause a situation where sate=HALF_CLOSED, isAttemptLive=true but the actual request has completed. Thus the isAttemptLive=true will never be flipped back as there is not attempt being made. The result is that the CB remains forever in a HALF_CLOSED state. I have verified existence of this situation of CircuitBreaker being in this state once in a debugging situation.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/Comcast/jrugged/issues/45

bstaheri commented 7 years ago

It seems that the changes in class CircuitBreaker in revision #37 solve the problem. Do you have a plan to release a version? Because having this issue, the library cannot be used in production; every time isAttemptLive is incorrectly set to true, the software enters an incorrect final state and needs a restart, which is not acceptable in production environment.

joercampbell commented 5 years ago

Releasing a new version here shortly.