Netflix / Hystrix

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.
24.13k stars 4.71k forks source link

cpu% high #1424

Closed ssakshi0302 closed 7 years ago

ssakshi0302 commented 7 years ago

Hi, too many RXComputationalScheduler threads are getting created and my application's cpu% is going high after using hystrix.

I am using this configuration

"groupName": "recsagservice", "threadPoolCoreSize": 5, "cirBreakErrorThresPerc" : 50, "cirBreakReqVolThres":80, "cirBreakSleepTimeInMilliSec":72000 .

Please guide me what could be wrong here. I am attaching thread dump.

thread_hystrix.txt

kennedyoliveira commented 7 years ago

The RXComputationalScheduler has nothing to do with Hystrix, it's a RxJava thread, and you will has 1 for each CPU you has in your machine, so you probably has an 8-core machine that's why you get 8 threads.

About your cpu% going high you will need to provide more information about what you are doing because on the thread dump there are only threads waiting, no threads doing particularly nothing and shouldn't consume your cpu as you say.

ssakshi0302 commented 7 years ago

Hi,

Everytime my client method is called, I am just creating a new HystrixCommand object and calling its execute method. I have not called any reset method . This client method frequency is 45k times in an hour.

This is how I am using hystrix:

            Setter group =    Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("mygroup")).andCommandKey(HystrixCommandKey.Factory.asKey("myCommand")).andThreadPoolKey(                HystrixThreadPoolKey.Factory.asKey("myGroup")).andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter().withCoreSize(5)).andCommandPropertiesDefaults(                HystrixCommandProperties.Setter().withCircuitBreakerErrorThresholdPercentage(50).withCircuitBreakerRequestVolumeThreshold(20).withCircuitBreakerSleepWindowInMilliseconds( 72000));

            HystrixExternalCommand myHystrixCommand = new HystrixCommand( group  );             
             myresponse = myHystrixCommand.execute();

From recording its looking like Stack Trace "rx.observers.Subscribers$5.onNext(Object)" is called 550 times in a min which is accounting for 38% of cpu usage.
recording

kennedyoliveira commented 7 years ago

@ssakshi0302 i think you are misinterpreting the percentage value, this % is not the total amount of CPU used overall, it's the total usage from the overall, i think i'm not explaining it clear, i'll try to exemplify: Let's say your code used 5% of your CPU, and looking at the table there it's saying 38% was spent on this method, so it's 38% from the 5%.

First, check in the General tab and see how the CPU was used, and after if it's really using too much CPU then you could check on the Code section.

ssakshi0302 commented 7 years ago

Hi ,the average cpu usage is 89.8%. As shown in the earlier sceenshot , rx.internal.operators are accounting for 38% of this 89.8% and also if wecalculate the total % by rx packages i.e (rx.internal.operators ,rx.subjects, rx.internal.util.unsafe, rx.observers) it will come around (38.25 + 14.395 + 7.928 + 2.92 ) i.e 63.4% of 89.8%.

kennedyoliveira commented 7 years ago

@ssakshi0302 maybe i'm missing something, but i can't see your cpu average in the screenshot, you should be looking at:

oracle_java_mission_control

ssakshi0302 commented 7 years ago

general_tab Here it is :)

kennedyoliveira commented 7 years ago

@ssakshi0302 now i see haha, thanks, this looks trick, what are you hystrix command doing? Executing some i/o task or something else?

ssakshi0302 commented 7 years ago

Its just a simple microservice call to a server . I am wondering if something is wrong with my metrics which is causing so . I am attaching memory screenshot also. You may find it useful. memory_usage

allocationbythread allocation_profile

ssakshi0302 commented 7 years ago

@kennedyoliveira : Any clue on this?

kennedyoliveira commented 7 years ago

@ssakshi0302 apologies for the delay, it's look trick to me, i'm not the best person to help with metrics, since i don't know the internals, @mattrjacobs can you help him with his doubt on it?

which hystrix version are you using?

ssakshi0302 commented 7 years ago

Hey I am using version 1.5.3

On Sun, 20 Nov 2016 at 2:01 AM, Kennedy Oliveira notifications@github.com wrote:

@ssakshi0302 https://github.com/ssakshi0302 apologies for the delay, it's look trick to me, i'm not the best person to help with metrics, since i don't know the internals, @mattrjacobs https://github.com/mattrjacobs can you help him with his doubt on it?

which hystrix version are you using?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Netflix/Hystrix/issues/1424#issuecomment-261737749, or mute the thread https://github.com/notifications/unsubscribe-auth/AImAlm3wBeD9BbbVEuhJSxhOLtmKzrdgks5q_1yWgaJpZM4K2s_L .

kennedyoliveira commented 7 years ago

@ssakshi0302 could you try to update to 1.5.8? I remember some weird problems in an old version, but i'm not sure if was about cpu performance.

ssakshi0302 commented 7 years ago

@kennedyoliveira : Thanks Oliveira , I will change the version but I wanted to understand why this could have occured and what does this stack indicate. Was it due to any flaw in metrics ?

ssakshi0302 commented 7 years ago

@mattrjacobs : Hi , What does the thread RxComputaionalScheduler do?I ould not find much in documentation. Can you please help us out here as we are facing production issues

kennedyoliveira commented 7 years ago

@ssakshi0302 i don't know if there are flaw in the metrics, because i don't know how the metrics works inside, i mean the implementation details, matt will help you better with that.

About the RxComputationalScheduler, is a fixed size scheduler with total threads equal as total core cpus on your machine, intended to be used for doing computational tasks, it's from RxJava.

In your screenshot from the allocation tab, could you print the stacktrace all the way up? Because it shows the internal operators calls but not who called it, may give a hint.

ssakshi0302 commented 7 years ago

@kennedyoliveira : Can you mention some email id , so that I can send flight recording itself. It will help you analyse better.

mattrjacobs commented 7 years ago

Rx Computation threads are set up on any JVM which uses RxJava. Since Hystrix uses RxJava, the threads get set up. Within Hystrix they are used for metrics collection/aggregation, which happens in the background. In our internal systems, we've never seen any performance issues from their usage.

In the attached thread dump, none of the RxComputation threads were busy, so that does not indicate over-usage of those threads.

Please attach a FlightRecorder trace to this issue and I can take a look.

ssakshi0302 commented 7 years ago

stack.txt

mattrjacobs commented 7 years ago

That is a stack trace for what is allocation memory to TLABs. From the rest of the discussion, I thought your concern was around CPU usage?

Can you please attach the entire jfr file?

ssakshi0302 commented 7 years ago

Hey, I am not able to attach any jfr file in github, I would be able to share that through email only. And i sent this memory allocation stack in reponse to kennedy's comment. And yes, the concern here is cpu usage :)

On Mon, Nov 21, 2016 at 11:46 PM, Matt Jacobs notifications@github.com wrote:

That is a stack trace for what is allocation memory to TLABs. From the rest of the discussion, I thought your concern was around CPU usage?

Can you please attach the entire jfr file?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Netflix/Hystrix/issues/1424#issuecomment-262021061, or mute the thread https://github.com/notifications/unsubscribe-auth/AImAll8cgnCqXFmpsM8WKFru9E-0iaxQks5rAd_9gaJpZM4K2s_L .

ssakshi0302 commented 7 years ago

flight_recording_3002371128084_6 (copy).txt

@mattrjacobs : Hi , I am attaching jfr file as txt file . Please change the extension to jfr and try

ssakshi0302 commented 7 years ago

@mattrjacobs : Any update ?

mattrjacobs commented 7 years ago

I took a look through the JFR you attached, and it looks as you described. A couple questions:

1) How many commands are in your system? 2) How many thread pools are in your system? 3) Have you changed any configuration from defaults (specifically around metrics)?

From my initial look, all of these code paths look valid, but it's not clear to me why they're taking as much CPU as they are. That hasn't been the case for the production system I work on, nor in the jmh tests that ship with Hystrix.

Could you run the jmh tests and take a flight-recorder trace during these?

mattrjacobs commented 7 years ago

Closing due to inactivity