census-instrumentation / opencensus-java

A stats collection and distributed tracing framework
https://opencensus.io
Apache License 2.0
674 stars 201 forks source link

100% cpu usage by OpenCensus.Disruptor-0 #1599

Open danielnorberg opened 5 years ago

danielnorberg commented 5 years ago

Please answer these questions before submitting a bug report.

What version of OpenCensus are you using?

0.17.0

What JVM are you using (java -version)?

java version "11.0.1" 2018-10-16 LTS Java(TM) SE Runtime Environment 18.9 (build 11.0.1+13-LTS) Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.1+13-LTS, mixed mode)

What did you do?

Normal opencensus use - low volume workload.

What did you expect to see?

Low cpu usage.

What did you see instead?

The OpenCensus.Disruptor-0 thread spinning and using a whole core even when the application is otherwise idle.

This manifested after we upgraded from Java 8 to Java 11.

songy23 commented 5 years ago

@danielnorberg @steveniemitz I've made a new SNAPSHOT release which included https://github.com/census-instrumentation/opencensus-java/pull/1605. Can you try it and see if it resolved this issue?

The SNAPSHOT release is under https://oss.sonatype.org/content/repositories/snapshots/io/opencensus/. For example https://oss.sonatype.org/content/repositories/snapshots/io/opencensus/opencensus-api/0.18.0-SNAPSHOT/.

To access it, please add the following lines to ~/.m2/settings.xml:

<profiles>
  <profile>
     <id>allow-snapshots</id>
        <activation><activeByDefault>true</activeByDefault></activation>
     <repositories>
       <repository>
         <id>snapshots-repo</id>
         <url>https://oss.sonatype.org/content/repositories/snapshots</url>
         <releases><enabled>false</enabled></releases>
         <snapshots><enabled>true</enabled></snapshots>
       </repository>
     </repositories>
   </profile>
</profiles>

Then you should be able to use 0.18.0-SNAPSHOT as the OpenCensus version in your gradle/pom file.

danielnorberg commented 5 years ago

With opencensus-impl-0.18.0-20181121.224929-2.jar the OpenCensus.Disruptor-0 thread cpu usage has dropped down to around %10 for me on java 11.

bogdandrutu commented 5 years ago

@danielnorberg good to hear that. I think we should increase from 1ms to 4 or 8ms the sleeping time. Need to run some benchmarks to prove does not affect the high qps jobs. If you can try to increase that to 4 and 8 and see the impact would be great.

danielnorberg commented 5 years ago

@bogdandrutu I tried increasing the sleep from 1ms to 4ms and 8ms and got ~4% and ~2% cpu usage, respectively, when idle.

bogdandrutu commented 5 years ago

Do you think for the moment until we do a better fix 4% is acceptable? If yes I can do a PR or you can send me a PR. Thanks

steveniemitz commented 5 years ago

I've been running a custom build that replaced SleepingWaitStrategy with BlockingWaitStrategy, which completely removes the CPU overhead of the disruptor thread. However is does add a very slight overhead to the actual tracing calls (since they now need to synchronize/notify the disruptor). For my use-case this overhead was acceptable and better than burning a lot of CPU.

bogdandrutu commented 5 years ago

@danielnorberg @steveniemitz that would be another option. we can do that by having a manifest file to configure the "low-overhead" version which uses the BlockingWaitStrategy (or an environment variable). Do you prefer this? If so what do you think is better manifest or env variable? Also I will work to build a better implementation that will not require this.

steveniemitz commented 5 years ago

I do like the idea of being able to configure this more easily. It seems like the SleepingWaitStrategy does have some kinks in it that could be worked out though with some more tuning? It seems like even 4% is a very high overhead for this that most service owners wouldn't want. It also seems like things running in GKE (or AppEngine) are significantly more effected than things that aren't, are there specific kernel modifications there that are causing these issues?

EDIT: To directly address your question, I think I did see a PR around here that allowed configuring the strategy via a ServiceLoader (and manifest). That'd probably be the best solution.

bogdandrutu commented 5 years ago

Filed an issue on disruptor to allow us to support a better waiting strategy (this is for my long term solution) https://github.com/LMAX-Exchange/disruptor/issues/246. This is mostly an FYI and to keep track of the progress here.

@steveniemitz that PR has not made too much progress. Happy to start a new PR for this if you think that can be a good short term solution (maybe can still be useful after I can implement my long term solution).

danielnorberg commented 5 years ago

Allowing users to choose the appropriate wait strategy seems like it would be appreciated by some, although I suspect many application developers might not be familiar enough with disruptor and busy-wait/park/yield strategies, contention costs etc to be comfortable with making this choice. I guess most users will want a non-surprising default that performs well enough in most cases.

Defaulting to BlockingWaitStrategy seems like a good choice that would mitigate the unexpected cpu usage of the consumer thread for applications with low-volume workloads. The need for producer threads to synchronize is unfortunate though. Seeing as we have a single consumer thread, perhaps we can get away with a cheaper signaling mechanism. Took a stab at that here https://github.com/census-instrumentation/opencensus-java/pull/1618.

xocasdashdash commented 4 years ago

We've recently managed to hit a very similar issue to this one at my company. The difference is that we're running java8, the issue showed up when we updated from jdk:8u232 to jdk:8u242.

We're going to try to fix by updating the opencensus library.