census-instrumentation / opencensus-java

A stats collection and distributed tracing framework
https://opencensus.io
Apache License 2.0
673 stars, 202 forks

Disruptor thread thrashing in containers on OS X. #801

Closed. hsyed closed this issue 6 years ago

hsyed commented 6 years ago

What version of OpenCensus are you using?

0.8, combined with grpc-java 1.7.0 and the correct netty stack.

What JVM are you using (java -version)?

The current distroless OpenJDK Java image (about 20% of a core thrashing), and an Alpine glibc Oracle JDK 8 image.

What did you do?

ZPages is active along with the "always sample" sampler; the server is idling in Docker.

What did you expect to see?

< 0.10% core usage -- just as when running on OS X.

What did you see instead?

About 20% of a core thrashing with OpenJDK, and 100% of a core with alpine-oracle.

The project artefacts are built with Bazel. It's possible that I am missing something, or that the slimmed-down images are missing some native libraries?

bogdandrutu commented 6 years ago

@hsyed thanks for reporting this. Would it be possible to share a hello-world app where I can reproduce/debug this problem (if you already have something)?

Thanks

hsyed commented 6 years ago

So I am on High Sierra with a Mac Pro.

The problem shows up in two applications. Both have a lot of DI and use a Bazel build, so it's hard to extract, but the problem should be simple enough to reproduce in the gRPC repo.

While debugging, I switched from distroless to our alpine glibc oracle jdk8 base image and saw the problem get roughly ten times worse in JMX. The service is idling and the disruptor thread maxes out a core.

  // Get some tracing going; run after the gRPC service is started.
  Tracing.getTraceConfig().updateActiveTraceParams(
      TraceParams.DEFAULT.toBuilder().setSampler(Samplers.alwaysSample()).build());
  Tracing.getExportComponent().getSampledSpanStore().registerSpanNamesForCollection(
      PciServiceGrpc.getServiceDescriptor().getMethods().stream()
          .map(m -> m.getFullMethodName())
          .collect(Collectors.toList()));

I unwired everything up to the gRPC AbstractServerImplBuilder::build() call. I unhooked TLS etc. so it was just a ...forPort(123).build. I then removed the tracing setup block above and the ZPages module / Zipkin module, and took the opencensus impl jars out of the Bazel target. The problem goes away.

In addition to the image linked in the first post, this is the OpenJDK base image from distroless. The quickest way to reproduce should be to add rules_docker to the gRPC examples in the gRPC repo, along with the opencensus impl jars, and just launch it. It might be an idea to start a Bazel workspace in this repo.
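
For anyone trying to confirm which thread is burning CPU without attaching a JMX console, the same per-thread figures can be read in-process through the standard `java.lang.management.ThreadMXBean`. The following is only a rough sketch: the class name `ThreadCpuSampler` and the one-second window are illustrative assumptions, and it relies on the JVM supporting thread CPU time measurement.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenCensus or gRPC.
final class ThreadCpuSampler {

    /** Returns CPU nanoseconds consumed per thread name during the sampling window. */
    static Map<String, Long> sample(long windowMillis) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // First reading: CPU time per thread id (-1 if the thread is gone or timing is disabled).
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }

        Thread.sleep(windowMillis);

        // Second reading: keep only threads that were alive at both points.
        Map<String, Long> delta = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            Long start = before.get(id);
            long end = mx.getThreadCpuTime(id);
            ThreadInfo info = mx.getThreadInfo(id);
            if (start == null || start < 0 || end < 0 || info == null) {
                continue;
            }
            delta.merge(info.getThreadName(), end - start, Long::sum);
        }
        return delta;
    }

    public static void main(String[] args) throws InterruptedException {
        // With a 1000 ms window, 1e7 ns of CPU time corresponds to 1% of one core.
        sample(1000).entrySet().stream()
            .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
            .limit(5)
            .forEach(e -> System.out.printf("%-40s %6.1f%% of a core%n",
                e.getKey(), e.getValue() / 1e7));
    }
}
```

Calling `sample(...)` from inside the idling server would list the busiest threads by name, which is one way to check whether the disruptor thread is the one spinning.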

HailongWen commented 6 years ago

Hi @hsyed, I tried but could not reproduce the issue. Could you help me check which steps I missed?

I did the following on my Mac Pro (Sierra 10.12.6):

  1. Check out a fresh grpc-java 1.7.0 repo.
  2. Set up rules_docker under examples (changing the existing WORKSPACE and BUILD.bazel).
  3. Explicitly add the jars for impl, impl_core, disruptor, exporter-trace-logging and contrib-zpages (all at version 0.8.0) to the target.
  4. Add the instrumentation code you provided to HelloWorldServer.java.
  5. Run the script to start the server in Docker.

I observed nothing unusual. I also tried the alpine glibc oracle jdk 8 image, but everything still looked fine.

bogdandrutu commented 6 years ago

@hsyed ping on this

matthewrj commented 6 years ago

I am seeing a similar issue, both on my Ubuntu machine and on Linux boxes in the cloud (https://cloud.google.com/container-optimized-os/docs/). I have created a minimal project to reproduce the issue here.

According to the LMAX docs, SleepingWaitStrategy with one consumer shouldn't be using 100% CPU so there must be something going wrong somewhere.
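
For context, a sleeping wait strategy usually backs off in phases: it spins, then yields, then parks the thread for a short interval before checking again. The sketch below is a simplified, stand-alone illustration of that pattern and is not the Disruptor's actual code; the class name `BackoffWaiter`, the retry budget and the park interval are all assumptions. It shows why the park interval matters: while idle, the loop wakes up after every park, so if a very short park returns almost immediately on a given platform, the consumer thread can end up close to busy-spinning.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Hypothetical illustration of a spin / yield / park backoff loop.
final class BackoffWaiter {
    private static final int RETRIES = 200; // assumed budget: first half spins, second half yields

    private final long parkNanos;
    long parks; // how many times the loop had to park, for inspection

    BackoffWaiter(long parkNanos) {
        this.parkNanos = parkNanos;
    }

    /** Blocks until cursor >= sequence and returns the cursor value that was observed. */
    long waitFor(long sequence, AtomicLong cursor) {
        int counter = RETRIES;
        long available;
        while ((available = cursor.get()) < sequence) {
            if (counter > RETRIES / 2) {
                counter--;                        // phase 1: busy-spin
            } else if (counter > 0) {
                counter--;
                Thread.yield();                   // phase 2: give up the time slice
            } else {
                parks++;
                LockSupport.parkNanos(parkNanos); // phase 3: park, then re-check
            }
        }
        return available;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong cursor = new AtomicLong(0);
        BackoffWaiter waiter = new BackoffWaiter(100_000L); // 100 microseconds, chosen arbitrarily

        // Producer publishes sequence 1 after a short delay.
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            cursor.set(1);
        });
        producer.start();

        long seen = waiter.waitFor(1, cursor);
        producer.join();
        System.out.println("saw sequence " + seen + " after " + waiter.parks + " parks");
    }
}
```

Running it with different `parkNanos` values and comparing the reported number of parks gives a rough feel for how often an idle consumer wakes up on a particular OS and JVM.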

matthewrj commented 6 years ago

Fixed by https://github.com/LMAX-Exchange/disruptor/issues/219

hsyed commented 6 years ago

Aah fantastic, sorry I disappeared. I will try to test next week.

bogdandrutu commented 6 years ago

@hsyed the official fix (https://github.com/LMAX-Exchange/disruptor/issues/219) is not yet released.

bogdandrutu commented 6 years ago

Sorry all for taking so long (the main issue was that we couldn't reproduce this, probably because of the version of Java that we use). We will do a 0.12.2 release today that will include this fix.