Closed by hsyed 6 years ago
@hsyed thanks for reporting this. Would it be possible to share a hello-world app where I can reproduce/debug this problem (if you already have something)?
Thanks
So I am on High Sierra with a Mac Pro.
The problem shows up in two applications. Both have a lot of DI and are Bazel builds, so it's hard to extract, but the problem is simple enough to reproduce in the gRPC repo.
While debugging I switched from distroless to our alpine glibc Oracle JDK 8 base image and saw the problem roughly 10x worse in JMX. The service is idling and the disruptor thread maxes out a core.
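For anyone who wants to confirm which thread is burning CPU without attaching a JMX console, here is a minimal sketch using the standard `ThreadMXBean`. The class name `ThreadCpuReport` and its `busiestThreads` method are hypothetical helpers made up for this illustration; they are not part of OpenCensus, gRPC, or the Disruptor.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: lists live threads ordered by accumulated CPU time.
final class ThreadCpuReport {
    static final class Entry {
        final String name;
        final long cpuNanos;

        Entry(String name, long cpuNanos) {
            this.name = name;
            this.cpuNanos = cpuNanos;
        }
    }

    static List<Entry> busiestThreads(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Entry> entries = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported()) {
            return entries; // this JVM cannot report per-thread CPU time
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);     // -1 if the thread died or measurement is disabled
            ThreadInfo info = mx.getThreadInfo(id); // null if the thread is no longer alive
            if (cpu >= 0 && info != null) {
                entries.add(new Entry(info.getThreadName(), cpu));
            }
        }
        // Busiest thread first.
        entries.sort(Comparator.comparingLong((Entry e) -> e.cpuNanos).reversed());
        return new ArrayList<>(entries.subList(0, Math.min(limit, entries.size())));
    }

    public static void main(String[] args) {
        for (Entry e : busiestThreads(5)) {
            System.out.printf("%-40s %d ms%n", e.name, e.cpuNanos / 1_000_000);
        }
    }
}
```

Calling `busiestThreads` twice a few seconds apart inside the idling service and comparing the values should show whether a single thread keeps accumulating CPU time.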
// get some tracing going, run after grpc service is started.
Tracing.getTraceConfig().updateActiveTraceParams(TraceParams.DEFAULT.toBuilder().setSampler(Samplers.alwaysSample()).build());
Tracing.getExportComponent().getSampledSpanStore().registerSpanNamesForCollection(
PciServiceGrpc.getServiceDescriptor().getMethods().stream().map(m -> m.getFullMethodName()).collect(Collectors.toList())
);
I unwired everything up to the gRPC AbstractServerImplBuilder::build() call. I unhooked the TLS etc. so it was just a ...forPort(123).build. I then removed the tracing setup block above and the ZPages module / Zipkin module, and took the opencensus impl jars out of the Bazel target. The problem goes away.
In addition to the image linked in the first post, this is the OpenJDK base image from distroless. The quickest way to reproduce should be to add rules_docker to the gRPC examples in the gRPC repo, along with the opencensus impl jars, and just launch it. It might be an idea to start a Bazel workspace in this repo.
Hi @hsyed, I tried but could not reproduce the issue. Would you please help check what steps I missed?
I've done these on my Mac Pro (Sierra 10.12.6):
grpc-java 1.7.0 repo.
rules_docker under examples (changing the existing WORKSPACE and BUILD.bazel).
Nothing unusual is observed. I've also tried the alpine glibc Oracle JDK 8 but everything is still good.
@hsyed ping on this
I am experiencing a similar issue. I see it on my Ubuntu machine and also on Linux boxes in the cloud (https://cloud.google.com/container-optimized-os/docs/). I have created a minimal project to reproduce the issue here.
According to the LMAX docs, SleepingWaitStrategy with one consumer shouldn't be using 100% CPU so there must be something going wrong somewhere.
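To make the expectation concrete, here is a rough, stdlib-only sketch of the spin, then yield, then park back-off pattern that a sleeping wait strategy follows. It is loosely modeled on that idea and is not the LMAX implementation; the class `BackoffWaiter`, its parameters, and the split of retries between spinning and yielding are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch of a spin -> yield -> park consumer wait loop.
final class BackoffWaiter {
    private final int retries;    // iterations spent spinning/yielding before parking
    private final long parkNanos; // how long each park is asked to last

    BackoffWaiter(int retries, long parkNanos) {
        this.retries = retries;
        this.parkNanos = parkNanos;
    }

    /**
     * Waits until published >= wanted or the timeout elapses.
     * Returns how many times the loop reached the park phase.
     */
    long awaitAtLeast(AtomicLong published, long wanted, long timeoutNanos) {
        long deadline = System.nanoTime() + timeoutNanos;
        int counter = retries;
        long parks = 0;
        while (published.get() < wanted && System.nanoTime() - deadline < 0) {
            if (counter > retries / 2) {
                counter--;                        // phase 1: busy spin
            } else if (counter > 0) {
                counter--;
                Thread.yield();                   // phase 2: give up the time slice
            } else {
                LockSupport.parkNanos(parkNanos); // phase 3: sleep briefly
                parks++;
            }
        }
        return parks;
    }
}
```

On an idle queue the loop spends nearly all its time in phase 3, so CPU usage depends on how long each park really lasts. If a very short park returns almost immediately on a given OS/JVM combination, the returned park count grows very large and the loop behaves much like a busy spin. That may be what is happening here, though this sketch does not confirm it.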
Aah fantastic, sorry I disappeared. I will try to test next week.
@hsyed the official fix (https://github.com/LMAX-Exchange/disruptor/issues/219) is not yet released.
Sorry all for taking so long (the main issue was that we couldn't reproduce this, probably because of the version of Java that we use). We will do a 0.12.2 release today that will include this fix.
What version of OpenCensus are you using?
0.8, combined with grpc-java 1.7.0 and the correct netty stack.
What JVM are you using (java -version)?
Current distroless OpenJDK Java image (20% CPU core thrashing), alpine glibc Oracle JDK 8 see.
What did you do?
ZPages is active along with "always sample"; the server is idling in Docker.
What did you expect to see?
< 0.10% core usage -- just as when running on OS X.
What did you see instead?
A core thrashing at 20% with OpenJDK and at 100% with alpine-oracle.
The project artefacts are built with Bazel. It's possible I might be missing something, or that the slimmed-down images are missing some native libraries?