elastic / apm-agent-java

https://www.elastic.co/guide/en/apm/agent/java/current/index.html
Apache License 2.0

Support for JFR Profiling #1710

Open tobiasstadler opened 3 years ago

tobiasstadler commented 3 years ago

It would be nice if the agent could process events generated by the Java Flight Recorder (JFR) and send them to the server. Where possible, the events should be linked to traces/spans. The events should then be displayed in the APM app in Kibana in a nice way (e.g. like JDK Mission Control). In addition, it would be nice if one could start/stop profiling from Kibana, or even select which events should be gathered by the agent.

Datadog has a similar feature called Continuous Profiler.

felixbarny commented 3 years ago

We have plans to extend the integration with async-profiler that we currently use for profiler inferred spans.

Currently, we use async-profiler to create additional spans for slow methods but we do plan to extend the integration so that we can show flamegraphs for CPU, allocation, and lock profiling. We don't have a specific timeline for that yet. We'll probably make it configurable (also using the UI) whether to run the profiler continuously or just for a period of time.

Linking individual events, such as an object allocation, to a specific trace sounds like a cool feature but will most likely not be something we'd do in the first version. It's not impossible: for profiler-inferred spans, we already collect which span was active on which thread, and when. But storing that information would likely increase the storage requirements considerably, so it might not be a good tradeoff given that you'd probably mostly want to analyze allocations, CPU usage, and lock contention on a global (per-service) level. I'm not saying that drilling down to a specific trace group or even instance wouldn't be useful, but when you want to find out what slows down your service in general, you probably don't need to drill down into those details. And if you want to find out why a particular instance of a trace was slow, profiler-inferred spans can help you.
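To make the correlation idea concrete, here is a minimal sketch of a per-thread record of span activations that a profiler sample or event could be looked up against. It is not the agent's actual implementation: the class `SpanActivationTimeline`, its methods, and the use of plain string span IDs are all hypothetical. It also shows where the storage cost comes from, since every activation interval has to be kept.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: remembers which span was active on which thread, and when. */
class SpanActivationTimeline {

    /** One interval [startNanos, endNanos) during which a span was active on a thread. */
    static final class Activation {
        final String spanId;
        final long startNanos;
        final long endNanos;

        Activation(String spanId, long startNanos, long endNanos) {
            this.spanId = spanId;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
        }
    }

    // Assumption: each thread appends only to its own list, and lookups run
    // after recording has finished, so the inner lists are not synchronized.
    private final Map<Long, List<Activation>> byThread = new ConcurrentHashMap<>();

    /** Stores one activation interval for the given thread. */
    void record(long threadId, String spanId, long startNanos, long endNanos) {
        byThread.computeIfAbsent(threadId, id -> new ArrayList<>())
                .add(new Activation(spanId, startNanos, endNanos));
    }

    /** Returns the innermost span (latest start) that was active at the timestamp, if any. */
    Optional<String> activeSpanAt(long threadId, long timestampNanos) {
        List<Activation> activations =
                byThread.getOrDefault(threadId, Collections.<Activation>emptyList());
        Activation best = null;
        for (Activation a : activations) {
            boolean contains = a.startNanos <= timestampNanos && timestampNanos < a.endNanos;
            if (contains && (best == null || a.startNanos >= best.startNanos)) {
                best = a;
            }
        }
        return best == null ? Optional.<String>empty() : Optional.of(best.spanId);
    }
}
```

An event carrying a thread ID and a timestamp (an allocation sample, for example) could then be attributed to a span with `activeSpanAt(threadId, timestamp)`, at the price of retaining one entry per span activation.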

As for async-profiler vs JFR, some of the reasons why we chose async-profiler are that it can recover stack traces that JFR/AsyncGetCallTrace can't gather (such as JVM intrinsics, e.g. System.arraycopy; more details here: https://github.com/apangin/java-profiling-presentation), that it can capture kernel stacks, and that it's compatible with more Java versions.

tobiasstadler commented 3 years ago

My intention is not to replace async-profiler (which is a great tool), but to use JFR for things async-profiler can't do (as far as I know).

E.g. I would like to be able to do something like

try (RecordingStream rs = new RecordingStream()) {
    rs.enable("jdk.SafepointBegin").withThreshold(Duration.ofMillis(0));
    rs.enable("jdk.G1GarbageCollection").withThreshold(Duration.ofMillis(0));
    rs.enable("jdk.ObjectCount").withPeriod(Duration.ofSeconds(1));
    rs.enable(...).withPeriod(Duration.ofSeconds(1));
    ...

    rs.onEvent(re -> {
        // send event to apm-server/elasticsearch
    });
    rs.start();
}

or

new Recording().start();
Recording r = FlightRecorder.getFlightRecorder().takeSnapshot();
//send events to apm-server/elasticsearch

via the APM App in Kibana for a specific service or host or ...

It would be nice if I could select in Kibana which events should be recorded. It would be even nicer if custom events were supported. But I would also be happy with a fixed set of events.
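As a rough illustration of recording a selected set of events, including a custom one, here is a minimal sketch built on the `jdk.jfr` API (`Recording`, `RecordingFile`). The event `demo.OrderProcessed` and the helper class `SelectedEventRecorder` are made up for this example, and the set of event names stands in for whatever would be configured in Kibana. It dumps the recording to a temporary file instead of streaming, so it should also work on JDKs that lack `RecordingStream`.

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical application-defined event; the name "demo.OrderProcessed" is made up. */
@Name("demo.OrderProcessed")
@Label("Order Processed")
class OrderProcessedEvent extends Event {
    @Label("Order Id")
    long orderId;
}

/** Hypothetical helper: records only the requested event types while a task runs. */
class SelectedEventRecorder {

    static List<RecordedEvent> record(Set<String> eventNames, Runnable task) {
        try (Recording recording = new Recording()) {
            for (String name : eventNames) {
                recording.enable(name); // the selection would come from remote config
            }
            recording.start();
            task.run();
            recording.stop();

            Path dump = Files.createTempFile("selected-events", ".jfr");
            try {
                recording.dump(dump);
                List<RecordedEvent> selected = new ArrayList<>();
                for (RecordedEvent event : RecordingFile.readAllEvents(dump)) {
                    // Keep only the requested types; other event types that are
                    // enabled by default may also be present in the recording.
                    if (eventNames.contains(event.getEventType().getName())) {
                        selected.add(event); // an agent would serialize and send it here
                    }
                }
                return selected;
            } finally {
                Files.deleteIfExists(dump);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would commit the custom event inside the task, e.g. `OrderProcessedEvent e = new OrderProcessedEvent(); e.orderId = 42; e.commit();`, and read it back with `event.getLong("orderId")`.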

Since JFR was backported to JDK 8, support for it is not that bad anymore.

tobiasstadler commented 3 years ago

https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/jfr-streaming is roughly what I was thinking of.

felixbarny commented 3 years ago

Looks interesting! That creates metrics out of the JFR stream instead of ingesting them as single events, though.

Once we add support for the OTel metrics API (which is still in flux) we might just support the jfr-streaming module. For now, we're focussing on supporting the OTel tracing API first.

tobiasstadler commented 3 years ago

Mapping the events to metrics is a good start in my opinion. But I don't think every event can be mapped to a metric (e.g. thread creation).
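To illustrate what mapping events to metrics means here, a minimal sketch that folds events with a duration (a GC pause, say) into a count, total, and maximum per event name. The class `EventDurationMetrics` is hypothetical and is not how the jfr-streaming module does it. In practice it could be fed from `RecordingStream.onEvent` using `event.getEventType().getName()` and `event.getDuration()`.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: aggregates event durations into simple per-event-name metrics. */
class EventDurationMetrics {

    /** Running aggregate for one event name. */
    private static final class Summary {
        long count;
        long totalNanos;
        long maxNanos;
    }

    private final Map<String, Summary> summaries = new ConcurrentHashMap<>();

    /** Folds one event into the aggregate; the individual event is not retained. */
    void onEvent(String eventName, Duration duration) {
        long nanos = duration.toNanos();
        summaries.compute(eventName, (name, summary) -> {
            Summary s = summary == null ? new Summary() : summary;
            s.count++;
            s.totalNanos += nanos;
            s.maxNanos = Math.max(s.maxNanos, nanos);
            return s;
        });
    }

    long count(String eventName) {
        Summary s = summaries.get(eventName);
        return s == null ? 0 : s.count;
    }

    Duration total(String eventName) {
        Summary s = summaries.get(eventName);
        return s == null ? Duration.ZERO : Duration.ofNanos(s.totalNanos);
    }

    Duration max(String eventName) {
        Summary s = summaries.get(eventName);
        return s == null ? Duration.ZERO : Duration.ofNanos(s.maxNanos);
    }
}
```

An event like thread creation could at most be counted this way; its per-event details (which thread, created by whom) are lost in the aggregate, which is why ingesting single events would still be useful.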

Also the integration with Kibana (visualization of the data, dynamically changing the captured events, …) is missing.

tobiasstadler commented 3 years ago

Having something like https://github.com/flight-recorder/health-report in Kibana would be nice