Elasticsearch currently uses apm-java-agent as the underlying implementation in the apm module. We use our own APM API, implemented in the apm module on top of the OTel API. This should not change.
What should change is the binding between the OTel API and its implementation, which should become the OTel SDK. The OTel SDK gives us more flexibility in configuring how our metrics and traces are sent to APM Server (APM Server supports the OTel SDK).
With the OTel SDK we will be able to implement features such as 'tee-ing' the export (splitting it across two APM Servers), additional buffering, or retries when APM Server is overloaded.
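To make the 'tee-ing' idea concrete, here is a minimal sketch using only the JDK. `BatchExporter` and `TeeExporter` are hypothetical names for illustration, not OTel SDK types or anything from the PoC; a real version would presumably implement the SDK's exporter interfaces instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an exporter that ships one batch to one APM Server. */
@FunctionalInterface
interface BatchExporter<T> {
    /** Returns true if the batch was accepted. */
    boolean export(List<T> batch);
}

/** Forwards every batch to all delegates; one failing target does not block the others. */
final class TeeExporter<T> implements BatchExporter<T> {
    private final List<BatchExporter<T>> delegates;

    TeeExporter(List<BatchExporter<T>> delegates) {
        this.delegates = new ArrayList<>(delegates); // defensive copy
    }

    @Override
    public boolean export(List<T> batch) {
        boolean allOk = true;
        for (BatchExporter<T> delegate : delegates) {
            try {
                allOk &= delegate.export(batch);
            } catch (RuntimeException e) {
                // Assumption: a failure on one target should not stop delivery to the rest.
                allOk = false;
            }
        }
        return allOk;
    }
}
```

Buffering or retries could be layered in the same way, by wrapping an individual delegate rather than changing the tee itself.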
[ ] 3. Rework of logging. The OTel SDK logs through JUL. We already have a bridge in server: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/common/logging/JULBridge.java If we refactor it into a lib, it should be possible to use it in the module. I wasn't able to make it work without adding a JUL-to-log4j bridge dependency (which we don't want), but I was also rushing it. We want to make sure we keep using the same apm_agent.json file (probably not worth renaming). This is a good example of where plugins might want to add an appender.
[ ] 4. Careful review of the dependencies. The OTel SDK pulls in quite a lot. Some dependencies, such as okhttp, do not work with Java 9 modules; others introduce version clashes (netty is already a dependency of server). None of this should be a problem if we hide the implementation behind the embedded classloader, as we do for x-content and jackson.
[ ] 5. Because of its use of Java Beans, the Java 9 module requires java.desktop. This feels awkward, but I am not sure how to get around it.
[ ] 6. The OTel SDK's buildAndRegisterGlobal can only be called once. If it is called twice, an exception is thrown. We need to make sure that starting/stopping the metering (which is possible now) will not trigger this exception.
[ ] 7. apm-java-agent gives us a bunch of out-of-the-box metrics for the JVM. I copied JvmFdMetrics from the apm-java-agent repo. Perhaps we need to work with the APM team to have this as a lib? Just copying JvmFdMetrics, JvmGcMetrics and JvmMemoryMetrics could work initially, but feels dirty. The naming there also has to comply with our naming convention (we would register them using our own API).
[ ] 8. BIG - review the security manager permissions. This will be a relatively tedious and long task, as there are a lot of new dependencies. For the PoC I disabled the security manager.
[ ] 9. New exporters. We could simply configure the exporters available out of the box (adding two of them to support exporting to two APM Servers), or implement our own so that we have more control over logging etc.
[ ] 10. The OTel SDK exporters support the HTTP and gRPC protocols. APM Server works with both. We need to decide on one.
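For item 6 above, one possible shape for the guard is sketched below with plain JDK types. `OnceOnlySdkHolder` is a hypothetical name, and the `Supplier` stands in for whatever call performs the one-time SDK registration; the assumption is that stopping the metering only flips a flag and never tears down or re-registers the SDK.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical holder: builds the SDK at most once, then only toggles an "enabled" flag. */
final class OnceOnlySdkHolder<T> {
    private final Supplier<T> factory; // stands in for the one-shot registration call
    private final AtomicBoolean enabled = new AtomicBoolean(false);
    private T sdk; // guarded by "this"

    OnceOnlySdkHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Safe to call repeatedly; the factory runs only on the first call. */
    synchronized T start() {
        if (sdk == null) {
            sdk = factory.get();
        }
        enabled.set(true);
        return sdk;
    }

    /** Disables metering without touching the already registered SDK. */
    void stop() {
        enabled.set(false);
    }

    boolean isEnabled() {
        return enabled.get();
    }
}
```

Callers would check `isEnabled()` before recording, so a stop/start cycle never reaches the registration call a second time.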
Being able to set a custom interval for certain metrics
It is possible to create multiple MetricReaders (these trigger the export), each with a different interval. We could therefore add custom filtering that decides which metrics are read by which MetricReader, and thus exported at different intervals. This does not seem trivial, though.
Change the metric interval dynamically
It is possible to provide a custom java.util.concurrent.ScheduledExecutorService. If we implement logic that cancels the previously scheduled task and submits a new one with a different interval, this should be doable.
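The cancel-and-resubmit logic described above can be sketched with the JDK alone. `ReschedulableTask` is a hypothetical helper, and how it would be wired into the SDK's reader is left open here; the `Runnable` stands in for whatever triggers one metric collection.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: re-registers a periodic task whenever the interval changes. */
final class ReschedulableTask {
    private final ScheduledExecutorService scheduler;
    private final Runnable task; // stands in for one metric collection/export cycle
    private ScheduledFuture<?> current; // guarded by "this"

    ReschedulableTask(ScheduledExecutorService scheduler, Runnable task) {
        this.scheduler = scheduler;
        this.task = task;
    }

    /** Cancels the previous schedule (if any) and submits the task with the new interval. */
    synchronized ScheduledFuture<?> setInterval(long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (current != null) {
            current.cancel(false); // do not interrupt a run that is already in flight
        }
        current = scheduler.scheduleAtFixedRate(task, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        return current;
    }

    /** Stops periodic execution; the scheduler itself is owned by the caller. */
    synchronized void stop() {
        if (current != null) {
            current.cancel(false);
            current = null;
        }
    }
}
```

A dynamic-setting listener could call `setInterval` with the new value; the first run after a change happens one full new interval later, which may or may not be the desired behaviour.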
Make sure metrics are sent upon node shutdown
There is a force-flush mechanism. I am not sure whether it is possible to set a timeout on it, though.
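If the flush itself turns out not to accept a timeout, the wait can still be bounded from the outside. The sketch below uses only the JDK; `BoundedFlush` is a hypothetical helper and the `Runnable` stands in for a blocking flush call. The SDK's `CompletableResultCode` has a `join(timeout, unit)` method that may make such a wrapper unnecessary, which is worth checking first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: waits for a flush, but never longer than the given timeout. */
final class BoundedFlush {
    private BoundedFlush() {}

    /** Returns true if the flush completed normally in time, false on timeout or failure. */
    static boolean flushWithTimeout(Runnable flush, long timeout, TimeUnit unit) {
        CompletableFuture<Void> pending = CompletableFuture.runAsync(flush);
        try {
            pending.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false; // stop waiting so that shutdown can proceed
        } catch (ExecutionException e) {
            return false; // the flush itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

Note that a timed-out flush keeps running in the background; the sketch only bounds how long shutdown waits for it.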
Description
I worked on a simple, very dirty PoC that shows this works: https://github.com/elastic/elasticsearch/pull/110263 Things that need more investigation and work: