Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.35k stars 1.99k forks source link

[BUG] No telemetry exported form a Scala notebook when using AzureMonitorExporterBuilder #41856

Open greatvovan opened 1 month ago

greatvovan commented 1 month ago

Describe the bug When running a Scala notebook in a Databricks cluster, instrumentation using AzureMonitorExporterBuilder does not produce any output in the App Insights backend.

Exception or Stack Trace No errors.

To Reproduce

  1. Have a Databricks workspace
  2. Have an App Insights instance
  3. Create a smallest compute cluster with default configuration (I used "Personal Compute" policy).
  4. Install com.azure:azure-monitor-opentelemetry-exporter:1.0.0-beta.30 in the cluster.
  5. Create a Scala notebook
  6. Run the below code in the notebook (copy in a cell and run)

Code Snippet

  val aiConnStr = "..."

  import org.apache.logging.log4j.LogManager
  import org.apache.logging.log4j.Level
  import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilderFactory
  import org.apache.logging.log4j.core.config.Configurator
  import io.opentelemetry.instrumentation.log4j.appender.v2_17.OpenTelemetryAppender
  import io.opentelemetry.api.common.Attributes
  import io.opentelemetry.api.common.AttributeKey
  import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk
  import com.azure.monitor.opentelemetry.exporter.AzureMonitorExporter

  val configBuilder = ConfigurationBuilderFactory.newConfigurationBuilder()
  val configuration = configBuilder
    .add(
      configBuilder
        .newAppender("Console", "CONSOLE")
    )
    .add(
      configBuilder
        .newAppender("Otel", "OpenTelemetry")
    )
    .add(
      configBuilder 
        .newRootLogger(Level.INFO)
        .add(configBuilder.newAppenderRef("Console"))
        .add(configBuilder.newAppenderRef("Otel"))
    )
    .build(false)
  Configurator.initialize(configuration)

  val sdkBuilder = AutoConfiguredOpenTelemetrySdk.builder()

  AzureMonitorExporter.customize(sdkBuilder, aiConnStr)

  val openTelemetry = sdkBuilder.build().getOpenTelemetrySdk()
  OpenTelemetryAppender.install(openTelemetry)
  println("Configured")

  val logger = LogManager.getLogger()

  logger.info("ScalaEA Log outside span")
  println("log sent")

  val attributes = Attributes.of[java.lang.String, java.lang.Long](
    AttributeKey.stringKey("foo"), "bar",
    AttributeKey.longKey("code"), 42L
  )

  val tracer = openTelemetry.getTracer("my-notebook")
  val span = tracer.spanBuilder("My-span-ScalaEA").startSpan()
  val scope = span.makeCurrent()
  try {
    span.addEvent("ScalaEA span event", attributes)
    logger.info("ScalaEA Log from span")
    Thread.sleep(1000L)
  }
  finally {
    scope.close()
    span.end()
  }
  println("trace sent")

  val meter = openTelemetry.getMeter("my-notebook")
  val gauge = meter.gaugeBuilder("my-gauge-ScalaEA").build()

  gauge.set(111, attributes)
  println("metric sent")

  openTelemetry.shutdown()

  println("Done")

~Important: run the configuration code (till sdkBuilder.build().getOpenTelemetrySdk()) only once (see #41859). If you ran it and got ConfigurationException, restart the compute cluster (Run -> Restart compute resource).~ – fixed in version 1.0.0-beta.29.

The above code as a gist: https://gist.github.com/greatvovan/a2148bccf2c0e0c2305e8a35e0779dc3

Alternative (with manual configuration): https://gist.github.com/greatvovan/5e9e3d8ecaf4e210ade619c5b55455b3

Expected behavior 3 entries in traces table (2 logs, 1 event), 1 entry in customMetrics, 1 entry in dependencies table (span) created in App Insights

Screenshots N/A

Additional context

Setup (please complete the following information):

Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report

EDIT: Updated the gists to look more like linear Scala programs. EDIT: Updated the code for version 1.0.0-beta.29 (the gists remain for version 28). EDIT: Bumped version to 1.0.0-beta.30.

joshfree commented 1 month ago

@lmolkova could you please take a look?

lmolkova commented 1 month ago

Thank you for reporting this @greatvovan!

I wonder if the problem could be around Scala notebook having different shutdown hooks and exiting before anything is exported.

Could you please try calling openTelemetry.shutdown() and awaiting it to complete to eliminate this? Thank you!

Otherwise, @jeanbisutti @heyams or @trask should be able to investigate further.

greatvovan commented 1 month ago

@lmolkova I tried openTelemetry.shutdown() as well as Thread.sleep(...) at the end, but it did not change anything.

juliuswwj commented 1 month ago

the following code snippet works in Scala under Linux, but does NOT work in a Databricks notebook/job with a compute cluster 15.4 LTS ML (includes Apache Spark 3.5.0, Scala 2.12). it seems the issue exists in the Databricks runtime

import io.opentelemetry.api.OpenTelemetry
import io.opentelemetry.api.common.Attributes
import io.opentelemetry.api.common.AttributeKey
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk
import com.azure.monitor.opentelemetry.exporter.AzureMonitorExporter

object Main {
  def test(aiConnStr: String): String = {
    val sdkBuilder = AutoConfiguredOpenTelemetrySdk.builder()

    AzureMonitorExporter.customize(sdkBuilder, aiConnStr);

    val openTelemetry = sdkBuilder.build().getOpenTelemetrySdk()
    println("Auto-configured " + openTelemetry.toString())

    val attributes = Attributes.of[java.lang.String, java.lang.Long](
      AttributeKey.stringKey("foo"), "bar",
      AttributeKey.longKey("code"), 42L
    )

    val tracer = openTelemetry.getTracer("my-notebook")
    val span = tracer.spanBuilder("My-span-ScalaEA").startSpan()
    val scope = span.makeCurrent()
    try {
      span.addEvent("XScalaEA span event", attributes)
      Thread.sleep(1000L)
    }
    finally {
      scope.close()
      span.end()
    }
    println("trace sent")

    val meter = openTelemetry.getMeter("my-notebook")
    val gauge = meter.gaugeBuilder("my-gauge-ScalaEA").build()

    gauge.set(111, attributes)
    println("metric sent")

    openTelemetry.shutdown()

    Thread.sleep(5000L)

    "done"
  }

  def main(args: Array[String]): Unit = {
    test("...")
  }
}
    libraryDependencies += "io.opentelemetry" % "opentelemetry-sdk" % "1.42.1",
    libraryDependencies += "io.opentelemetry" % "opentelemetry-api" % "1.42.1",
    libraryDependencies += "com.azure" % "azure-monitor-opentelemetry-exporter" % "1.0.0-beta.29"
greatvovan commented 1 month ago

Yes, that's exactly what I am saying in the initial post. I am not ready to blame the Databricks runtime, though. It should be a regular Java runtime. I'd rather blame the library. Maybe developers can help us enable logging from the library to stdout to troubleshoot it faster...

trask commented 2 weeks ago

hi @greatvovan! I was able to reproduce your findings

and it took me a while, but I finally found where the logs go in azure databricks 😅

and saw this:

24/10/22 22:20:52 WARN ChannelInitializer: Failed to initialize a channel. Closing: [id: 0xec533e08]
java.lang.NoSuchMethodError: io.netty.handler.codec.http.HttpClientCodec.<init>(Lio/netty/handler/codec/http/HttpDecoderConfig;ZZ)V

so it looks like databricks is bringing their own (older) version of netty which is conflicting with the version used by Azure SDK libraries

the good news is that Azure SDK libraries support alternate http client implementations, so I think you should be able to exclude com.azure:azure-core-http-netty (which is a transitive dependency of com.azure:azure-monitor-opentelemetry-exporter) and include com.azure:azure-core-http-okhttp instead

greatvovan commented 2 weeks ago

Hi @trask! Thank you for investigation, at least now we understand the reason.

In terms of fixing, how can we tell the library to stop using Netty and start using OkHttp? I installed com.azure:azure-core-http-okhttp:1.12.4 but it did nothing. I also tried installing io.netty:netty-all:4.1.114.Final or using com.azure:azure-core-http-netty:1.15.5 as root in hope that it will replace a part of the dependency tree, but with no avail.

I think you should be able to exclude com.azure:azure-core-http-netty

I don't quite understand how I can exclude a library in Databricks, and not sure it is possible, as it may be used internally.

For better diagnostics, do you not think it would be better to check all necessary dependencies in initialization stage, e.g. in AutoConfiguredOpenTelemetrySdkBuilder.build() and throw a clear exception rather than being failing silently?

Also, could you explain how to enable/find the logging you quoted, so that I could possibly try troubleshooting on my own?

trask commented 2 weeks ago

I think you should be able to exclude com.azure:azure-core-http-netty

so if this was a maven project, here's what I would do:

        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-monitor-opentelemetry-exporter</artifactId>
            <version>1.0.0-beta.29</version>
            <exclusions>
                <exclusion>
                    <groupId>com.azure</groupId>
                    <artifactId>azure-core-http-netty</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-core-http-okhttp</artifactId>
            <version>1.12.4</version>
        </dependency>

I didn't try it in Databricks, but it looks like they give you an option to add exclusions:

Image

so I'd try to install com.azure:azure-monitor-opentelemetry-exporter with exclusion ~com.azure:azure-core-http-okhttp~ com.azure:azure-core-http-netty

and then also install com.azure:azure-core-http-okhttp

Also, could you explain how to enable/find the logging you quoted, so that I could possibly try troubleshooting on my own?

yeah, open your cluster and go to the Driver logs tab, then look at the log4j logs

greatvovan commented 2 weeks ago

@trask you saved my day! Or maybe week...

Indeed, there is an option to exclude, which worked as expected.

Now traces and metrics are coming (which is the main thing for me), but logs are not. Not sure if it's still related to library conflicts, I assume the same client is responsible for exporting all kinds of telemetry, and now I don't see such complaints in driver logs anyway.

I activate logging this way:

import org.apache.logging.log4j.{LogManager, Logger, Level}
import org.apache.logging.log4j.core.{LoggerContext}
import org.apache.logging.log4j.core.appender.ConsoleAppender
import org.apache.logging.log4j.core.appender.ConsoleAppender.{Builder => ConsoleBuilder}
import org.apache.logging.log4j.core.layout.PatternLayout
import org.apache.logging.log4j.core.config.builder.api.ConfigurationBuilderFactory
import org.apache.logging.log4j.core.config.{AppenderRef, LoggerConfig}
import io.opentelemetry.instrumentation.log4j.appender.v2_17.OpenTelemetryAppender
import io.opentelemetry.instrumentation.log4j.appender.v2_17.OpenTelemetryAppender.{Builder => OtelBuilder}

val context = LogManager.getContext(false).asInstanceOf[LoggerContext]
val config = context.getConfiguration

val layout = PatternLayout.newBuilder().withPattern("%d{yyyy-MM-dd HH:mm:ss} %-5level %logger{36} - MARKER %msg%n").build()
val consoleAppender = ConsoleAppender.newBuilder().asInstanceOf[ConsoleBuilder[_]]
  .setName("MyConsoleAppender").asInstanceOf[ConsoleBuilder[_]]
  .setLayout(layout).asInstanceOf[ConsoleBuilder[_]]
  .setTarget(ConsoleAppender.Target.SYSTEM_OUT).asInstanceOf[ConsoleBuilder[_]]
  .build()
consoleAppender.start()

val otelAppender = OpenTelemetryAppender.builder().asInstanceOf[OtelBuilder[_]]
  .setName("MyOtelAppender").asInstanceOf[OtelBuilder[_]]
  .setCaptureMapMessageAttributes(true).asInstanceOf[OtelBuilder[_]]
  .setCaptureContextDataAttributes("*").asInstanceOf[OtelBuilder[_]]
  .build()
otelAppender.start()

val rootLoggerConfig = config.getRootLogger
rootLoggerConfig.addAppender(consoleAppender, null, null)
rootLoggerConfig.addAppender(otelAppender, null, null)

context.updateLoggers()

I connect a ConsoleAppender just to have an indication in stdout that my logs are handled. Then initialize the Exporter and install the Appender.

val sdkBuilder = AutoConfiguredOpenTelemetrySdk.builder()
AzureMonitorExporter.customize(sdkBuilder, aiConnStr)
val openTelemetry = sdkBuilder.build().getOpenTelemetrySdk()
OpenTelemetryAppender.install(openTelemetry)

Cluster libraries:

Image

Logging:

val logger = LogManager.getLogger("My-notebook")
logger.error("My test log")

I see the result in the cell output and cluster's stdout (with formatting I set), so I conclude that logging system is set up fine. Console and Otel appenders are added identically, and should emit messages simultaneously, but not a single log record is coming to the cloud. How can we further troubleshoot the issue?

trask commented 2 weeks ago

I think I got it working.

Just change OpenTelemetryAppender.install(openTelemetry) to otelAppender.setOpenTelemetry(openTelemetry).

It seems that OpenTelemetryAppender.install(openTelemetry) doesn't work when configuring things dynamically (as opposed to via log4j2.xml file).

Also, I found it helpful to add this to the top of the notebook:

Configurator.reconfigure();
greatvovan commented 2 weeks ago

Indeed, appender's setOpenTelemetry() connected the dots! It is unbelievable, we've got from nothing to all three types of telemetry. Thanks a lot for help.

To be more accurate, though, the difference seems to be not in static vs. dynamic use, but elsewhere. The following code, if put into a standalone Scala program,

  val configBuilder = ConfigurationBuilderFactory.newConfigurationBuilder()
  val configuration = configBuilder
    .add(
      configBuilder
        .newAppender("Console", "CONSOLE")
    )
    .add(
      configBuilder
        .newAppender("Otel", "OpenTelemetry")
        .addAttribute("captureMapMessageAttributes", true)  // Capture extra MapMessage attributes
        .addAttribute("captureContextDataAttributes", "*")  // Capture ThreadContext attributes
    )
    .add(
      configBuilder 
        .newRootLogger(Level.INFO)
        .add(configBuilder.newAppenderRef("Console"))
        .add(configBuilder.newAppenderRef("Otel"))
    )
    .build(false)
  Configurator.initialize(configuration)

  val sdkBuilder = AutoConfiguredOpenTelemetrySdk.builder()
  AzureMonitorExporter.customize(sdkBuilder, aiConnStr)
  val openTelemetry = sdkBuilder.build().getOpenTelemetrySdk()
  OpenTelemetryAppender.install(openTelemetry)

also configures logging in run time, but it is compatible with OpenTelemetryAppender.install(openTelemetry). Unfortunately configuration with ConfigurationBuilderFactory does not work under Databricks, so we can't test install() method because the problem happens in the chain before the appender. This is unfortunate also because Log4j discourages modifying configuration components directly, and my current approach (that worked) goes against declarative methods like ConfigurationBuilderFactory.

trask commented 2 weeks ago

To be more accurate, though, the difference seems to be not in static vs. dynamic use, but elsewhere

I see different behavior, you can check out my Java repro:

https://github.com/trask/log4j-dynamic-config-example/blob/836e08bd2452296a2e644e0c87eef2989db02612/src/main/java/io/opentelemetry/example/logappender/Application.java#L38-L39

greatvovan commented 2 weeks ago

No, I mean, if you use ConfigurationBuilder. Check this out:

https://github.com/greatvovan/log4j-dynamic-config-example/blob/main/src/main/java/io/opentelemetry/example/logappender/Application2.java#L46-L47

It is still dynamic, but OpenTelemetryAppender.install() does its work.

Not sure to which extent it is useful, but I think it's enough to say that the problem lies deeper than static vs dynamic split.

trask commented 1 week ago

Check this out:

https://github.com/greatvovan/log4j-dynamic-config-example/blob/main/src/main/java/io/opentelemetry/example/logappender/Application2.java#L46-L47

ah, I see, thanks!

trask commented 1 week ago

btw, not sure if this might be interesting / relevant to follow: https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/12468

trask commented 1 week ago

@greatvovan just checking if it's ok to close this issue, or if you'd like to leave it open? thanks

greatvovan commented 1 week ago

My issue is resolved, but I am thinking if we can derive moral from the story. Could the library handle the problem of missing/incorrect dependency better?

Following up on this comment, wouldn't it be better to fail more loudly than posting a quiet warning? Was it the intended design to not interfere with the application in case of problems, or we get that as an undesired side effect? My thinking is that if the user added instrumentation, it become an integral part of the program and cannot be easily sacrificed.