getsentry / sentry-java

A Sentry SDK for Java, Android and other JVM languages.
https://docs.sentry.io/
MIT License
1.16k stars 435 forks source link

Using kotlinx.coroutines.flow results in missing instrumentation #3142

Open rodolfoBee opened 10 months ago

rodolfoBee commented 10 months ago

Description

When using kotlinx.coroutines.flow to consume messages, the trace information is lost when calling methods (collect) in this package. After the kotlinx.coroutines.flow.collect(..) part is completed, tracing information is still missing.

Steps to reproduce:

  1. Download repro: sample-project.zip
  2. Add DSN in src/main/resources/application.yml
  3. Run the project.

(screenshots available in the project)

The project is using the @SentrySpan notation to create custom spans for specific function, but the result is still missing instrumentation. Creating spans manually with startChild api from the custom instrumentation page also does not provide the expected result.

The expected result is a transaction with spans for the main controller and all three Components, while the current result is missing the spans from the third component.

adinauer commented 10 months ago

Hey @rodolfoBee.

I wouldn't use globalHubMode: true as this will probably leak data between requests.

I can't see any usage of SentryContext() in your sample. Can you please give the following a try:

    @FlowPreview
    @GetMapping("/test1")
    @SentrySpan
    suspend fun test1(): ResponseEntity<String> = withContext(SentryContext()) {
        ResponseEntity.ok(callTestComponents())
    }

    @FlowPreview
    @GetMapping("/test2")
    @SentrySpan
    suspend fun test2(): ResponseEntity<String> = runBlocking(SentryContext()) {
        ResponseEntity.ok(callTestComponents())
    }

This seems to work for me at first glance for test2 at least.

Here's the spans on the transaction:

    "spans": [
        {
            "start_timestamp": 1705384736.075755,
            "timestamp": 1705384737.770012,
            "trace_id": "eeb6437174e941219853598b8a9032ef",
            "span_id": "01f052ae6583405f",
            "parent_span_id": "d5c8d163b4a74cab",
            "op": "MessageController.test2",
            "status": "ok",
            "origin": "auto.function.spring.advice",
            "data": {
                "thread.name": "http-nio-8080-exec-1",
                "thread.id": "25"
            }
        },
        {
            "start_timestamp": 1705384736.090173,
            "timestamp": 1705384736.726437,
            "trace_id": "eeb6437174e941219853598b8a9032ef",
            "span_id": "1ec98d5721704c57",
            "parent_span_id": "01f052ae6583405f",
            "op": "TestComponent1.fun1",
            "status": "ok",
            "origin": "auto.function.spring.advice",
            "data": {
                "thread.name": "http-nio-8080-exec-1",
                "thread.id": "25"
            }
        },
        {
            "start_timestamp": 1705384736.298416,
            "timestamp": 1705384736.726357,
            "trace_id": "eeb6437174e941219853598b8a9032ef",
            "span_id": "68e14b3ccfac49e7",
            "parent_span_id": "1ec98d5721704c57",
            "op": "TestComponent2.fun2",
            "status": "ok",
            "origin": "auto.function.spring.advice",
            "data": {
                "thread.name": "http-nio-8080-exec-1",
                "thread.id": "25"
            }
        },
        {
            "start_timestamp": 1705384737.752633,
            "timestamp": 1705384737.764531,
            "trace_id": "eeb6437174e941219853598b8a9032ef",
            "span_id": "4d72bef8fb80490b",
            "parent_span_id": "01f052ae6583405f",
            "op": "TestComponent3.fun3",
            "status": "ok",
            "origin": "auto.function.spring.advice",
            "data": {
                "thread.name": "http-nio-8080-exec-1",
                "thread.id": "25"
            }
        }
    ],

For test1 this doesn't seem to work because it looks like the transaction is already finished after the Thread.sleep in fun1.

Not quite sure yet how to fix but my guess is you may need WebFlux instead of WebMVC and then use https://docs.spring.io/spring-framework/docs/current/kdoc-api/spring-webflux/org.springframework.web.reactive.function.server/body-and-await.html

adinauer commented 10 months ago

The following code seems to work better:

    @FlowPreview
    @GetMapping("/test1")
    @SentrySpan
    suspend fun test1(): ServerResponse = withContext(SentryContext()) {
        ServerResponse.ok().bodyValueAndAwait(callTestComponents())
    }

This has some problems regarding serialization but at least the isn't finished mid request handling.

I swapped the following in build.gradle.kts:

//  implementation("org.springframework.boot:spring-boot-starter-web")
    implementation("org.springframework.boot:spring-boot-starter-webflux")

Hope this helps. Let me know if you need more help.

adinauer commented 10 months ago

For Spring WebMVC maybe upgrading to Spring Boot 3 could help, but I didn't test.

omurkartal commented 10 months ago

Hi @adinauer, As you see, this is just a sample project to demonstrate the problem. In our real project, we use Kotlin flows in several places such as reactive DB operations. So it's important for us to see elapsed times at these calls. (And we use webflux in our real projects, not the web.)

Adding withContext(SentryContext()) or runBlocking (like I used for /test2 endpoint) to the RestController methods doesn't solve our problem.

You are right that adding these codes prevent the "tracing data is lost" issue. But, in the first call to the endpoint, I see "Missing span instrumentation" note in the tracing flow in Sentry performance page. Spans created after this point are connected to main transaction, not to the related parent span. In the sample project, call order is like: Controller->TestComponent1->TestComponent2->TestComponent3 But Sentry shows: Controller->TestComponent1->TestComponent2 + Controller->TestComponent3

We could consider this as a workaround. But, after the first call to the endpoint, we dont see "Missing span instrumentation" part any more and time elapsed at that point.

As a result, adding withContext or runBlocking is not a solution/workaround for us. We want to use Sentry tracing to detect long running (maybe problematic) parts in our project. But, since Sentry doesn't show these details, it's meaningless to us. So, we still need a solution.

Note: if we use Kotlin flow methods (i.e. toList()) in runBlocking, "tracing data is lost" issue is solved and call order is also correct. But, adding runBlocking will make these points blocking and it's not appropriate to use it in this way in reactive programming. runBlocking { flow.toList() }

adinauer commented 10 months ago

@omurkartal we'll have to look into how Spring treats coroutines internally and check if there's a place to hook in our code differently. Then we could set the SentryContext() in order to propagate the Hub correctly. Without correct hub propagation there will always be leakage between requests and/or lost data. Until we can provide an integration for Spring coroutine usage, I'm afraid the only way to get correct data into Sentry is to provide SentryContext() and ensure our current Spring integration ends the transaction at the correct time.

The following screenshot shows what happens with this code:

    @FlowPreview
    @GetMapping("/test1")
    @SentrySpan
    suspend fun test1(): String = withContext(SentryContext()) {
        callTestComponents()
    }
Screenshot 2024-01-17 at 07 13 51

Calling this endpoint multiple times, produces the same spans / chart for me. Looks like this is caused by using WebFlux instead of WebMVC which seems to handle suspend funs out of the box. So the transaction is no longer finished mid request in test1.

With this code:

    @FlowPreview
    @GetMapping("/test1")
    @SentrySpan
    suspend fun test1(): ServerResponse = withContext(SentryContext()) {
        ServerResponse.ok().bodyValueAndAwait(callTestComponents())
    }

it ends up in Sentry the same.

Can you please provide more details on your Flow usage? Is there always a single response that contains all the items in the flow or is there also some streaming where items are returned as they are produced?

An alternative approach to supporting coroutines in Spring could be to offer a new Annotation that wraps a suspend fun with withContext(SentryContext()).

So far I can see that flow.toList in TestComponent2.kt seems to cause problems where Sentry suddenly loses track of the current span and starts adding spans to the transaction. Also modifying the span from this point (in or after flow.toList on would no longer write to the same span as before the call. I don't know why this is happening yet, will update here if I figure something out.

romtsn commented 10 months ago

@adinauer @omurkartal I have a feeling you could solve the Flow nesting problem via using the .flowOn operator to preserve sentry context:

flow
  .flowOn(SentryContext()) // or e.g. (Dispatchers.IO + SentryContext()) if you already have flowOn
  .toList() 
adinauer commented 10 months ago

@romtsn looks to me like this is related to @SentrySpan not being able to handle suspend fun correctly as spans are also finished before everything is done. So we might have to change SentrySpanAdvice or create another variant of it which is used for coroutines. The same could also be true for direct reactor usage - would have to test.

In callTestComponents it also starts using the transaction as span after testComponent1.fun1() so it may not be flow related after all.

The following code replaces the @SentrySpan with manual span handling which seems to work better (didn't look at it in too much detail).

NOTE: this doesn't have any error handling

    @FlowPreview
    @GetMapping("/test1")
//    @SentrySpan
    suspend fun test1(): String = withContext(SentryContext()) {
        val span = Sentry.getSpan()?.startChild("controller explicit")
        try {
            callTestComponents()
        } finally {
            span?.finish()
        }
    }
omurkartal commented 10 months ago

Hi @adinauer @romtsn, I tested both of your suggestions in the real project we have. But, calling the endpoint multiple times doesn't produce the same span. After the first call, we don't see "Missing span instrumentation" part and time elapsed at that point.

I checked the issue a bit more. In the sample project, whole request is executed in the same thread. But, in the real project, I see that a new thread is started in repository component.

We use a DB driver which supports R2DBC (Reactive Relational Database Connectivity).

Sample db-read code:

// db: org.springframework.r2dbc.core.DatabaseClient
suspend fun getDataFromDB(..) {
    println("t0: ${Thread.currentThread().name}")
    db
      .sql("SELECT statement").also { println("t1: ${Thread.currentThread().name}") }
      .flow().also { println("t2: ${Thread.currentThread().name}") }
      .let {
        println("t3: ${Thread.currentThread().name}")
        it.toList().let { m -> println("t4: ${Thread.currentThread().name}") }
      }
      .also { println("t5: ${Thread.currentThread().name}") }
}

Logs:

controller-1: parallel-3 // printed in RestController before calling DB method
t0: parallel-3
t1: parallel-3
t2: parallel-3
t3: parallel-3
t4: db-sql-netty-thread-2
t5: db-sql-netty-thread-2
controller-2: db-sql-netty-thread-2 // printed in RestController after calling DB method
adinauer commented 10 months ago

Hi @omurkartal, the switch to another thread should not be problematic if SentryContext is used. Same for multiple requests landing on the same thread.

I'm able to reproduce the behaviour of missing fun3 when stressing the system. I did so by using the following command wrk -t8 -c30 -d30s http://127.0.0.1:8080/test1. This sends ~400 requests in 30s.

For the transactions where fun3 is missing it looks like all of those have a fun2 span with status deadline_exceeded. Can you confirm this is also true for your tests with the database? You can download the JSON from the Sentry web interface using the button on the top right of the transaction page. Ah status is also visible on the UI when opening up span details, so no need for JSON.

adinauer commented 10 months ago

Adding more spans to fun2 shows that it sometimes loses spans happening after the flow, so this might be related to flow after all. Haven't seen drops on a different point.

adinauer commented 10 months ago

Using flowOn(SentryContext()) doesn't seem to change this behaviour.

adinauer commented 10 months ago

Commenting out the flow in fun2 seems to get rid of the problem, moving the flow and adding a span after it, causes problems again where the span after the flow is sometimes dropped.

adinauer commented 10 months ago

An alternative here could be using our OpenTelemetry integration and see if that works better.

getsantry[bot] commented 9 months ago

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you remove the label Waiting for: Community, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

omurkartal commented 9 months ago

Hi @adinauer, I could not test it in detail but I saw that the tracing data was lost in the same way with opentelemetry. its not the exactly same issue we have but I also found this issue related with open telemetry: https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/8659

adinauer commented 9 months ago

Thanks for the update @omurkartal