DataDog / dd-trace-js

JavaScript APM Tracer
https://docs.datadoghq.com/tracing/

Long Running Spans are not reported #1565

Open fenos opened 3 years ago

fenos commented 3 years ago

Describe the bug
Long-running spans are not being reported.

I have a Node.js worker that spawns a CLI command, and we wrap that in a tracer.trace() call. When the CLI process finishes within 5-10 minutes the span is sent to APM just fine, but when it takes longer (~30 minutes) it doesn't.

I've seen reports of this exact issue on the PHP SDK; could the Node tracer suffer from the same bug?

We are really keen to have this fixed ASAP, as it would give us a lot of visibility into our long-running processes. Thanks!

Environment: APM, Docker

rochdev commented 3 years ago

In our case (and I assume PHP has the same problem), the issue is that we can't send the trace until it's finished. For very long traces it's possible that the trace becomes too large because it contains too many spans, and the payload is rejected by the agent or even by the tracer. We don't really have a way to fix this right now, as the root span needs to finish before any of its children can be sent.

However, what you could try is to reduce the size of each span so that the trace payload is smaller. One good way to do this is to use the 0.5 endpoint on the agent, which deduplicates strings; since strings make up most of the size of a trace, this should significantly reduce the payload size. You can enable the 0.5 endpoint with the DD_TRACE_AGENT_PROTOCOL_VERSION=0.5 environment variable or the corresponding protocolVersion programmatic configuration option.
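For reference, a minimal sketch of the programmatic option (assuming the tracer is initialized in your own bootstrap file) would be:

// Equivalent to setting DD_TRACE_AGENT_PROTOCOL_VERSION=0.5 in the environment.
const tracer = require('dd-trace').init({
  protocolVersion: '0.5'
});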

Please let me know if that works for your use case.

fenos commented 3 years ago

Mmmh, interesting....

However, I'd guess that my spans are very small in payload size, as the only thing they do is spawn a child_process and wait.

This is roughly how I'm tracing:

tracer.trace('worker.analyzer', async () => {
  await tracer.trace('download', async () => {
    return downloadFile();
  });

  await tracer.trace('analyze', async () => {
    await spawnChildProcess(); // very long, ~30 minutes
  });

  await tracer.trace('upload', async () => {
    return uploadFile();
  });
});

In the meantime I'll try the option you suggested.

fenos commented 3 years ago

I've tried using agent protocol 0.5, but the span still isn't sent to APM 😞

rochdev commented 3 years ago

Does the process exit right after this code is done running? If so, does it exit gracefully or is it killed with process.exit()?

rochdev commented 3 years ago

How many spans would you say are generated by the code? Is the above function called multiple times?

fenos commented 3 years ago

So the app doesn't exit, since it is a worker that pulls messages from a queue, and we process one message at a time.

I generate 3 custom spans, and some others are generated by the dd-trace plugins.

However, the number of spans doesn't change for longer-running jobs, as the child process is what takes most of the time.

The same code works for "faster" jobs, but not when the child process takes ~30 minutes.

I could try to reproduce this with a simple 30-minute setTimeout inside the tracer.trace callback; I believe the result will be the same. A minimal sketch of that reproduction is below.
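For reference, a rough sketch of that reproduction, where a sleep helper stands in for the long-running child process (names here are illustrative):

const tracer = require('dd-trace').init();

// Hypothetical helper standing in for the long-running child process.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

tracer.trace('worker.analyzer', async () => {
  await tracer.trace('analyze', async () => {
    await sleep(30 * 60 * 1000); // simulate a ~30 minute job
  });
});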

rochdev commented 3 years ago

A reproduction would definitely help. You can always enable debug mode to see more of what the tracer is doing, but with a reproduction I could do that on my end, which might be faster for finding and correcting the issue.
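For reference, debug logging is normally enabled with the DD_TRACE_DEBUG=true environment variable; a sketch of the equivalent programmatic form (assuming the debug init option applies to your tracer version) would be:

// Enable verbose tracer logging at init time; the tracer must be
// initialized before any other require for instrumentation to apply.
const tracer = require('dd-trace').init({
  debug: true // assumption: equivalent to DD_TRACE_DEBUG=true
});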

fenos commented 3 years ago

@rochdev I've tried reproducing it with a simple 30-minute setTimeout and the span is sent just fine, so my theory was wrong.

I'm wondering if it has something to do with the child process. Do you know if any of the plugins trace the child process in some way?

rochdev commented 3 years ago

Nothing should be tracing the child process, and even if it were instrumented, the traces would not be connected, since context propagation across processes would not be supported.

You mentioned creating a few custom spans. Is it possible one of them is never finished? The trace will not be sent until all of its spans have finished.
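For illustration, a sketch of the kind of pattern that can leave a span open when spans are created manually (the span name and doCleanup helper are made up):

const tracer = require('dd-trace').init();

async function cleanup() {
  // Manually created spans must be finished explicitly; a span left
  // open here would keep the entire trace from ever being sent.
  const span = tracer.startSpan('worker.cleanup');
  try {
    await doCleanup(); // hypothetical long-running work
  } finally {
    span.finish();
  }
}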

xxxxx2334 commented 2 years ago

I'm having the same issue: no traces are sent, no errors, no exits. I've enabled debug and found that the payload is too big. The code has a loop with SQL and Redis queries. I've already tried protocol 0.5 and it doesn't work. How can I make this work?

rochdev commented 2 years ago

@nicolas-zz In theory this would only happen if the trace is larger than 50MB, which should generally not happen, especially with 0.5, which deduplicates strings. Depending on the error you are seeing, maybe the logic to flush early is missing some condition. Can you share the exact error you are getting? Also, do you know how many spans are part of the trace?