DataDog / dd-trace-js

JavaScript APM Tracer
https://docs.datadoghq.com/tracing/
Other
650 stars 306 forks source link

dd-trace causes crash in v8 internals, SIGABRT #1229

Closed doronrk closed 3 years ago

doronrk commented 3 years ago

Describe the bug

After upgrading to node 12.20.1, we are getting the below crash on our localhost consistently. DD_TRACE_ENABLED=false resolves the issue, so it looks like dd-trace is implicated.

W20210201-11:27:47.096(-8)? (STDERR) program[53589]: ../src/api/callback.cc:123:void node::InternalCallbackScope::Close(): Assertion `(env_->execution_async_id()) == (0)' failed.
W20210201-11:27:47.097(-8)? (STDERR)  1: 0x1011f62d5 node::Abort() (.cold.1) [program]
W20210201-11:27:47.098(-8)? (STDERR)  2: 0x1000a5a29 node::Abort() [program]
W20210201-11:27:47.098(-8)? (STDERR)  3: 0x1000a5891 node::Assert(node::AssertionInfo const&) [program]
W20210201-11:27:47.099(-8)? (STDERR)  4: 0x100001c96 node::InternalCallbackScope::Close() [program]
W20210201-11:27:47.099(-8)? (STDERR)  5: 0x10000223a node::InternalMakeCallback(node::Environment*, v8::Local<v8::Object>, v8::Local<v8::Object>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*, node::async_context) [program]
W20210201-11:27:47.100(-8)? (STDERR)  6: 0x100016598 node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>*) [program]
W20210201-11:27:47.100(-8)? (STDERR)  7: 0x10015d560 node::StreamBase::CallJSOnreadMethod(long, v8::Local<v8::ArrayBuffer>, unsigned long, node::StreamBase::StreamBaseJSChecks) [program]
W20210201-11:27:47.101(-8)? (STDERR)  8: 0x10015f2d1 node::EmitToJSStreamListener::OnStreamRead(long, uv_buf_t const&) [program]
W20210201-11:27:47.102(-8)? (STDERR)  9: 0x1000c3224 node::(anonymous namespace)::Parser::OnStreamRead(long, uv_buf_t const&) [program]
W20210201-11:27:47.103(-8)? (STDERR) 10: 0x100163d6f node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) [program]
W20210201-11:27:47.105(-8)? (STDERR) 11: 0x100975355 uv__stream_io [program]
W20210201-11:27:47.107(-8)? (STDERR) 12: 0x10097deac uv__io_poll [program]
W20210201-11:27:47.109(-8)? (STDERR) 13: 0x10096a651 uv_run [program]
W20210201-11:27:47.110(-8)? (STDERR) 14: 0x1000e8d97 node::NodeMainInstance::Run() [program]
W20210201-11:27:47.111(-8)? (STDERR) 15: 0x10007b0e9 node::Start(int, char**) [program]
W20210201-11:27:47.112(-8)? (STDERR) 16: 0x7fff6b74bcc9 start [/usr/lib/system/libdyld.dylib]
=> Exited from signal: SIGABRT
=> Your application is crashing. Waiting for file change.

Environment

localhost development

OS X 10.15.7

12.20.1

0.30.5

N/A, running on my localhost

rochdev commented 3 years ago

I wasn't able to reproduce the issue with a simple app, so a few follow up question:

rochdev commented 3 years ago

Just tried with a brand new Meteor app and I still can't replicate the issue.

doronrk commented 3 years ago

Are you running directly on the host or in a Docker container?

This is currently happening on all of our eng team's laptops running OS X.

Does the error happen right at startup or does it happen later?

It never happens right away, only after clicking around through our application a little bit.

Can you only replicate when using Meteor?

So far we have only replicated using our meteor application.

What if you update only either the tracer or Meteor but not both?

The crash happens on meteor 2.0 === node 12.20.1, regardless of which dd-trace version is used (only tried with 0.30.5 and 0.29.1)

The crash does not happen on meteor 1.11.1 === node 12.18.4, with dd-trace 0.29.1 - I have yet to test with only upgrading dd-trace to 0.35.5

Does it happen with a brand new Meteor app?

I have not tried yet.

What Node version did you upgrade from?

We upgraded from node 12.18.4

Any chance that you can provide a replication snippet?

I'll try my best to get around to this, but it will be tough for us. The most tempting route at this point is to hold off on upgrading meteor/node (which sucks) so we can focus on stuff that's more core to our business.

rochdev commented 3 years ago

The error seems to come from a C++ check in async_hooks, so it's possible it's a bug in Node that is triggered by the combination of your app, the tracer, and async_hooks which we use for context propagation. What happens if you upgrade only Meteor and not Node? Also, do you get the same issue if you upgrade to latest 14 instead of latest 12?

doronrk commented 3 years ago

What happens if you upgrade only Meteor and not Node?

My understanding is that the versions are tightly coupled. I'll check whether this is possible, but my understanding is that it is not.

Also, do you get the same issue if you upgrade to latest 14 instead of latest 12?

I don't believe there are any versions of meteor that use node 14. Meteor 2.0 is the latest release, which uses node 12

pozylon commented 3 years ago

I have the same issue, had to DD_TRACE_ENABLED: 0, it's not meteor 2.0, downgrading does not help

bengl commented 3 years ago

Hi @pozylon!

We believe the issue is caused by using AsyncLocalStorage to track async context. This might be a bug upstream in Node.js, or in Meteor, or in the Datadog code, but @doronrk has confirmed that initializing with the async_resource scope manager instead (e.g. via tracer.init({ scope: 'async_resource' })) is a workaround for now.

If either of you can help us reproduce this, it would be super helpful for tracking down the bug so it can be fully fixed.

rochdev commented 3 years ago

The issue should be fixed in 0.30.6. However, as mentioned above, the fix was to stop using AsyncLocalStorage meaning that this is only working around the issue. It's possible that we eventually make changes to our code that cause a regression unless we find the actual issue. It would be appreciated if one of you can provide a replication snippet so that we can also provide a fix upstream in either Node or Meteor and resolve the issue completely.

bengl commented 3 years ago

I'm actually going to go ahead and re-open this, since the underlying issue is not resolved.

rochdev commented 3 years ago

@bengl Given the recent development that the issue is actually with Meteor and can only be fixed by them, should we close this issue since this is not something we can do something about?

bengl commented 3 years ago

Yeah, let's close this since the issue is in how two libraries interoperate that are not within this repo, and the issue that was immediately presenting itself to @doronrk and others was fixed in a patch release. We won't be doing any further fixes in this project/repository with respect to fibers, and the situation is stagnant until Meteor moves off of fibers.

We've also added to our official documentation regarding fibers. https://docs.datadoghq.com/tracing/setup_overview/compatibility_requirements/nodejs/#fibers

We can reopen this if necessary.