DataDog / dd-trace-js

JavaScript APM Tracer
https://docs.datadoghq.com/tracing/
Other
650 stars 306 forks source link

instrument graphql-jit to not lose graphql spans post-JIT #2385

Open blimmer opened 2 years ago

blimmer commented 2 years ago

Background

We're using graphql-jit by way of the fastify mercurius plugin. JIT-ing graphql queries creates V8-optimizable functions to increase performance.

When using mercurius, there's a jit parameter you can set to enable this functionality. By default, JIT-ing is disabled.

The Issue

After the jit occurs, we lose all the GraphQL spans reported to Datadog. This makes it difficult to analyze APM spans because we don't know what query is being executed.

Reproduction Repository

I created a simple repo that shows this issue: https://github.com/blimmer/dd-trace-fastify-app

Setup:

git clone https://github.com/blimmer/dd-trace-fastify-app.git
cd dd-trace-fastify-app
yarn

To see things working with Jit turned off:

  1. Comment out the jit parameter in src/server.js, or set the value to 0 to disable it. https://github.com/blimmer/dd-trace-fastify-app/blob/3b73042d26779234a0e425d70703bf82128ddffa/src/server.js#L26
  2. Start the server: yarn start
  3. Request localhost:3000 to trigger a graphql query twice

You'll notice in the dd-trace debug output that graphql spans are being reported on every request:

Encoding payload: [{"trace_id":"05407a817cea965e","span_id":"05407a817cea965e","parent_id":"0000000000000000","name":"fastify.request","resource":"GET /","error":0,"meta":{"_dd.p.dm":"-0","service":"dd-trace-fastify-app","runtime-id":"1ec81454-92be-41ec-9852-75bc0236c607","http.url":"http://localhost:3000/","http.method":"GET","span.kind":"server","http.useragent":"curl/7.79.1","http.client_ip":"127.0.0.1","http.route":"/","http.status_code":"200","language":"javascript"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_dd.measured":1,"_sampling_priority_v1":1},"start":1663262975977008000,"duration":15307373,"service":"dd-trace-fastify-app","type":"web"},{"trace_id":"05407a817cea965e","span_id":"491bbd450238cbf4","parent_id":"05407a817cea965e","name":"graphql.parse","resource":"graphql.parse","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"1ec81454-92be-41ec-9852-75bc0236c607","language":"javascript"},"metrics":{"_dd.measured":1,"_sampling_priority_v1":1},"start":1663262975978667000,"duration":453613,"service":"dd-trace-fastify-app","type":"graphql"},{"trace_id":"05407a817cea965e","span_id":"7c3e8bb9111c86c1","parent_id":"05407a817cea965e","name":"graphql.validate","resource":"graphql.validate","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"1ec81454-92be-41ec-9852-75bc0236c607","language":"javascript"},"metrics":{"_dd.measured":1,"_sampling_priority_v1":1},"start":1663262975979259600,"duration":2723633,"service":"dd-trace-fastify-app","type":"graphql"},{"trace_id":"05407a817cea965e","span_id":"22e8b1a6bba14b1d","parent_id":"05407a817cea965e","name":"graphql.execute","resource":"{add(x:0,y:0)}","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"1ec81454-92be-41ec-9852-75bc0236c607","graphql.operation.type":"query","language":"javascript"},"metrics":{"_dd.measured":1,"_sampling_priority_v1":1},"start":1663262975982323500,"duration":7541748,"service":"dd-trace-fastify-app","type":"graphql"},{"trace_id":"05407a817cea965e","span_id":"5ec626dbd66f2aaf","parent_id":"22e8b1a6bba14b1d","name":"graphql.resolve","resource":"add:Int","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"1ec81454-92be-41ec-9852-75bc0236c607","graphql.field.name":"add","graphql.field.path":"add","graphql.field.type":"Int","language":"javascript"},"metrics":{"_sampling_priority_v1":1},"start":1663262975989088300,"duration":516846,"service":"dd-trace-fastify-app","type":"graphql"}]

Encoding payload: [{"trace_id":"51d067f3ffe906e5","span_id":"51d067f3ffe906e5","parent_id":"0000000000000000","name":"fastify.request","resource":"GET /","error":0,"meta":{"_dd.p.dm":"-0","service":"dd-trace-fastify-app","runtime-id":"1ec81454-92be-41ec-9852-75bc0236c607","http.url":"http://localhost:3000/","http.method":"GET","span.kind":"server","http.useragent":"curl/7.79.1","http.client_ip":"127.0.0.1","http.route":"/","http.status_code":"200","language":"javascript"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_dd.measured":1,"_sampling_priority_v1":1},"start":1663262997086006300,"duration":1995117,"service":"dd-trace-fastify-app","type":"web"},{"trace_id":"51d067f3ffe906e5","span_id":"743b34f1a4966cd6","parent_id":"51d067f3ffe906e5","name":"graphql.execute","resource":"{add(x:0,y:0)}","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"1ec81454-92be-41ec-9852-75bc0236c607","graphql.operation.type":"query","language":"javascript"},"metrics":{"_dd.measured":1,"_sampling_priority_v1":1},"start":1663262997086484700,"duration":1027344,"service":"dd-trace-fastify-app","type":"graphql"},{"trace_id":"51d067f3ffe906e5","span_id":"57353bc4cf33aaf6","parent_id":"743b34f1a4966cd6","name":"graphql.resolve","resource":"add:Int","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"1ec81454-92be-41ec-9852-75bc0236c607","graphql.field.name":"add","graphql.field.path":"add","graphql.field.type":"Int","language":"javascript"},"metrics":{"_sampling_priority_v1":1},"start":1663262997087312400,"duration":142090,"service":"dd-trace-fastify-app","type":"graphql"}]

To see graphql spans being dropped after Jit is turned on

  1. Set the jit parameter in src/server.js to 1: https://github.com/blimmer/dd-trace-fastify-app/blob/3b73042d26779234a0e425d70703bf82128ddffa/src/server.js#L26
  2. (re-)Start the server: yarn start
  3. Request localhost:3000 to trigger a graphql query twice

On the first request, you'll see the graphql spans being reported:

Encoding payload: [{"trace_id":"53600ada74b9c296","span_id":"53600ada74b9c296","parent_id":"0000000000000000","name":"fastify.request","resource":"GET /","error":0,"meta":{"_dd.p.dm":"-0","service":"dd-trace-fastify-app","runtime-id":"a85edf01-e2ae-444a-80e7-42dd509bf585","http.url":"http://localhost:3000/","http.method":"GET","span.kind":"server","http.useragent":"curl/7.79.1","http.client_ip":"127.0.0.1","http.route":"/","http.status_code":"200","language":"javascript"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_dd.measured":1,"_sampling_priority_v1":1},"start":1663263365476003000,"duration":11325928,"service":"dd-trace-fastify-app","type":"web"},{"trace_id":"53600ada74b9c296","span_id":"17e68050e468fe4b","parent_id":"53600ada74b9c296","name":"graphql.parse","resource":"graphql.parse","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"a85edf01-e2ae-444a-80e7-42dd509bf585","language":"javascript"},"metrics":{"_dd.measured":1,"_sampling_priority_v1":1},"start":1663263365477144600,"duration":336670,"service":"dd-trace-fastify-app","type":"graphql"},{"trace_id":"53600ada74b9c296","span_id":"72a7b2f1e5fef060","parent_id":"53600ada74b9c296","name":"graphql.validate","resource":"graphql.validate","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"a85edf01-e2ae-444a-80e7-42dd509bf585","language":"javascript"},"metrics":{"_dd.measured":1,"_sampling_priority_v1":1},"start":1663263365477585400,"duration":2002930,"service":"dd-trace-fastify-app","type":"graphql"},{"trace_id":"53600ada74b9c296","span_id":"7268443289046ab9","parent_id":"53600ada74b9c296","name":"graphql.execute","resource":"{add(x:0,y:0)}","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"a85edf01-e2ae-444a-80e7-42dd509bf585","graphql.operation.type":"query","language":"javascript"},"metrics":{"_dd.measured":1,"_sampling_priority_v1":1},"start":1663263365479854600,"duration":6032471,"service":"dd-trace-fastify-app","type":"graphql"},{"trace_id":"53600ada74b9c296","span_id":"235067e9eec2a400","parent_id":"7268443289046ab9","name":"graphql.resolve","resource":"add:Int","error":0,"meta":{"service":"dd-trace-fastify-app","runtime-id":"a85edf01-e2ae-444a-80e7-42dd509bf585","graphql.field.name":"add","graphql.field.path":"add","graphql.field.type":"Int","language":"javascript"},"metrics":{"_sampling_priority_v1":1},"start":1663263365485286100,"duration":390625,"service":"dd-trace-fastify-app","type":"graphql"}]

But then on the second request (now that the query is JIT-ed), you'll see that no graphql spans are being reported:

Encoding payload: [{"trace_id":"022f9c2620baa312","span_id":"022f9c2620baa312","parent_id":"0000000000000000","name":"fastify.request","resource":"GET /","error":0,"meta":{"_dd.p.dm":"-0","service":"dd-trace-fastify-app","runtime-id":"a85edf01-e2ae-444a-80e7-42dd509bf585","http.url":"http://localhost:3000/","http.method":"GET","span.kind":"server","http.useragent":"curl/7.79.1","http.client_ip":"127.0.0.1","http.route":"/","http.status_code":"200","language":"javascript"},"metrics":{"_dd.agent_psr":1,"_dd.top_level":1,"_dd.measured":1,"_sampling_priority_v1":1},"start":1663263394892011500,"duration":4963379,"service":"dd-trace-fastify-app","type":"web"}]

You can also see that it's related to the jit process by setting the jit value here to different values. For instance, if you set it to 5 (JIT after 5 requests), you'll see the spans reported 5 times and on the 6th request, they disappear.

Comments

Workaround

In the meantime, I was able to capture the body of the request by adding a context method to the Mercurius config:

app.register(mercurius, {
  schema,
  resolvers,
  context: (req) => {
    const span = datadog.tracer.scope().active();
    if (span) {
      span.setTag("http.body", req.body);
    }

    return req;
  },
  jit: 1,
});

This at least allows me to see the query/mutation body, which is very helpful for debugging. Before this change, all I could see was POST / graphql.

Qard commented 2 years ago

We don't currently have instrumentation for mercurius or graphql-jit so we'll likely need to schedule some work to write instrumentation for that.