elastic / apm-agent-nodejs

https://www.elastic.co/guide/en/apm/agent/nodejs/current/index.html
BSD 2-Clause "Simplified" License

Low level details about outbound HTTP requests #2595

Open jpage-godaddy opened 2 years ago

jpage-godaddy commented 2 years ago

One challenge with understanding performance issues in Elastic APM is that outbound HTTP requests are shown as a simple bar representing start and end time, but that omits many details. Additional details like how much of that request was DNS lookup, connection time, SSL handshake, and request body upload and response body download would help us identify networking and environmental issues. New Relic has this level of granularity, and we can see it in browser instrumentation; it'd be nice to have it in Elastic APM as well.

astorm commented 2 years ago

Hey @jpage-godaddy, thanks for dropping by.

Two quick questions

With the above information we'll be better prepped to explore this feature request. Thanks!

jpage-godaddy commented 2 years ago

We no longer use New Relic, so I can't get a screenshot, but if I remember correctly you could drill down into an outbound HTTP request and see it broken down into different segments, much like what you see in browser dev tools. We were able to see segments like connection time, DNS lookup time, etc. We also only used the Node.js agent, never the browser agent, so I'm not sure whether they capture this in the browser (though I imagine they do, since it's easy to get this information from browser APIs).

astorm commented 2 years ago

@jpage-godaddy Thanks for those clarifications; they were helpful. Looking at some New Relic screenshots, is this the feature you're remembering, or did I grab the wrong one?

(Screenshot "transaction-trace": an HTTP request being made to example.com from an Express handler.)

jpage-godaddy commented 2 years ago

Yep, exactly that, @astorm.

astorm commented 2 years ago

Thanks @jpage-godaddy -- feature request acknowledged. :)

One last question -- I presume this request comes from a desire to get back a feature you had with your prior vendor. That said, is there a specific use case or problem you're trying to solve here? If so, please let us know -- those kinds of use cases can often help us move something up the "get it done" ladder.

astorm commented 2 years ago

Technical details: this is definitely New Relic's Node agent and has nothing to do with their browser/RUM agent. The data comes from New Relic wrapping additional core methods to generate spans in the following places (a span is, roughly speaking, what their code calls a segment):

https://github.com/newrelic/node-newrelic/blob/main/lib/instrumentation/core/dns.js

https://github.com/newrelic/node-newrelic/blob/91967dd5cd997aa283b8aa0b2fdacc2a5f10a628/lib/instrumentation/core/net.js

https://github.com/newrelic/node-newrelic/blob/40ad3ce967137f8d11bf0230c7a524e5c1fd5af6/lib/instrumentation/core/http.js#L488
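For illustration, here's a minimal sketch of that "wrap a core method" technique applied to `dns.lookup`. The `startSpan`/`endSpan` helpers are hypothetical stand-ins for an agent's span API; this is not New Relic's (or our agent's) actual code:

```js
const dns = require('dns');

// Hypothetical span helpers standing in for an agent's real span API.
function startSpan(name) {
  return { name, start: process.hrtime.bigint() };
}
function endSpan(span) {
  const ms = Number(process.hrtime.bigint() - span.start) / 1e6;
  console.log(`${span.name}: ${ms.toFixed(1)}ms`);
}

const origLookup = dns.lookup;
dns.lookup = function wrappedLookup(hostname, options, callback) {
  // dns.lookup(hostname, callback) is also a valid signature.
  if (typeof options === 'function') {
    callback = options;
    options = undefined;
  }
  const span = startSpan(`dns.lookup ${hostname}`);
  const done = (...args) => {
    endSpan(span);
    callback(...args);
  };
  return options === undefined
    ? origLookup.call(dns, hostname, done)
    : origLookup.call(dns, hostname, options, done);
};

// Usage: any code path that resolves a hostname now produces a "span".
dns.lookup('example.com', (err, address) => console.log(err || address));
```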

Also -- here's how this looks in New Relic's Distributed Tracing view.

(Screenshot "dt-1": New Relic's distributed tracing view.)

By default, NR only shows "service to service" spans in its distributed tracing view: you get a transaction for the URL being processed (GET /hello) and a span for the outgoing service request.

However, NR also has the concept of "in process" spans -- essentially all the spans that were generated. If you turn these on, you can see the spans for the HTTP agent, the net module, and the dns module (and a bunch of others).

(Screenshot "dt-in-process": the same distributed tracing view with in-process spans enabled.)

jpage-godaddy commented 2 years ago

Sure. A common performance problem we hit is in the hand-wavey parts between when we make an outbound HTTP request and when the target service actually handles it. We frequently hit issues where the caller shows hundreds to thousands of milliseconds for a request, but the APM data for that API shows only tens of milliseconds. Sometimes DNS is the problem (a badly configured host environment, etc.). Sometimes there are resource limits, like max sockets, on the caller's side. Sometimes networking/security infrastructure between the endpoints causes throttling or slowdowns that hamper byte transfer rates or connections. Seeing which part of the request took the longest helps us find the root cause much faster than having to investigate the entire stack to figure out what the issue is.

astorm commented 2 years ago

Additional Technical Details:

In order to comply with our spec:

Exit spans MAY have child spans that have the same type and subtype. For example, an HTTP exit span may have child spans with the action request, response, connect, dns. These spans MUST NOT have any destination context, so that there's no effect on destination metrics.

we'll need an approach that specifically instruments the HTTP request (vs. the generic instrumentation of method calls in net and dns) in order to ensure the new spans have the same type/subtype as the HTTP spans.
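For concreteness, here's a hypothetical illustration of the span tree that the spec allows; the field names are illustrative and not the agent's actual intake format:

```js
// Illustrative only -- not the agent's real span/intake schema.
const httpExitSpan = {
  name: 'GET example.com',
  type: 'external',
  subtype: 'http',
  context: { destination: { address: 'example.com', port: 443 } },
  children: [
    // Child phase spans share the parent's type/subtype, differ by action,
    // and carry no destination context so destination metrics are unaffected.
    { name: 'DNS lookup',        type: 'external', subtype: 'http', action: 'dns' },
    { name: 'TCP connect',       type: 'external', subtype: 'http', action: 'connect' },
    { name: 'request sent',      type: 'external', subtype: 'http', action: 'request' },
    { name: 'response received', type: 'external', subtype: 'http', action: 'response' },
  ],
};
```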

One approach would be generating timings via event listeners directly on the socket (hat tip @trentm).

https://blog.risingstack.com/measuring-http-timings-node-js/#measuringhttptimingsinnodejs describes how we could get timings for the DNS, connect, and TLS phases by instrumenting just the HTTP client request (at least for the core Node.js HTTP client).
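As a rough sketch of that approach (not the agent's implementation), phase timings for a single request made with core `https` could be collected from socket events like so; `example.com` is just a placeholder target:

```js
const https = require('https');

const t = { start: Date.now() };
const req = https.get('https://example.com/', (res) => {
  res.once('readable', () => { t.firstByte = Date.now(); });
  res.on('data', () => {});        // drain the response body
  res.on('end', () => {
    t.end = Date.now();
    // Note: with keep-alive agents a reused socket emits no 'lookup'/'connect',
    // so a real implementation would need to handle missing phases.
    console.log({
      dns: t.dnsLookup - t.start,
      tcp: t.tcpConnect - t.dnsLookup,
      tls: t.tlsHandshake - t.tcpConnect,
      ttfb: t.firstByte - t.tlsHandshake,
      download: t.end - t.firstByte,
      total: t.end - t.start,
    });
  });
});
req.on('socket', (socket) => {
  socket.on('lookup', () => { t.dnsLookup = Date.now(); });
  socket.on('connect', () => { t.tcpConnect = Date.now(); });
  socket.on('secureConnect', () => { t.tlsHandshake = Date.now(); });
});
req.on('error', (err) => console.error(err));
```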

Additionally -- if we're suppressing the HTTP span under an HTTP service span (e.g. our approach in the elasticsearch instrumentation), this information will not be captured.

sibelius commented 3 days ago

Can we capture the outgoing HTTP request body and the response body?