aws / aws-xray-sdk-node

The official AWS X-Ray SDK for Node.js.
Apache License 2.0

Very high CPU usage (at some point) with Postgres' capture #307

Open sroze opened 4 years ago

sroze commented 4 years ago

Hi there,

We have been using X-Ray for a few weeks now and we have had a few issues with some of our Node processes using 100% CPU for no apparent reason.

Details of the issue

CPU and memory usage increases abruptly. (CPU and memory usage screenshots attached.)
In the logs, the `trace_id` we retrieve with X-Ray's `getCurrentSegment()` is different from the one recorded in the log entry. (Screenshot attached.)
Profiling the Node process, we can see that X-Ray functions are the ones taking all the CPU. On a ~11-second profile, this is the result. (Profiler screenshot attached.)

Questions

sroze commented 4 years ago

Digging more into the problem, it seems like we are able to reproduce it when a request produces many, many segments.

willarmiros commented 4 years ago

Hi @sroze, thank you for raising this issue. It definitely seems to be the result of a high load of (sub)segments. Regardless, it should not be consuming the amount of memory or CPU that it is.

Digging more into the problem, it seems like we are able to reproduce it when a request produces many, many segments.

Do you mean there are many subsegments (e.g. several downstream subsegments being recorded on a single segment) or actually several top-level segments? Based on the Segment Too Large errors I believe it would be the former.

My initial guess is that your segments are getting a bunch of subsegments attached to them, which makes them too large to send, and then because they fail to send they're somehow left stale in cls-hooked, our context management system, causing memory leaks. I'm not really sure whether this would cause the CPU usage; maybe that's the repeated attempts to serialize a bunch of 64 KB segments into JSON?

The trace ID discrepancy is VERY interesting, that gives further evidence for an error in cls-hooked as well. I'll investigate the SegmentTooLarge failure mode some to see if memory leaking is possible. The auto-context function mentioned in the profiler also manipulates the cls-hooked context.
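
To make the failure mode concrete, here is a rough sketch of the kind of workload I have in mind (illustrative only, not taken from your code; it uses the SDK's manual segment APIs): one segment accumulating thousands of subsegments before it can be closed and emitted.

  // Illustrative sketch of the suspected workload shape: a single segment
  // accumulating thousands of subsegments before it is closed and emitted.
  const AWSXRay = require('aws-xray-sdk');

  const segment = new AWSXRay.Segment('worker-job');

  AWSXRay.getNamespace().run(() => {
    AWSXRay.setSegment(segment);

    // Each captured SQL query becomes another subsegment on the same parent.
    for (let i = 0; i < 4000; i++) {
      const sub = segment.addNewSubsegment('postgres-query');
      sub.close();
    }

    // With this many subsegments attached, the serialized document can exceed
    // the UDP datagram limit, producing "Segment too large to send" errors.
    segment.close();
  });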

sroze commented 4 years ago

Do you mean there are many subsegments (e.g. several downstream subsegments being recorded on a single segment) or actually several top-level segments? Based on the Segment Too Large errors I believe it would be the former.

Indeed, the former: many subsegments on a single top-level segment.

I'm not really sure whether this would cause the CPU usage; maybe that's the repeated attempts to serialize a bunch of 64 KB segments into JSON?

Is there any sort of process that tries to re-send (more specifically, re-serialize) subsegments if they somehow failed to be sent? That could explain why it's consuming so much CPU: it would attempt to serialize the same thing over and over.

It seems to happen when there are a lot of SQL requests involved, which might help your investigation. It's worth noting that we have a query logger that attaches the SQL string using the following wonderful strategy:

  try {
    const segment: any = await AWSXRay.getSegment();
    if (segment.subsegments) {
      // get the latest subsegment, assumed to be the pg query that just ran
      const sqlSub = segment.subsegments[segment.subsegments.length - 1];

      // attach the raw query text to that subsegment's SQL metadata
      sqlSub.sql.url = query;
    }
  } catch (e) {
    log.warn({ query }, 'Unable to attach query to trace', e);
  }

This obviously might contribute to a larger payload, as some queries can be pretty long given that we are using an ORM, which lists all the fields.
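
If the payload size turns out to be the culprit, one option on our side would be something along these lines (purely hypothetical, not what we run today; the helper name and the 2048-character cap are made up for illustration): truncate the query before attaching it, so each subsegment stays small.

  // Hypothetical variant of our query logger that truncates the SQL string
  // before attaching it, so each subsegment's payload stays bounded.
  const AWSXRay = require('aws-xray-sdk');

  const MAX_QUERY_LENGTH = 2048; // arbitrary cap, for illustration only

  function attachQueryToTrace(query) {
    const segment = AWSXRay.getSegment();
    if (!segment || !segment.subsegments || segment.subsegments.length === 0) {
      return;
    }
    // Same "latest subsegment" trick as above, but with a truncated query.
    const sqlSub = segment.subsegments[segment.subsegments.length - 1];
    sqlSub.sql = sqlSub.sql || {};
    sqlSub.sql.url =
      query.length > MAX_QUERY_LENGTH ? query.slice(0, MAX_QUERY_LENGTH) + '...' : query;
  }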

The trace ID discrepancy is VERY interesting, that gives further evidence for an error in cls-hooked as well

I can also confirm that this happens in all 5 cases we have seen so far.

We have also seen that the documentation for the "streaming threshold" configuration mentions this error. We have it configured to 5, which I guess is fairly low already; shall we attempt to lower it even further?
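
For reference, this is roughly how we set it at startup (a minimal sketch; the call itself is the standard aws-xray-sdk setStreamingThreshold API):

  const AWSXRay = require('aws-xray-sdk');

  // Lower the threshold at which the SDK starts streaming completed
  // subsegments to the daemon, instead of holding everything on the
  // parent segment (the SDK default is 100).
  AWSXRay.setStreamingThreshold(5);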

Also, to give you slightly more context, we do capture more or less everything:

  AWSXRay.capturePostgres(require('pg'));
  AWSXRay.captureHTTPsGlobal(require('http'), false);
  AWSXRay.captureHTTPsGlobal(require('https'), false);
  AWSXRay.capturePromise();

If the CPU profile dump can help you, happy to send it over email.

willarmiros commented 4 years ago

Is there any sort of process that tries to re-send (more specifically, re-serialize) subsegments if they somehow failed to be sent?

There are no retries for sending or serialization; since the segments are sent via UDP and are sampled anyway, we accept some lossiness over the cost of retries. The segment_emitter is where we stream them, and this is how we serialize them. Just from the profiler output it seems like we might unknowingly be doing some unnecessary double serialization, but that shouldn't have this much of an impact.
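
For context, the emit path boils down to roughly the following (a simplified sketch, not the actual segment_emitter source): serialize the document once, prepend the daemon protocol header, and fire a single UDP datagram with no retry.

  // Simplified sketch of the emit path (not the actual segment_emitter code):
  // serialize once, prepend the daemon protocol header, send one UDP datagram.
  const dgram = require('dgram');

  const DAEMON_ADDRESS = '127.0.0.1';
  const DAEMON_PORT = 2000;
  const PROTOCOL_HEADER = '{"format": "json", "version": 1}\n';

  function emit(segmentDocument) {
    const message = Buffer.from(PROTOCOL_HEADER + JSON.stringify(segmentDocument));
    const socket = dgram.createSocket('udp4');
    socket.send(message, 0, message.length, DAEMON_PORT, DAEMON_ADDRESS, (err) => {
      // Lossy by design: on failure we log and move on rather than retry.
      if (err) console.error('Failed to send segment over UDP:', err);
      socket.close();
    });
  }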

How many of these subsegments are you creating for a single request? If you could also send or attach an example trace that couldn't get sent, or even just one of its subsegments, that could be very useful. It would also be nice to see the hierarchy of the subsegments to understand how useful the streaming threshold would be.

We have also seen that the documentation for the "streaming threshold" configuration mentions this error. We have it configured to 5, which I guess is fairly low already; shall we attempt to lower it even further?

Unfortunately, as pointed out in #283, setStreamingThreshold currently has a bug with streaming subsegment trees, so I don't think it would help.

Do you know if you or any packages you depend on run any long-term polling processes in the background? Sometimes with captureHTTPsGlobal enabled, those processes can accrue lots of subsegments over time, which occasionally causes SegmentTooLarge errors. If all or the vast majority of the subsegments are SQL-related (as opposed to HTTP), though, this is likely not the issue.

If the CPU profile dump can help you, happy to send it over email.

That would be great, you can reach me at: armiros (at) amazon.com.

sroze commented 4 years ago

How many of these subsegments are you creating for a single request? If you could also send or attach an example trace that couldn't get sent, or even just one of its subsegments, that could be very useful. It would also be nice to see the hierarchy of the subsegments to understand how useful the streaming threshold would be.

Far too many, clearly. In the latest instance where we managed to reproduce the issue, 3,700 SQL subsegments were sent to X-Ray; I suspect many more occurred but were not sent because of the numerous Segment too large errors. Note that this clearly should not have been engineered this way, but because we create segments for workers too, we expect such things to happen elsewhere. This number of subsegments is also way beyond what X-Ray's user interface can cope with (it takes 30s+ to render), so maybe we should stop X-Ray'ing after a certain number of subsegments?
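
Concretely, the kind of guard I have in mind would be something like this (a hypothetical helper on our side, not an SDK feature; the name and the 500 limit are made up for illustration):

  // Hypothetical guard in our own code (not an SDK feature): skip any extra
  // manual capture once the current (sub)segment already carries many children.
  const AWSXRay = require('aws-xray-sdk');

  const MAX_SUBSEGMENTS = 500; // arbitrary cut-off for illustration

  function shouldKeepTracing() {
    const current = AWSXRay.getSegment();
    if (!current) return false;
    return !current.subsegments || current.subsegments.length < MAX_SUBSEGMENTS;
  }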

Do you know if you or any packages you depend on run any long-term polling processes in the background? Sometimes with captureHTTPsGlobal enabled, those processes can accrue lots of subsegments over time, which occasionally causes SegmentTooLarge errors. If all or the vast majority of the subsegments are SQL-related (as opposed to HTTP), though, this is likely not the issue.

No, we don't have such things.

That would be great, you can reach me at: armiros (at) amazon.com.

Just did, with the raw X-Ray trace I mentioned above.

sroze commented 4 years ago

because they fail to send they're somehow left stale in cls-hooked, our context management system, causing memory leaks.

Your intuition is probably confirmed by the following graphs. We isolated the faulty container (in orange) at 15:50 UTC (i.e. it received no new HTTP traffic), and the 100% CPU usage continued until 23:25 UTC, which is also when the memory usage dropped by ~0.5 GB. Equally, we can see that there is still plenty of memory in use which probably should have been garbage collected.

(CPU and memory usage graphs attached.)

Again, if it could help your investigation, we also took a memory heap snapshot.
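
For reference, one way to capture such a snapshot in-process is along these lines (a minimal sketch, not necessarily how we took ours; it assumes Node.js 11.13 or later, where v8.writeHeapSnapshot is available):

  // Write a heap snapshot on demand; the resulting .heapsnapshot file can be
  // opened in Chrome DevTools. Requires Node.js 11.13 or later.
  const v8 = require('v8');

  process.on('SIGUSR2', () => {
    const file = v8.writeHeapSnapshot();
    console.log('Heap snapshot written to', file);
  });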

willarmiros commented 4 years ago

We will continue to investigate this. As a shot in the dark, are you using the AWS_NODEJS_CONNECTION_REUSE_ENABLED environment variable, and if so could you try disabling it and seeing if the same behavior is exhibited?

sroze commented 4 years ago

@willarmiros no, we don't have this environment variable enabled actually, sorry.

artysidorenko commented 4 years ago

We will continue to investigate this. As a shot in the dark, are you using the AWS_NODEJS_CONNECTION_REUSE_ENABLED environment variable, and if so could you try disabling it and seeing if the same behavior is exhibited?

Hi again @willarmiros, I also work with @sroze - we do actually have the AWS_NODEJS_CONNECTION_REUSE_ENABLED setting enabled (I added it as part of an unrelated exercise, when we moved to using IAM roles for service accounts on our Kubernetes cluster). We'll run some tests with it disabled and let you know how it goes.

willarmiros commented 4 years ago

Hi @artysidorenko - Have you had a chance to look into this yet?

jmattiace commented 4 years ago

We are seeing a similar issue: many Segment too large to send errors and, subsequently, CPU averaging around 60%. When we set setStreamingThreshold to 1, the errors are gone and we have normal CPU usage again.

I'm not sure if this is the proper long-term solution, but it's at least getting us by for now. For some insight, we have an Express application where we kick off a 5-second setInterval from within one of our routes to run a process indefinitely. So I'm assuming that is causing the error we are seeing, perhaps because the parent segment never closes?

willarmiros commented 4 years ago

Hi @jmattiace, it seems like that route with the indefinite process is probably not a good candidate for automatic tracing - so maybe you should define that route outside of the openSegment and closeSegment middleware, or manually call segment.close() and segment.flush() to avoid it never closing.
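
Roughly what I mean, sketched against a generic Express setup (illustrative only; the route and service names are placeholders to adapt to your app):

  // Sketch of the suggested restructuring: register the long-running route
  // before the X-Ray Express middleware so it never opens an auto-managed
  // segment that can't close.
  const express = require('express');
  const AWSXRay = require('aws-xray-sdk');

  const app = express();

  // Registered before openSegment, so this route is not traced automatically.
  app.get('/start-background-job', (req, res) => {
    setInterval(() => {
      // periodic work; trace it manually with new AWSXRay.Segment(...) if needed
    }, 5000);
    res.sendStatus(202);
  });

  app.use(AWSXRay.express.openSegment('my-service'));

  // Normal request/response routes stay inside the middleware pair.
  app.get('/health', (req, res) => res.sendStatus(200));

  app.use(AWSXRay.express.closeSegment());

  app.listen(3000);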

However, even with the Segment too large errors, you should not be seeing that kind of CPU usage. There has to be some sort of failed garbage collection when the Segment too large error occurs.

artysidorenko commented 4 years ago

Apologies for the super-late reply. I had moved teams and wasn’t really working much with Xray anymore.

I ran a couple of tests myself last week - locally, so the environment wasn't exactly the same, though still using a daemon connected to AWS etc. - but didn't find that changing the TCP re-use env variable made any difference to the CPU consumption. (I did notice that CPU and memory consumption shot up when the connection to the daemon got dropped while processing traces, but I don't think that happened in the original issue.)