aws / aws-xray-sdk-node

The official AWS X-Ray SDK for Node.js.
Apache License 2.0
268 stars 155 forks source link

Client-side timeouts requests are not sent to X-Ray #446

Open SebC99 opened 3 years ago

SebC99 commented 3 years ago

Same as: https://github.com/aws/aws-xray-sdk-go/issues/289

When a request is too long to process by the server, and the client (or load balancer) closes the connexion because of a timeout, X-Ray does not send any info on Xray, which means we can't see what took so long during these requests...

willarmiros commented 3 years ago

Hi @SebC99, Can you give some repro JS code for a traced client-side request which fails to record anything due to a timeout? Timeouts should be recorded as a normal client-side 4xx error on the subsegment and recorded as part of the trace. If the timeout raises an unchecked error then the segment will fail to close and the trace may not be recorded, but otherwise I'd agree this is a bug.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in next 7 days. Thank you for your contributions.

thetumper commented 2 years ago

We have observed something similar. Our lambda has an SQS event source with a batch size > 1. For each event in the batch, we call out to another API. If one of them is long-running, we have a code-based "time out" in our lambda so that we manually delete all the successful SQS messages in the batch, instead of letting the lambda time out due to the 1 that is long-running and causing the whole batch to be put back for redrive. Because our lambda throws an error for the one that is "timed out" (after deleting the other messages), X-Ray does not record the outbound call that was long-running.

laireyang commented 2 years ago

Any update on this? @willarmiros I have bumped into a similar issue. It seems like X-Ray only shows API calls that have returned responses. So if a connection is taking too long and causes the lambda function to time out, this API call is actually not reflected in X-Ray. Example script:

import requests
from aws_xray_sdk.core import patch_all

patch_all()

def lambda_handler(event, context):
    response = requests.get("http://httpstat.us/200")
    print(f"response status code: {response.status_code}")
    # This API call will time out the lambda as this lambda function is set to time out after 3 secs
    response = requests.post("http://httpstat.us/202?sleep=3000")
    print(f"response status code: {response.status_code}")

and this is what we see on the X-Ray trace: image-20220405-145824

Is there a way to make X-ray show those API calls that are taking too long to process and causing lambda to time out?

willarmiros commented 2 years ago

Hi @laireyang,

Unfortunately this is not possible today, because when a Lambda function times out the function runtime is immediately frozen meaning no further operations are possible, so once the timeout occurs it's too late to send out additional subsegment data. Lambda is working on an extension to improve this experience, but for now I'll label this as an enhancement request.

clarkmcc commented 2 years ago

This behavior is actually pretty frustrating, considering it's the invocations that are timing out that I'd most like to see traces for.

rmorshea commented 2 years ago

I'm experiencing something similar while working with the AWS Distro for OpenTelemetry.

rdsedmundo commented 1 year ago

Any updates from the Lambda team on this?

I completely agree with this:

considering it's the invocations that are timing out that I'd most like to see traces for

It's literally the whole reason we just added X-Ray to our application, and it helped with nothing because of this limitation.

rdsedmundo commented 1 year ago

@willarmiros doesn't the new AWS Lambda Telemetry API helps at all here? It looks like Lumigo is leveraging that to detect timeouts.

savage-alex commented 11 months ago

Hi, any updates on this feature request please?

wangzlei commented 10 months ago

Unfortunately there is no progress yet, the behavior is caused by Lambda environment. When Lambda timeout, it immediately freeze the sandbox so the ongoing client span is blocked in memory. This span will be flushed if the next Lambda invocation hits the sandbox. Else the span would be lost when the Lambda sandbox close.

We would like get Lambda's attention for this issue and wait for AWS Lambda provide the solution.