aws / aws-xray-sdk-go

AWS X-Ray SDK for the Go programming language.
Apache License 2.0
280 stars 118 forks source link

All HTTP traces involving timeouts are silently deleted (Lambda worst affected) #289

Open vcschapp opened 3 years ago

vcschapp commented 3 years ago

Issue Type

Bug.

Expected Behavior

I have decorated an http.Client using xray.Client. When I make an HTTP request using the decorated client and the request times out due to a client-side timeout value, an X-Ray trace should be published and it should include the timeout error.

Actual Behavior

No X-Ray trace is published. It is as if the HTTP request never happened at all.

Reproduction Steps

See the example code at bottom for a self-contained Lambda function which repros the problem.

Root Cause (Hypothesis)

The root cause of this problem seems to be that when a timeout error happens (or the HTTP request's root context is cancelled for any other reason), some of the sub-subsegments under HTTPSubsegments are left dangling.

In particular, if the timeout or context cancellation happens during one of the connection setup phases (DNS, TCP connect, TLS handshake) then a sub-sub-segment below the connect sub-segment is left open. For whatever reason, this results in X-Ray

The AWS SDK for X-Ray (Go) code tries to address this issue by calling GotConn in case of error (see in xray/client.go) but the problem is this doesn't address open sub-sub-segments of connect.

Notes

  1. While this issue can affect any application using X-Ray in Go, I found it particularly pervasive while writing Lambda functions in Go because during a Lambda cold start, a TCP connect/TLS handshake often takes a very long time.
  2. I don't actually use the xray.Client code. I am using the AWS X-Ray SDK for Go to write my own X-Ray plugin for a retrying HTTP framework for Go called httpx (my plugin is on GitHub at gogama/aws-xray-http). For my use case, it would be helpful if you simply extend xray.HTTPSubsegments with a new exported method called Error(err error) which just "does the right thing" and walks the sub-segment tree closing off every sub-segment which is open in an appropriate manner.

Example Lambda Function

  1. Create a Lambda function using the Go 1.x runtime.
  2. Use the code below.
  3. Enable X-Ray active tracing for your Lambda function.
  4. Use the following test input: {"url":"https://jsonplaceholder.typicode.com/todos/1","timeout": 100}
  5. Tweak the input timeout and AWS region where you are running your function until you get a timeout. Timeout is more likely during cold start.
  6. Go to the X-Ray console and look at the trace for your request that timed out. Note that there is no HTTP trace at all.
package main

import (
    "context"
    "fmt"
    "io/ioutil"
    "net/http"
    "time"

    "github.com/aws/aws-lambda-go/lambda"
    "github.com/aws/aws-xray-sdk-go/xray"
)

func main() {
    lambda.Start(handler)
}

type event struct {
    URL     string `json:"url"`
    Timeout int64  `json:"timeout"`
}

func handler(ctx context.Context, evt event) (string, error) {
    cl := xray.Client(&http.Client{
        Timeout: time.Duration(evt.Timeout) * time.Millisecond,
    })

    req, _ := http.NewRequestWithContext(ctx, "GET", evt.URL, nil)
    resp, err := cl.Do(req)
    if err != nil {
        fmt.Printf("Error sending basic HTTP request: %q\n", err.Error())
        return "", err
    }
    fmt.Printf("Basic HTTP response status: %d\n", resp.StatusCode)
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        fmt.Printf("Error reading basic HTTP response body: %q\n", err.Error())
        return "", err
    }
    fmt.Printf("Basic HTTP response body: %s\n", string(body))

    return string(body), nil
}
bhautikpip commented 3 years ago

Hi @vcschapp ,

Thanks for sharing detailed repro steps. I think one of the reason why there are no trace sent to x-ray could be because of the timeout you set for your request. I see that you have set request timeout for 100 seconds. You probably have to check your lambda function timeout as well. I am guessing default lambda timeout is 3 seconds. So looks like your lambda function is exited even before your request completes and therefore SDK never gets a chance to close the segment and send it to X-Ray service. I would recommend you to try out request timeout which is less than lambda function timeout and see if you're still encountering this issue. If yes, then I would be happy to help.

vcschapp commented 3 years ago

Hi @bhautikpip, Thanks for such a quick reply.

I can confirm the Lambda function is not timing out. This is a bug in the way the X-Ray SDK is working with the request. (I see the point of confusion was the request timeout. It is set to 100 milliseconds, not seconds!)

I'm able to work around it by forcing the X-Ray SDK to emit the segment with the dangling sub-segments by setting its ContextDone flag to true. However I'm obviously misusing ContextDone and would prefer not to.

dnagir commented 6 months ago

From what I can observe, this seems to be happening on the lambdas where the lambda timeout (~5 mins) is way longer than the HTTP client timeout (60s). Wonder if anyone else is seeing this given how old this issue is?