aws / aws-sdk-go-v2

AWS SDK for the Go programming language.
https://aws.github.io/aws-sdk-go-v2/docs/
Apache License 2.0

DynamoDB returns operation error `use of closed network` #1825

Closed satorunooshie closed 3 months ago

satorunooshie commented 2 years ago

Describe the bug

The SDK often returns the error below after updating the SDK version.

Actual error:

operation error DynamoDB: Query, $tableName: https response error StatusCode: 200, RequestID: 123456, deserialization failed, failed to decode response body, read tcp $src->$dst: use of closed network connection

Expected Behavior

No errors

Current Behavior

operation error DynamoDB: Query, $tableName: https response error StatusCode: 200, RequestID: 123456, deserialization failed, failed to decode response body, read tcp $src->$dst: use of closed network connection

Reproduction Steps

    ddb := dynamodb.NewFromConfig(defaultCfg)
    result, err := ddb.Query(
        ctx,
        &dynamodb.QueryInput{
            TableName:                 aws.String(tableName),
            ExpressionAttributeNames:  expr.Names(),
            ExpressionAttributeValues: expr.Values(),
            KeyConditionExpression:    expr.KeyCondition(),
        },
    )
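
For context, `expr` in the snippet above is presumably built with the github.com/aws/aws-sdk-go-v2/feature/dynamodb/expression module listed below; a minimal sketch (the key name "pk" and the value id are placeholders) might look like:

    expr, err := expression.NewBuilder().
        WithKeyCondition(expression.Key("pk").Equal(expression.Value(id))).
        Build()
    if err != nil {
        // handle the expression build error
    }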

Possible Solution

No response

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

    github.com/aws/aws-sdk-go-v2 v1.16.8
    github.com/aws/aws-sdk-go-v2/config v1.15.15
    github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.9.8
    github.com/aws/aws-sdk-go-v2/feature/dynamodb/expression v1.4.14
    github.com/aws/aws-sdk-go-v2/service/dynamodb v1.15.10
    github.com/aws/smithy-go v1.12.0

Compiler and Version used

1.19

Operating System and version

Linux

RanVaknin commented 2 years ago

Hi @satorunooshie ,

I saw that you opened 2 issues, but I'm going to answer them both here:

The reason you are seeing a status 200 followed by a deserialization error is that the DynamoDB server is able to send back a valid response, but before the client finishes reading the body of that successful response, the connection errors out. This can happen for many reasons and is difficult to pinpoint, especially when the pattern is inconsistent, as in your case.

Regarding your other ticket and PR: we cannot accept your PR to override the Retryer behavior because not every API operation is idempotent. The service API models don't carry information to determine whether an operation is idempotent, so the SDKs cannot assume it is safe to retry.

Proposed solution: you can implement your own Retryer at the client level or at the operation level and configure it to retry when the connection fails.
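
For illustration, here is a minimal sketch of the operation-level variant. The wrapper type name is made up, `errors` and `net` are the standard library packages, and `ddb`, `ctx`, and `queryInput` stand for the client and input from the reproduction steps above:

    // connRetryer wraps the configured retryer and additionally treats
    // net.ErrClosed ("use of closed network connection") as retryable.
    // Only do this for operations you know are safe to repeat, such as Query.
    type connRetryer struct {
        aws.Retryer
    }

    func (r connRetryer) IsErrorRetryable(err error) bool {
        return errors.Is(err, net.ErrClosed) || r.Retryer.IsErrorRetryable(err)
    }

    // Override the retryer for a single call via the per-operation functional options.
    result, err := ddb.Query(ctx, queryInput, func(o *dynamodb.Options) {
        o.Retryer = connRetryer{o.Retryer}
    })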

Let me know if you have any other questions and I can answer them here 🙂.

Thank you! ~Ran

github-actions[bot] commented 2 years ago

This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

SaikyoSaru commented 2 years ago

We are seeing the same issue as well; can we please resurrect this issue?

psmarcin commented 2 years ago

I was able to fix this issue, but I wasn't able to reproduce the error 100% of the time, so I can't create a PR.

Before deploying my fix we had ~6 errors per minute; after the deploy there are no errors.

Here is my fork of smithy-go: https://github.com/psmarcin/smithy-go. There is only one small change: removing the conditional body closing.

You can give it a try by simply adding this to your go.mod file:

replace github.com/aws/smithy-go => github.com/psmarcin/smithy-go v0.0.0-20220923084642-9e31178f1156

cwd-nial commented 1 year ago

@RanVaknin or any other contributor, can you review the solution @psmarcin is suggesting? We are having the same issue mentioned above:

operation error DynamoDB: Query, https response error StatusCode: 200, RequestID: 1263a6b7-ebec-4a51-9594-b647545da4fa, deserialization failed, failed to decode response body, read tcp 172.17.0.12:36206->172.17.0.15:8000: use of closed network connection\n

RanVaknin commented 1 year ago

Hi everyone,

I did not see the activity on this thread after it closed. My last attempt at reproducing this was not fruitful, but I will take another look and review the fix @psmarcin suggested.

Thank you, Ran~

psmarcin commented 1 year ago

@RanVaknin I have been running my fork in a high-load environment without any problems for about 3 months now. Would love to know the root cause.

cwd-nial commented 1 year ago

In our case, it is a weird combination of the smithy version and the Go version that causes the issue:

    go 1.19 + smithy 1.10.0 = OK
    go 1.19 + smithy 1.11.1 = fails sometimes
    go 1.19 + smithy 1.13.5 = fails sometimes

The behavior is not consistent; sometimes it fails and sometimes it does not, so these are not proven OK or failing combinations. But before our upgrade from Go 1.17 to 1.19 we had no problems (and we were using smithy v1.13.4...).

So anyway, it looks to me like @psmarcin's fix that takes out the closing block could work, but what would the downsides of this missing close statement be?

In the original commit that introduced it, it was written:

// HTTP RoundTripper *should* close the request body. But this may not happen in a timely manner.
// So instead Smithy *Request Build wraps the body to be sent in a safe closer that will clear the
// stream reference so that it can be safely reused.
if builtRequest.Body != nil {
    _ = builtRequest.Body.Close()
}

So not sure what effect this removal would have...

sugymt commented 1 year ago

We probably have the same problem with SQS DeleteMessage. (Request ID and source IP are masked.)

operation error SQS: DeleteMessage, https response error StatusCode: 200, RequestID: xxxx-xxxx-xxxx-xxxx-xxxx, deserialization failed, failed to discard response body, read tcp xxx.xxx.xxx.xxx:xxx->18.183.37.45:443: use of closed network connection

go 1.19.3, aws-sdk-go-v2 v1.17.2

a-h commented 1 year ago

I've seen this issue sporadically (4-5 times) in the last few weeks in a failing unit test which connects to DynamoDB Local running in Docker, but I haven't seen it in production.

https response error StatusCode: 200, RequestID: 94f949c6-46fe-47ad-9a8a-6227339b616c, deserialization failed, failed to decode response body, read tcp 172.18.0.2:57392->172.18.0.3:8000: use of closed network connection

I have tried to reproduce it locally, but with no success. I've tried multithreading calls to DynamoDB Local, stressing it, etc.

In production, I've got 1M+ daily DynamoDB calls using github.com/aws/aws-sdk-go-v2/service/dynamodb v1.17.8 running without issue.

iwata commented 1 year ago

I am using this package to use AWS Rekognition and have the same error. I have the latest versions; when will this be fixed? Or is there any workaround?

    github.com/aws/aws-sdk-go-v2 v1.17.3
    github.com/aws/aws-sdk-go-v2/config v1.18.8
    github.com/aws/aws-sdk-go-v2/service/rekognition v1.23.0
    github.com/aws/smithy-go v1.13.5

cwd-nial commented 1 year ago

You can fork smithy-go with a small fix that comments out a three-line block to avoid closing the request body too early, just as @psmarcin did and as we are now also doing for our production services, with no problems at all.

https://github.com/psmarcin/smithy-go https://github.com/cwd-nial/smithy-go

Maybe, as @a-h commented, the issue only happens when connecting to DynamoDB Local running in Docker?

iwata commented 1 year ago

I found that I could retry, so I was able to avoid this by setting RetryMaxAttempts. https://github.com/aws/aws-sdk-go-v2/pull/1826
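
For reference, here is a minimal sketch of setting it, assuming an SDK version recent enough to expose the RetryMaxAttempts option on the config loader and the client options; the value 5 is arbitrary, and whether this particular error is actually retried still depends on the configured retryer's IsErrorRetryable:

    // Raise the retry budget for every client built from this config ...
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRetryMaxAttempts(5))
    if err != nil {
        // handle error
    }

    // ... or raise it for a single client only.
    rek := rekognition.NewFromConfig(cfg, func(o *rekognition.Options) {
        o.RetryMaxAttempts = 5
    })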

isaiahvita commented 1 year ago

Hello all,

Thanks for the engagement. Unfortunately, I have not been able to reproduce the issue. The original hypothesis was that this error appears after successful responses whose deserialization takes very long, which creates an opportunity for something to close the connection. I tried hitting DDB with maximum response bodies as well as artificially injecting long wait times during deserialization, with no luck reproducing. It would be helpful if anyone seeing this error could provide as much detail as possible about the circumstances in which it occurs: what are the characteristics of your data? Did the error start occurring after a sudden change? Anything else you think is relevant, etc.

Additionally, I looked into the solution proposed by @psmarcin, which is to remove the following code:

    if builtRequest.Body != nil {
        _ = builtRequest.Body.Close()
    }

It's helpful to note that this code closes the request body in the event that the underlying Transport RoundTripper of the Go HTTP client failed to close it already. https://github.com/aws/aws-sdk-go-v2/blob/main/service/dynamodb/api_client.go#L279 https://github.com/golang/go/blob/master/src/net/http/request.go#L185 https://github.com/golang/go/blob/master/src/net/http/client.go#L62

So the idea that the above code in smithy-go prematurely closes the connection seems a bit dubious, because at the point at which that code executes, the underlying HTTP client should have already closed it. All this is to say that when response deserialization happens, the request body should have already been closed under normal circumstances. This is not to rule it out as a potential fix, but it would need a bit more investigation.

Additionally, it would be helpful if any of those affected could explain why a proper retry strategy would not fit their use case (as suggested by @RanVaknin above).

github-actions[bot] commented 1 year ago

This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

mikkael131 commented 1 year ago

Just leaving a comment here to confirm that this problem still exists with the following versions and a locally running DynamoDB Docker instance in GitLab:

    Go v1.20.3
    github.com/aws/aws-sdk-go-v2 v1.17.7
    github.com/aws/aws-sdk-go-v2/config v1.18.2
    github.com/aws/aws-sdk-go-v2/credentials v1.13.2
    github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.10.19
    github.com/aws/aws-sdk-go-v2/service/dynamodb v1.19.2
    github.com/aws/smithy-go v1.13.5
    amazon/dynamodb-local:1.21.0

A retry strategy cannot always be used to retry this error, for the same reason you are not adding it as a retryable error at the SDK level: e.g. appending an element to a list in DynamoDB with an update expression is not idempotent and cannot be retried, since we do not know whether the request was successful or not.
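
To illustrate, here is a hypothetical non-idempotent update (table, key, and attribute names are made up; `types` is github.com/aws/aws-sdk-go-v2/service/dynamodb/types). If the first attempt actually succeeded but its response was lost to this error, a blind retry would append the element twice:

    // Appending to a list is not idempotent: executing this twice appends :e twice.
    _, err := ddbClient.UpdateItem(ctx, &dynamodb.UpdateItemInput{
        TableName: aws.String("orders"), // hypothetical table
        Key: map[string]types.AttributeValue{
            "pk": &types.AttributeValueMemberS{Value: "order-1"}, // hypothetical key
        },
        UpdateExpression: aws.String("SET events = list_append(events, :e)"),
        ExpressionAttributeValues: map[string]types.AttributeValue{
            ":e": &types.AttributeValueMemberL{Value: []types.AttributeValue{
                &types.AttributeValueMemberS{Value: "shipped"},
            }},
        },
    })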

Currently our BDD test suite fails about 5% of the time in the GitLab pipeline because of this error, which is mildly annoying. When our tests fail, it is usually in roughly the same scenario and steps, so I assume it could be related to the lifecycle of a connection, either from the client's or from local DynamoDB's perspective.

To me it seems that this is happening when the running system is under heavy load.

lucix-aws commented 1 year ago

As far as the error being returned is concerned - the SDK is behaving correctly by not retrying in this case. As people have noted in this thread and others, we DO NOT retry this class of error (ECONNRESET) because it is not deterministically safe to do so.

As for the root cause, we have no hand in the networking internals of SDK requests; we defer to the configured HTTP client, which is either the standard library implementation in the default case or one provided by the user.

jcarter3 commented 1 year ago

This is just one of a thousand paper cuts caused by the use of smithy-generated code. We've already had to "downgrade" a few services to aws-sdk-go v1 to work around a bunch of S3 edge cases. Is downgrading the preferred solution for all of these issues?

BananZG commented 8 months ago

Just upgraded my project from SDK v1 to the latest SDK v2 and started seeing this error.

github.com/aws/aws-sdk-go-v2/service/dynamodb v1.28.1
github.com/aws/smithy-go v1.20.0

drshriveer commented 8 months ago

We are also having significant developer pain when running tests with GitHub Actions. Once we reach a critical mass of tests, they begin to fail consistently, not intermittently. As others have suggested, we tried forking the repo and commenting out the suggested block of code. I also concluded (as did @isaiahvita) that the commented-out block should not have an impact on this issue, but with nothing else to go on, it was worth a try. And removing the block of code did stop the failures (at least until we built up another critical mass of tests, so for a month or so, a beautiful, consistent month).

I spent more time attempting to debug the issue in smithy itself, as I had a path to reproduce it 100% of the time, and quickly learned that any extra infrastructure I added to help debug (with or without the commented-out block) tended to "fix" the issue.

To me this seems to indicate there is a real race condition somewhere in the code, and that simply adding or subtracting work shuffles the load just enough that the issue is less likely to occur. The fact that the issue only occurs after upgrading to SDK v2 should also be a clear sign of this. Alas, I did not have time to continue my investigation, and now the error is back.

Please re-open the issue.

zapisanchez commented 8 months ago

go 1.21
github.com/aws/aws-sdk-go-v2 v1.25.2
github.com/aws/aws-sdk-go-v2/config v1.27.5
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.13.7
github.com/aws/aws-sdk-go-v2/feature/dynamodb/expression v1.7.7
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.30.2
github.com/aws/aws-sdk-go-v2/service/s3 v1.51.2
github.com/aws/smithy-go v1.20.1

Well... another one here. It seems incredible that after 2 years there is no official fix.

We are facing the same problem. We try to read the response body, but we get a "use of closed network connection" error instead.

It's clear that closing the body, as @psmarcin and @isaiahvita indicate, causes a race condition.

It would be nice to have a better solution than maintaining our own fork with those lines commented out.

stan-stately commented 7 months ago

I'm also seeing this. It is making my CI basically unusable.

RanVaknin commented 7 months ago

Hi everyone,

Since more and more people are reporting being impacted I have re-opened this issue.

In our previous reproduction attempts we were not able to recreate the reported behavior. The next step is to ask one of the customers on this thread to provide a minimal GitHub repository with code that can reliably reproduce the behavior (even if it means seeing intermittent errors reliably).

That would give us a little more to go off of.

Thanks again, Ran~

CurryFishBalls9527 commented 7 months ago

we are also seeing this issue happen intermittently in all of our environments.

mcblair commented 7 months ago

We have been seeing this intermittently in all of our environments too. I'm curious whether, in the short term, avoiding concurrent use of the DDB client would reduce the frequency of occurrence. We will try to put together a minimal example to reproduce the behavior, but I'm not sure when we can.

csepulveda commented 6 months ago

I got the same issue using github.com/aws/aws-sdk-go-v2/service/sqs v1.31.4. The issue showed up after we upgraded from sdk-v1 to sdk-v2.

RanVaknin commented 6 months ago

Hi @CurryFishBalls9527 , @mcblair , @csepulveda ,

Please provide complete reproduction code so we can re-attempt to reproduce this behavior.

https://github.com/aws/aws-sdk-go-v2/issues/1825#issuecomment-2040667764

Thanks, Ran~

lucix-aws commented 5 months ago

To those affected by this issue, two things:

retrying these errors

You can make the underlying error net.ErrClosed retryable in your client with the following:

type retryErrClosed struct {
    aws.Retryer
}

func (r *retryErrClosed) IsErrorRetryable(err error) bool {
    return errors.Is(err, net.ErrClosed) || r.Retryer.IsErrorRetryable(err)
}

svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
    o.Retryer = &retryErrClosed{o.Retryer}  
})

This should be safe to do for get-type operations such as dynamodb.Query.

root-causing

As of this writing we have been unable to reproduce this in-house whatsoever. I'm going to ask that anyone affected by this issue implement some basic net.Conn tracing to help grant more visibility into where the underlying Close() calls are coming from. The following DialContext implementation, when provided to a transport, will save the stack trace on a Close() call, and dump it to stdout when a Read() on the underlying conn catches this.

If you are affected by this issue and are in a position to do so, please implement this tracing logic in your clients and share any of the stack traces produced from it:


import (
    "context"
    "fmt"
    "net"
    "net/http"
    "runtime/debug"

    awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

type dialContext func(ctx context.Context, network, addr string) (net.Conn, error)

func traceConn(dc dialContext) dialContext {
    return func(ctx context.Context, network, addr string) (net.Conn, error) {
        conn, err := dc(ctx, network, addr)
        return &tracedConn{Conn: conn}, err
    }
}

type tracedConn struct {
    net.Conn
    closeTrace string
}

func (c *tracedConn) Read(p []byte) (int, error) {
    n, err := c.Conn.Read(p)
    if len(c.closeTrace) > 0 {
        fmt.Println()
        fmt.Println("!!! READ ON CLIENT-CLOSED CONN !!!")
        fmt.Println(c.closeTrace)
        fmt.Println("!!! -(end trace)-------------- !!!")
        fmt.Println()
    }
    return n, err
}

func (c *tracedConn) Close() error {
    c.closeTrace = string(debug.Stack())
    return c.Conn.Close()
}

func main() {
    // ...

    svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
        client := o.HTTPClient.(*awshttp.BuildableClient)
        o.HTTPClient = client.WithTransportOptions(func(t *http.Transport) {
            t.DialContext = traceConn(t.DialContext)
        })
    })

    // ...
}

minhquang4334 commented 4 months ago

I'm also seeing this when using aws-sdk-go-v2 v1.27.2 and aws-sdk-go-v2/service/dynamodb v1.32.8

cenapguldh commented 4 months ago

We also see these errors (but in SQS ReceiveMessage) in all of our environments.

operation error SQS: ReceiveMessage, https response error StatusCode: 200, RequestID: xxxx-xxx-xxxxx--xxxx, deserialization failed, failed to decode response body, read tcp src->dst: use of closed network connection

github.com/aws/aws-sdk-go-v2/service/sqs v1.33.1
github.com/aws/aws-sdk-go-v2 v1.30.0
github.com/aws/smithy-go v1.20.2

RanVaknin commented 3 months ago

Hi @iwata @csepulveda @a-h @CurryFishBalls9527 @drshriveer @cenapguldh @minhquang4334 @satorunooshie @cwd-nial @mikkael131 @BananZG @mcblair @sugymt @SaikyoSaru @zapisanchez @jcarter3 @psmarcin and anyone else I might have missed here.

First off, thank you all for your continued patience and detailed contributions to this thread. We recognize that this issue has been persistent and challenging.

Despite our efforts, we still haven’t been able to reproduce this issue internally, which complicates our ability to pinpoint the exact cause and develop a fix. Both @lucix-aws and I have requested additional information to aid in this investigation, but it seems we need to make these requests more visible.

To ensure that all necessary details are clearly visible and to help guide new people running into this issue, we are going to lock this thread, and I'm also going to pin it. We ask anyone still experiencing this problem to please open a new issue and include all the details outlined in @lucix-aws's recent comment. This information is crucial for us to attempt a detailed reproduction of the problem.

Thank you once again for your efforts and cooperation. Ran~