Azure / azure-sdk-for-js

This repository is for active development of the Azure SDK for JavaScript (NodeJS & Browser). For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/javascript/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-js.
MIT License

Socket hang up start to show up on 12.9.0 #24904

Open zhangxin511 opened 1 year ago

zhangxin511 commented 1 year ago

Describe the bug: We have started to see "socket hang up" errors. Some example errors:

{"message"=>"request to https://frsprodwestus2001.blob.core.windows.net/57c8dc81-6b68-4dab-8a58-01323b5c35ed?restype=container failed, reason: socket hang up", "type"=>"system", "errno"=>"ECONNRESET", "code"=>"ECONNRESET", "name"=>"FetchError", "stack"=>"FetchError: request to https://frsprodwestus2001.blob.core.windows.net/57c8dc81-6b68-4dab-8a58-01323b5c35ed?restype=container failed, reason: socket hang up\n    at ClientRequest.<anonymous> (/usr/src/server/bundle/www.js:73174:18)\n    at ClientRequest.emit (events.js:400:28)\n    at TLSSocket.socketOnEnd (_http_client.js:499:9)\n    at TLSSocket.emit (events.js:412:35)\n    at endReadableNT (internal/streams/readable.js:1333:12)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)"}
{"message"=>"request to https://frsprodwestus2001.blob.core.windows.net/1e7260d5-247e-4203-ab22-5d7d6ca4b86d/49c4d096-780f-4143-979e-36b4b2380ddf%2Fobjectstore%2FSequencedOperation%2F61c92b5c-76ae-4b42-9a24-8940591bfd76 failed, reason: socket hang up", "type"=>"system", "errno"=>"ECONNRESET", "code"=>"ECONNRESET", "name"=>"FetchError", "stack"=>"FetchError: request to https://frsprodwestus2001.blob.core.windows.net/1e7260d5-247e-4203-ab22-5d7d6ca4b86d/49c4d096-780f-4143-979e-36b4b2380ddf%2Fobjectstore%2FSequencedOperation%2F61c92b5c-76ae-4b42-9a24-8940591bfd76 failed, reason: socket hang up\n    at ClientRequest.<anonymous> (/usr/src/server/bundle/www.js:73174:18)\n    at ClientRequest.emit (events.js:400:28)\n    at TLSSocket.socketOnEnd (_http_client.js:499:9)\n    at TLSSocket.emit (events.js:412:35)\n    at endReadableNT (internal/streams/readable.js:1333:12)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)"}

To Reproduce Steps to reproduce the behavior:

  1. Not replicable, very minor

Expected behavior: According to https://stackoverflow.com/questions/16995184/nodejs-what-does-socket-hang-up-actually-mean, this seems to mean the server did not respond in time. However, we already retry and still see this error.

Screenshots: We use the built-in retry options and retry 3 times at intervals of 0.1, 0.2, and 0.4 seconds. I think the default timeout is 30 seconds, which we did not change. Some sample code: we connect to the blob using ami authentication and specify only the retry options; everything else is left at the default.

image

We then read the blob with the following code:

image

The server could be responding slowly, but the default timeout is 30 seconds.

We would like to know whether we did anything wrong.


jake-wickstrom commented 1 year ago

I am encountering the same issue on version 12.10.0

EmmaZhu commented 1 year ago

Hi @zhangxin511 ,

The current retry logic only retries on 500 or 503 errors. There is no logic for retrying on timeout.

It seems this could be improved.

zhangxin511 commented 1 year ago

Thank you @EmmaZhu

Could you list some non-retriable error contracts and examples of each? We will implement some client-side retry logic.

EmmaZhu commented 1 year ago

Hi @zhangxin511 ,

We have retries on:

Network errors:

  const retriableErrors = [
    "ETIMEDOUT",
    "ESOCKETTIMEDOUT",
    "ECONNREFUSED",
    "ECONNRESET",
    "ENOENT",
    "ENOTFOUND",
    "TIMEOUT",
    "EPIPE",
    "REQUEST_SEND_ERROR", // For default xhr based http client provided in ms-rest-js
  ];

Server busy or throttling errors: (statusCode === 503 || statusCode === 500)

Partial XML response body: (err?.code === "PARSE_ERROR" && err?.message.startsWith(`Error "Error: Unclosed root tag`))

Not found on secondary: if (!isPrimaryRetry && statusCode === 404)
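Taken together, the four conditions above could be expressed as a single predicate. The following is only an illustrative sketch and not the SDK's actual source: the `FailureInfo` type and the `isRetriable` function are hypothetical names introduced here, while the error codes and status checks are copied from the list above.

```typescript
// Hypothetical sketch: FailureInfo and isRetriable are illustrative names,
// not APIs exported by @azure/storage-blob.
const RETRIABLE_CODES: readonly string[] = [
  "ETIMEDOUT",
  "ESOCKETTIMEDOUT",
  "ECONNREFUSED",
  "ECONNRESET",
  "ENOENT",
  "ENOTFOUND",
  "TIMEOUT",
  "EPIPE",
  "REQUEST_SEND_ERROR",
];

interface FailureInfo {
  code?: string; // error code such as "ECONNRESET" or "PARSE_ERROR"
  message?: string; // error message, if any
  statusCode?: number; // HTTP status code, if a response was received
  isPrimaryRetry: boolean; // true when the attempt targeted the primary endpoint
}

function isRetriable(f: FailureInfo): boolean {
  // 1. Network errors.
  if (f.code !== undefined && RETRIABLE_CODES.indexOf(f.code) !== -1) {
    return true;
  }
  // 2. Server busy or throttling errors.
  if (f.statusCode === 503 || f.statusCode === 500) {
    return true;
  }
  // 3. Partial XML response body.
  const message = f.message !== undefined ? f.message : "";
  if (f.code === "PARSE_ERROR" && message.indexOf('Error "Error: Unclosed root tag') === 0) {
    return true;
  }
  // 4. Not found on the secondary endpoint.
  if (!f.isPrimaryRetry && f.statusCode === 404) {
    return true;
  }
  return false;
}
```

Under this sketch, a failure with code "ECONNRESET" is classified as retriable, which matches the list above; whether the SDK actually gets a chance to apply that check depends on where in the request lifecycle the error is raised.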

If you see the "socket hang up" error again, could you share the detailed error instance with us? We'll try to add retry logic for it. Thanks a lot.

zhangxin511 commented 1 year ago

Thank you @EmmaZhu. Are these retries already performed on the SDK side? I ask because ECONNRESET is already in the retriableErrors list you shared. As requested, the detailed error is:

{
     "message"=>"request to https://frsprodwestus2001.blob.core.windows.net/57c8dc81-6b68-4dab-8a58-01323b5c35ed?restype=container failed, reason: socket hang up", 
     "type"=>"system", 
     "errno"=>"ECONNRESET", 
     "code"=>"ECONNRESET",
     "name"=>"FetchError",
     "stack"=>"FetchError: request to https://frsprodwestus2001.blob.core.windows.net/57c8dc81-6b68-4dab-8a58-01323b5c35ed?restype=container failed, reason: socket hang up\n    at ClientRequest.<anonymous> (/usr/src/server/bundle/www.js:73174:18)\n    at ClientRequest.emit (events.js:400:28)\n    at TLSSocket.socketOnEnd (_http_client.js:499:9)\n    at TLSSocket.emit (events.js:412:35)\n    at endReadableNT (internal/streams/readable.js:1333:12)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)"
}

So this leads back to my initial question: the error code is ECONNRESET, yet even with the retry options set to 3 retries we still get the same error?

EmmaZhu commented 1 year ago

Hi @zhangxin511 ,

From the stack, it seems the failure happens when reading from a readable stream, which I guess happened while reading the response body of a download operation.

Our retry logic can only retry the request itself. Reading the body is not part of the download() function's code, so for now we have no retry for errors that occur while reading the response body.

Thanks, Emma

axdotl commented 11 months ago

Hey there! Is there anything we could do to mitigate the issue? Are there any recommendations on how to deal with socket hang up?

Should we write custom retry logic?
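One possible shape for such client-side retry logic is sketched below. It is not an official recommendation from the SDK team. It assumes, based on the explanation earlier in the thread, that the SDK's own retry covers only the request and not the reading of the response body, so the caller wraps the whole operation (request plus body read) and retries it with exponential backoff. The names `withRetry`, `RetryOptions`, `maxTries`, `baseDelayMs`, and `sleep` are hypothetical and are not part of @azure/storage-blob.

```typescript
// Hypothetical helper; not part of @azure/storage-blob.
interface RetryOptions {
  maxTries: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the first retry; doubles on each retry
  isRetriable: (err: unknown) => boolean; // decides whether an error is worth retrying
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

async function withRetry<T>(operation: () => Promise<T>, opts: RetryOptions): Promise<T> {
  if (opts.maxTries < 1) {
    throw new Error("maxTries must be at least 1");
  }
  const sleep = opts.sleep !== undefined ? opts.sleep : defaultSleep;
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxTries; attempt++) {
    try {
      // The operation should include reading the response body,
      // so that failures while streaming are retried as well.
      return await operation();
    } catch (err) {
      lastError = err;
      const isLastAttempt = attempt === opts.maxTries - 1;
      if (isLastAttempt || !opts.isRetriable(err)) {
        throw err;
      }
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await sleep(opts.baseDelayMs * Math.pow(2, attempt));
    }
  }
  throw lastError; // not reached; satisfies the compiler's return analysis
}
```

With `baseDelayMs` set to 100, the waits between attempts are 100 ms, 200 ms, and 400 ms, mirroring the intervals described in the original report. The closure passed as `operation` would call `download()` and then fully consume the returned body stream before resolving, and `isRetriable` could check for `code === "ECONNRESET"`. Retrying the whole operation re-downloads data from the start, so this fits best where blobs are small or the read is idempotent.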