aws / aws-sdk-js-v3

Modularized AWS SDK for JavaScript.
Apache License 2.0

SNS client hanging indefinitely sending PublishCommand #6025

Closed alesk20 closed 4 months ago

alesk20 commented 5 months ago

Describe the bug

Hello, I have a problem with the SNS client that I can't solve. I have a server that receives a large number of messages from an SQS queue (using the SQS client), performs some internal operations, and then sends a notification with a JSON message body to an SNS topic, using the SDK method "sns.send" with an instance of the PublishCommand class as its argument.

After the server has been running for some hours, depending on the amount of data flowing through the SQS consumer, the "sns.send" method begins to hang indefinitely and never responds, and the notification is not published. I implemented a timeout of 180 seconds to stop the current execution and retry publishing to the SNS topic; sometimes it works on the 2nd retry, sometimes on the 3rd, and so on.

The problem is that as more messages come through the SQS queue, more and more of them hit the same problem, until my server is completely blocked and needs to be restarted. After the restart, the messages are successfully processed and notifications are correctly published to the topic.

I have this problem only with aws-sdk v3; with aws-sdk v2 I never had it, and the operations and logic of my server have remained the same. I tried different versions of @aws-sdk/client-sns, including the latest one, and the problem always occurs.

SDK version number

@aws-sdk/client-sns, @aws-sdk/sqs-consumer

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

Node.js 18

Reproduction Steps

const sns = new SNS({apiVersion: "2010-03-31", endpoint: options.endpointUrl});
const publishCommand = new PublishCommand({ ...MessageData, TopicArn: topic });
await sns.send(publishCommand);

Observed Behavior

The command "await sns.send(publishCommand)" hangs indefinitely.

Expected Behavior

The "sns.send" command should respond immediately, or at least within a reasonable time.

Possible Solution

No response

Additional Information/Context

No response

RanVaknin commented 5 months ago

Hi @alesk20,

Thanks for reaching out. The behavior is indeed odd. Since the return value from the await call to .send() is hanging, it might be because the server did not close the connection and the SDK is still awaiting a response.

Without seeing more detailed logs it would be very difficult to diagnose. This could be due to different httphandler defaults with regard to connection management that you might need to change.

For example, in the v2 SDK the default timeout was 60 seconds; in v3 we use the default provided by Node's http client, which is 0:

requestTimeout: The number of milliseconds a request can take before automatically being terminated. Defaults to 0, which disables the timeout.

My guess is that this issue where the server hangs is also happening on v2, but the default timeout of the older version hides it. You might want to dial down the timeout to be more aggressive, perhaps to 60 seconds to align it with v2's behavior, and see if this solves your issue.

Thanks, Ran~
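
For reference, a minimal sketch of how a request timeout could be set on the v3 SNS client to approximate the v2 behavior discussed above. It assumes the `NodeHttpHandler` export from `@smithy/node-http-handler` (older v3 releases shipped it as `@aws-sdk/node-http-handler`). The 60-second value comes from the suggestion above; the 5-second `connectionTimeout` and the `publish` helper are illustrative assumptions, not part of the original report.

```javascript
const { SNSClient, PublishCommand } = require("@aws-sdk/client-sns");
const { NodeHttpHandler } = require("@smithy/node-http-handler");

// Client with explicit timeouts instead of the default of 0 (no timeout).
const sns = new SNSClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 5000, // assumed value: max ms to establish the connection
    requestTimeout: 60000, // max ms to wait for the response before terminating
  }),
});

// Hypothetical helper: publishes one message and lets timeout errors propagate.
async function publish(topicArn, message) {
  return sns.send(new PublishCommand({ TopicArn: topicArn, Message: message }));
}

module.exports = { sns, publish };
```

When the timeout fires, the pending `send` call rejects, so the caller still has to catch the error and decide whether to retry.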

alesk20 commented 5 months ago

Hi @RanVaknin,

thank you for the response. I'll try setting the timeout explicitly to 60 seconds, but it's still strange that all the messages get published with the V2 SDK, whereas with the V3 SDK they don't get published when the SNS client hangs. Shouldn't the messages handled with the V2 SDK also fail to be published if they reach the default 60-second timeout? What I observe is that I don't lose any messages with the V2 SDK, but with the V3 SDK I lose them when the SNS client hangs and I forcibly throw a timeout.

Thanks

alesk20 commented 5 months ago

Hi @RanVaknin,

I want to add another question after reading your response: in the V2 SDK, what happens when the default requestTimeout is reached? Is an error thrown, or is the promise just resolved?

The 180-second timeout I mentioned in my first message was not set on client-sns, but as an external timeout to drop the process and retry. So, given what you said, I think that in my current implementation the connection to the SNS topic still hangs even if I drop the process.

It still doesn't explain why the V3 SDK has these slowdowns when publishing messages to the SNS topic, while the V2 SDK delivers them immediately, even under heavy load, without missing any delivery.

Thanks
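
To illustrate the point about the external timeout: a timer that only abandons the awaited promise leaves the underlying request (and its socket) alive, whereas an abort signal lets the request itself be cancelled. The sketch below is a hedged illustration using only Node.js built-ins; `withAbortTimeout` and `hangingRequest` are hypothetical names, and `hangingRequest` merely stands in for a call that never answers.

```javascript
// Hypothetical helper: runs operation(signal) and aborts it after `ms` milliseconds.
async function withAbortTimeout(operation, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await operation(controller.signal);
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after completion
  }
}

// Stand-in for a request that never completes unless it is aborted.
function hangingRequest(signal) {
  return new Promise((_resolve, reject) => {
    signal.addEventListener(
      "abort",
      () => {
        const err = new Error("Request aborted");
        err.name = "AbortError";
        reject(err);
      },
      { once: true }
    );
  });
}
```

With the v3 SDK, the operation passed in could be `(abortSignal) => sns.send(publishCommand, { abortSignal })`, since `send` accepts an `abortSignal` option as its second argument.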

RanVaknin commented 5 months ago

Hi @alesk20, requestTimeout means that the connection will be terminated from the client side. It does not mean a retry.

Shouldn't the messages handled with the V2 SDK also fail to be published if they reach the default 60-second timeout?

Not necessarily. The server might receive and process your request but not respond with a status informing the client whether the message was processed.

It's hard to say why you are only experiencing this with v3. It might be because of differences in connection management, or something you did differently in your code. Without seeing an end-to-end example it will be very difficult to root cause this.

Can you set up a minimal github repository that can reliably (intermittently reliably is also ok) reproduce this behavior? Ideally this reproduction would have the working v2, and the non working v3 code so we can compare these as well.

Thanks, Ran~

alesk20 commented 5 months ago

Hi @RanVaknin,

unfortunately it's very difficult to replicate this case: it only happens after 1-2 hours and only in the production environment, where I have a lot of traffic on the SQS queue. I also tried to replicate it in a test environment myself, but couldn't manage to do it.

As I said in the first message, I didn't change anything in the code. I just migrated from the V2 SDK to the V3 SDK and upgraded from Node.js 16 to Node.js 18; these are the only two things I changed. I don't think the problem is the Node.js 18 version.

Can you tell me what happens in the V2 SDK when the default requestTimeout is reached? Does the promise get resolved, or is an error thrown?

Thanks.

RanVaknin commented 5 months ago

Hi @alesk20,

Can you tell me what happens in the V2 SDK when the default requestTimeout is reached? Does the promise get resolved, or is an error thrown?

When the v2 requestTimeout (named timeout in v2) is reached, the client kills the connection and an error is thrown, as shown here: https://github.com/aws/aws-sdk-js/blob/36e3f6d5c27adf522b7517f095f060f4581d9b03/lib/http/node.js#L86. You might be handling it in v2 and not doing so in v3?

As I said in the first message, I didn't change anything in the code. I just migrated from the V2 SDK to the V3 SDK and upgraded from Node.js 16 to Node.js 18; these are the only two things I changed. I don't think the problem is the Node.js 18 version.

I understand your concern, however I cannot point to a single place in the SDK and say "this is why your code is not working like it did in v2". There are about 8 years of development between when v2 was first introduced and when v3 was released; the architecture of the two is very different and evolved with the JS language itself and the ecosystem's best practices.

I tried to strip down all of the http configurations used by the v2 SDK and found that the only http option we explicitly override is indeed timeout. However, I was wrong initially: we actually set it to 120000 ms (2 min) by default:

console.log(sns.config.httpOptions)
// prints: { timeout: 120000 }

I don't think it will be helpful for us to keep comparing the two; instead we should focus on how to help with your current setup.

Are you running your application from something like a Docker container? I'm asking because Docker has decent support for tcpdump, which allows you to inspect TCP-level networking events. You could use that, or any other network diagnostic tool, to find what closes those connections.

I understand that your current repro code does not raise the reported behavior, but can you please share it anyway? Right now we are doing a lot of theorizing which is not helpful. By you sharing your code we can better visualize the architecture and do a simple visual check of certain things you might be missing to get this to work correctly (this is not to suggest that your code is wrong). If you have the v2 code handy, feel free to share that too.

Thanks again for your cooperation.

All the best, Ran~

github-actions[bot] commented 4 months ago

This issue has not received a response in 1 week. If you still think there is a problem, please leave a comment to avoid the issue from automatically closing.

alesk20 commented 4 months ago

Hi @RanVaknin, after further investigation it seems that the problem lies with Node 18, which causes hanging HTTP requests in other contexts too, not only with the AWS SDK. I will investigate more and try to release my project on Node 20, which does not seem to have these hanging problems.

alesk20 commented 4 months ago

Hi @RanVaknin, I think I found the problem, and it's not the Node.js version. The problem is with the S3 client from "@aws-sdk/client-s3": I managed to replicate the issue and I see that the SDK never closes the sockets opened for S3 requests, which eventually leads to a bottleneck in the server's socket pool. I think I solved the problem by forcing "requestTimeout" on the S3 client:

const s3 = new S3({
  ...options.s3,
  requestHandler: new NodeHttpHandler({
    httpAgent: new Agent({ keepAlive: true, keepAliveMsecs: 1000 }),
    requestTimeout: 5000
  })
});

By doing this, I see that the S3 sockets are closed after 5 seconds and no connection hangs. Isn't this an SDK bug? With aws-sdk v2, the connections to S3 were closed automatically after the response.

Kind regards.
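
One way to confirm a socket-pool bottleneck like the one described above is to sample the counters that Node's `http.Agent` exposes. The following is a rough sketch using only Node.js built-ins; `agentStats` is a hypothetical helper name, and it assumes you hold a reference to the same agent instance that is passed to the request handler.

```javascript
const http = require("node:http");

// Hypothetical helper: summarizes an Agent's per-host socket and request lists.
function agentStats(agent) {
  // Each property is an object keyed by host, holding arrays of sockets or requests.
  const count = (byHost) =>
    Object.values(byHost).reduce((total, list) => total + list.length, 0);
  return {
    active: count(agent.sockets), // sockets currently in use
    idle: count(agent.freeSockets), // kept-alive sockets waiting for reuse
    queued: count(agent.requests), // requests waiting for a free socket
  };
}

module.exports = { agentStats };
```

Logging these numbers periodically can show whether the `active` count keeps climbing while `queued` grows, which would be consistent with sockets that are never released. Queued requests only appear when the agent has a finite `maxSockets`.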

RanVaknin commented 4 months ago

Hi @alesk20 ,

I don't know which S3 operation you are using since it was not mentioned in the original issue description, but if I had to guess, it's the response from getObject. In v3 it returns a stream, and in Node.js if you don't consume a stream the underlying connection might stay open.

This is covered here:

Because keepAlive is defaulted to true, if you acquire a streaming response, such as S3::getObject's Body field, you must read the stream to completion in order for the socket to close naturally.

Thanks, Ran~
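
As an illustration of reading a stream to completion, the sketch below drains a Node.js readable stream whose contents are not needed, so that it can end normally. It uses only Node.js built-ins; `discardBody` is a hypothetical helper name, and it assumes the body is a Node.js `Readable`, which is the case for `getObject` responses in a Node.js runtime.

```javascript
const { finished } = require("node:stream/promises");

// Hypothetical helper: reads a stream to the end and throws away the data.
async function discardBody(body) {
  if (!body || typeof body.resume !== "function") return; // nothing to drain
  const done = finished(body); // settles on end, close, or error
  body.resume(); // switch to flowing mode; chunks are dropped
  await done;
}

module.exports = { discardBody };
```

When the contents are needed, reading them in full (for example with `Body.transformToByteArray()`) has the same effect of consuming the stream.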

alesk20 commented 4 months ago

Hi @RanVaknin, yes, I upload and retrieve various objects to/from S3. When I use the getObject operation I always consume the body like this:

const s3ObjectBody = await s3Object.Body.transformToByteArray();

Am I missing something?

Thanks.

EDIT: There was actually one place in the code where I was not consuming the Body stream. I fixed that, and I'll let you know if the problem remains, but from my tests it seems to fix the issue, even after removing the "requestTimeout" I had added as a workaround.

Thank you again.

Kind regards.

github-actions[bot] commented 4 months ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.