cloudant / nodejs-cloudant

Cloudant Node.js client library
Apache License 2.0
255 stars 90 forks source link

nodejs-cloudant hangs when response from cloudant is slow #456

Closed bjzhangl closed 3 years ago

bjzhangl commented 3 years ago

We use nodejs-cloudant 4.2.2. And this can be reproduced on 4.4.0. Cloudant met DoS attack some days ago, and we found nodejs-cloudant hangs during this time in our production env. We also reproduced this by using following https server instead of cloudant server.

var app = https.createServer(options, function(req, res){
  console.log("=====================Helllo world.======================");
  let body = 'hello world'
  res.write(body);
//  res.end();
}).listen(443, "0.0.0.0");

In reproduce situation, our timeout is 30000ms. In production env, we do not set timeout. I debug the nodejs-cloudant in reproduce case with timeout. And found in https://github.com/cloudant/nodejs-cloudant/blob/263dddaa07f1c50d8cc1fe341646cfbea574d87c/lib/client.js#L230:

request.response
      .on('response', function(response) {
        request.response.pause();
        utils.runHooks('onResponse', request, response, function() {
          utils.processState(request, done); // process response hook results
        });
      });

The parameter of done callback in utils.processState is "new Error('No retry requested')" which causes async.forever exits. When timeout abort emit, nodejs-cloudant run the code section from the line "https://github.com/cloudant/nodejs-cloudant/blob/263dddaa07f1c50d8cc1fe341646cfbea574d87c/lib/clientutils.js#L158" to try to retry the request. Since async.forever has exit, nodejs-cloudant hangs there.

It seems "https://github.com/cloudant/nodejs-cloudant/blob/263dddaa07f1c50d8cc1fe341646cfbea574d87c/lib/clientutils.js#L167" and "https://github.com/cloudant/nodejs-cloudant/blob/263dddaa07f1c50d8cc1fe341646cfbea574d87c/lib/clientutils.js#L163" should be the best place to exit async.forever. Exit aync.forever in reponse hooks may cause problem when response is emit by request package but network problem happend.

Because my reproduce is not based on the real cloudant send response scenario, and i am not very familiar with request package, my analysis may be not right. But please check if there is some problem in nodejs-cloudant even if we add timeout in our code explicitly. We had met such hang issues in our production env 3 times in last and this month. And this also cause our service can not restore even if cloudant server restores. We must restart our service manually to make it work.

ricellis commented 3 years ago

Thanks for the detailed report, we'll look into it.

bjzhangl commented 3 years ago

@ricellis Thanks! I also made a simple fix by changing the following code in client.js

request.response
      .on('response', function(response) {
        request.response.pause();
        utils.runHooks('onResponse', request, response, function() {
          utils.processState(request, done); // process response hook results
        });
      });

to be

request.response
      .on('response', function(response) {
        request.response.pause();
        utils.runHooks('onResponse', request, response, function() {
          //utils.processState(request, done); // process response hook results
          utils.processState(request, function(stop) {
            return;
          });
        });
      });

and add the following line after https://github.com/cloudant/nodejs-cloudant/blob/263dddaa07f1c50d8cc1fe341646cfbea574d87c/lib/clientutils.js#L167

done(new Error('No retry requested'));

I am not sure if my fix is enough and reasonable enough. This issue is urgent for us. It's very appreciate if you can give a conclusion and provide a fix ASAP.

ricellis commented 3 years ago

Initially I wasn't able to reproduce this locally, but then I remembered from the previous issue that you use the retry plugin. I was able to reproduce it locally when using the retry plugin, but the timeout also must occur during the data phase of the response. If the timeout occurs before that (e.g. during the connect/headers) then the problem is not exposed and the retries execute as expected. I think this is part of the reason we weren't able to detect it before - standard use of the nock mock server used in the tests does not appear to be able to reproduce the issue. It has three options for mocking a timeout error: replyWithError and delayConnection both occur before the response event so don't expose this issue and while delayBody does seemingly happen in the correct phase, when using that option the nock library doesn't appear to surface timeouts from the overridden request framework.

bjzhangl commented 3 years ago

@ricellis yes. This is due to retry plugin. Please give a fix time if you have. Thanks. Does the new cloudant client can fix this issue?

ricellis commented 3 years ago

This is due to retry plugin.

Right. It only happens when all 3 of these conditions are satisfied:

  1. using a callback
  2. the retry plugin is enabled
  3. an error occurs after the response has started

It is possible to avoid the problem by using the "listener" style or removing the retry plugin.

Please give a fix time if you have

We're working on it, it's complicated by the mix of callback and listener styles. Your proposal works ok for stopping the hang you observe, but it appears to introduce a different one for some other flows.

Does the new cloudant client can fix this issue?

Certainly this specific problem won't exist in the new client because the retry plugin doesn't exist there. When retry functionality is added there (via the node-sdk-core) it will be a different implementation.

bjzhangl commented 3 years ago

@ricellis Thanks for your quick fix. When will you have a new release to include this fix? We use npm to install the package.

ricellis commented 3 years ago

It will be in the next release, we hope to do that soon, we're just working on one other change for inclusion in the next release.

ricellis commented 3 years ago

This is relased in https://github.com/cloudant/nodejs-cloudant/releases/tag/4.4.1