cloudant / nodejs-cloudant

Cloudant Node.js client library
Apache License 2.0
255 stars 90 forks source link

Can not get response from @cloudant/cloudant #448

Closed bjzhangl closed 3 years ago

bjzhangl commented 3 years ago

We use @cloudant/cloudant 4.2.2. Recently on our production env, the nodejs server can not get response from @cloudant/cloudant which led the server hangs there. We checked the cloudant server log and confirmed that the cloudant server has returned response to @cloudant/cloudant. When we restart the nodejs server, this problem get away. Since we does not open debug for @cloudant/cloudant, there is no log for debugging this issue. Do you have any guess and advice to provide to help address this issue? Thanks a lot.

ricellis commented 3 years ago

Please try to answer the questions from the issue template:

We checked the cloudant server log and confirmed that the cloudant server has returned response to @cloudant/cloudant.

Have Cloudant support investigated the logs or are you talking about e.g. an AT log event?

there is no log for debugging this issue

It would be a lot easier if you could get that.

Do you have any guess and advice to provide to help address this issue?

I might be able to help more if you are able to answer the questions. In any case I'd recommend:

bjzhangl commented 3 years ago

@ricellis These are the answers to your questions. If any other information needed, please ask me.

  1. nodejs version is 12.21.0.
  2. The function we calling is db.get of @cloudant/cloudant. We can not reproduce this problem. We run nodejs server about several months and this problem happen. After we restart our nodejs server, it disappeared.
  3. Cloudant team help provide the following logs. Can the log show that cloudant server finished the request and give a response to @cloudant/cloudant?
    Apr 24 13:48:18 lb2.bm-cc-us-south-20 haproxy[26873]: 149.81.104.8:35862 [24/Apr/2021:13:48:18.299] http_proxy~ bm-cc-us-south-20/db2 126/0/0/0/2/6/0/126/134/8 200 854 - - ---- 216/216/28/2/0 0/0 {nodejs-cloudant/4.2.2 (Node.js v12.21.0)|10da0449-318b-493d-9289-405e44eca5c4-bluemix||TLSv1.3||TLS_AES_256_GCM_SHA384|lookup} {a3fb2c0198|||lookup|||} "GET /dbaas_config/global HTTP/1.1" acct:"10da0449-318b-493d-9289-405e44eca5c4-bluemix"
    Apr 24 13:48:18 lb2.bm-cc-us-south-20 haproxy[26872]: 161.156.109.8:47880 [24/Apr/2021:13:48:18.251] http_proxy~ bm-cc-us-south-20/db5 127/0/0/0/2/3/0/127/132/5 401 441 - - ---- 185/185/27/0/0 0/0 {nodejs-cloudant/4.2.2 (Node.js v12.21.0)|10da0449-318b-493d-9289-405e44eca5c4-bluemix||TLSv1.3||TLS_AES_256_GCM_SHA384|lookup} {2ecf362e7a|||lookup|||} "GET /dbaas_config/global HTTP/1.1" acct:"10da0449-318b-493d-9289-405e44eca5c4-bluemix"
  4. Unluckily, we did not open debug for @cloudant/cloudant. So there is no any log generated by @cloudant/cloudant.
ricellis commented 3 years ago

Cloudant team help provide the following logs. Can the log show that cloudant server finished the request and give a response to @cloudant/cloudant?

Yes, it does. My read of those is that you have at least 2 client instances from 2 different IPs. The second one has received a 401 - the client was unauthorized. This will have called back with an error, maybe this impacted your server process?

WRT the cause of the 401 - it is likely that the application failed to successfully exchange its IAM API key for a token with the identity service. This could happen for a number of reasons:

  1. the IAM service was down (unlikely in this case; I don't see any outage reported at that time)
  2. the IAM service rejected the request for some other reason (e.g. exceeding rate limit)
  3. the API key was no longer valid

IAM tokens are typically valid for 1 hour and nodejs-cloudant tries to renew them at 50% of the lifetime. When renewal fails nodejs-cloudant will repeatedly keep trying to renew the token, but if it is unable to successfully do so before the token expires the server will start rejecting requests with 401 responses.

bjzhangl commented 3 years ago

@ricellis Thanks!

  1. The response with 401 is the first response from server and 200 is the next response from server. We only call db.get once at that time. My thought was that cloudant client found 401 returned, then it would refresh iam token and retry the request again. From your reply, seems not?
  2. I checked the nodejs-cloudant code and find the renew interval for iam token is configured in cfg, or it will be 1 hour. Should I configure that when I new a cloudant object?
  3. I notice that there is another cloudant client "cloudant-node-sdk". I found "cloudant-node-sdk" is still a pre-release version. Which one is recommended to use? Thanks!
ricellis commented 3 years ago

The response with 401 is the first response from server and 200 is the next response from server. We only call db.get once at that time. My thought was that cloudant client found 401 returned, then it would refresh iam token and retry the request again. From your reply, seems not?

No, it should already have been trying to renew the token for 30 min and was unable to in that time so it has given up and failed. I can't see the IAM side of things so I don't know why it failed - the error/reason on the 401 (or log messages) in nodejs-cloudant would have given more detail. AFAICT the second request (that succeeded) is on separate renewal cycle (i.e. it is a second different instance of nodejs-cloudant). I can't tell you how your node server works, e.g. whether you have any logic balancing or retrying requests between those 2 instances or whether the IAM configuration was different between them, but it seems to me that 1 instance had good creds and 1 had creds that expired.

I checked the nodejs-cloudant code and find the renew interval for iam token is configured in cfg, or it will be 1 hour. Should I configure that when I new a cloudant object?

No, you should not need to change this configuration, it is 50% of the token lifetime by default, so it attempts to renew starting every 30 minutes and if it fails it continues to retry at regular intervals (I think every minute).

Which one is recommended to use?

cloudant-node-sdk is the recommended - it is where our future development effort will be focused. This library, nodejs-cloudant, is maintenance only.

bjzhangl commented 3 years ago

@ricellis thanks for your reply! My concern is that cloudant-node-sdk is still pre-release. And there is no a normal release version. Could I use this pre-release version in our production env? Why it is still pre-release? Does it mean the function is not completed or it may have some major change later?

ricellis commented 3 years ago

I've seen no evidence to suggest that the behaviour reported in this issue would have been any different with cloudant-node-sdk; AFAICT the failure that you saw is not a nodejs-cloudant issue. I just want to be clear that moving from nodejs-cloudant to cloudant-node-sdk is a separate topic from the problem reported in this issue.

Could I use this pre-release version in our production env?

The library is supported. Whether it is suitable for your production environment can only be determined based on the needs of your application and testing. The migration guide outlines differences that might help you identify if the features you need are available.

Why it is still pre-release? Does it mean the function is not completed or it may have some major change later?

The core API functions are complete. The functional core of the library is https://github.com/IBM/node-sdk-core/ which is a stable 2.10+ release. It is a pre-release because we have not complete our testing program and there may be some further API changes. I don't expect a significant amount of API churn - the majority of breaking API changes we've made so far during 0.x have been either renames or removing options that either didn't function or were misplaced, so relatively low impact things to resolve for consumers of the library. Obviously there are measures you can take, like pinning to a specific pre-release version, to prevent those changes having an impact on your application until you are able to make adjustments to accommodate them.

bjzhangl commented 3 years ago

@ricellis Got it! My new finding for this issue is that after cloudant server returned 401, the nodejs server hangs at accessing cloudant. And we can not find any request and response log from cloudant server. But we can get response from 'https://iam.cloud.ibm.com/oidc/keys'. We are now trying to get some logs from iam. Do you have any idea on this? I am sure that our api key is valid.

ricellis commented 3 years ago

The API keys used by nodejs-cloudant are exchanged for tokens at iam.cloud.ibm.com/identity/token by default, I do not know about oidc/keys.

Do you have any idea on this? I am sure that our api key is valid.

Sorry no, as I said before it looks to me like there are two instances of nodejs-cloudant, one does indeed appear to have valid creds (as there are successful requests after the token timeout) and one does not (stops after the 401).

bjzhangl commented 3 years ago

@ricellis we found the same issue in another cloudant instance which is affected by dos attack on May 24. At first we can receive response from cloudant after about more than 30min or 1 hour when our requests were sent out. And then we can not receive anything. And the api hangs at getting data from cloudant. This is our setting for cloudantdb. We have only one cloudant instance on each server.

ev.db = new Cloudant({
        url: ev.dbUrl,
        maxAttempt: 5,
        plugins: [
            {
                iamauth: {
                    iamTokenUrl: iamTokenUrl,
                    iamApiKey: apikey
                }
            },
            {
                'retry': {
                    retryErrors: true,
                    retryStatusCodes: [429, 500, 501, 502, 503, 504],
                    retryInitialDelayMsecs: 1000,
                }
            },
        ]
    });

This is how we get data from cloudant.

db.get(id, function(err, result) {
        if(err) {
            callback(cloudantError(err));
            return;
        }
        callback(null, result);
    });

My question is that does nodejs-cloudant return an error when a request is sent out and does not receive a response exceed a default time setting? Or we need to set this when new a cloudant instance? This is one cloudant log which shows haproxy accept request at 13:17:36.700 but receive all request message at about 13:58:06.

May 25 13:58:06 lb1.bm-cc-us-east-02 lb1.bm-cc-us-east-02 haproxy[29101]: 52.116.29.19:36306 [25/May/2021:13:17:36.700] http_proxy~ bm-cc-us-east-02/<NOSRV> 122176/20664/2429898/-1/-1/-1/-1/2572738/2572738/2429898 -1 0 - - cR-- 8772/8772/1733/0/4 0/0 {nodejs-cloudant/4.2.2 (Node.js v12.21.0)|3cfbe857-533e-4160-9cd4-f9908c38441d-bluemix||TLSv1.3||TLS_AES_256_GCM_SHA384|lookup} {||||||} "GET /dbaas_cluster/29c11df7-d68a-4f06-9777-27785e55a25d HTTP/1.1" acct:"3cfbe857-533e-4160-9cd4-f9908c38441d-bluemix"
ricellis commented 3 years ago

My question is that does nodejs-cloudant return an error when a request is sent out and does not receive a response exceed a default time setting? Or we need to set this when new a cloudant instance?

I thought there was a default timeout configured somewhere, but I can't see it - so I guess there is no read timeout applied by default. I believe an underlying OS connect timeout (usually 20s) does apply and will return an error if nodejs-cloudant could not connect to the server. You can configure a timeout property in the requestDefaults.

Note that the timeout errors will be suppressed by your retry configuration until after all the attempts. So if you do configure a timeout of say 30s and , in your case it will take nearly 3 minutes to actually return a timeout error because of the 5 attempts, plus the doubling back-off.

This is one cloudant log which shows haproxy accept request at 13:17:36.700 but receive all request message at about 13:58:06.

The cR termination state from that log entry indicates the server was still attempting to read the request from the client when it terminated. As such I'm unsure whether a read timeout would have actually helped in this case since the client was still sending. The server disconnect will cause an error to be returned in nodejs-cloudant, but if all 5 attempts of your request took around 40 min to get disconnected then you wouldn't see it for ~200 min. Normally the server would disconnect if the request was not fully received from the client within 2.5 minutes, but obviously in this case the server was not behaving typically.