StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

After enabling Authentication and SSL, there is a high chance of getting "MOVED" errors #1266

Closed shimingwu closed 4 years ago

shimingwu commented 4 years ago

Hi. We are using StackExchange.Redis (v2.0.519) with AWS ElastiCache Redis for our product. Our Redis cluster has 3 shards, and each shard has 3 nodes (1 master + 2 slaves). Our GETs are issued with CommandFlags.PreferSlave, and we did not set any other command flags for other operations.
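For reference, a read issued with that flag looks roughly like this (a sketch; the multiplexer, database handle, and key name are illustrative):

// Illustrative read using CommandFlags.PreferSlave; "some-key" is a placeholder.
IDatabase db = muxer.GetDatabase();
RedisValue value = db.StringGet("some-key", CommandFlags.PreferSlave);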

We had been using this version for a very long time, and only recently did we turn on Authentication with Encryption as below (all other settings are unchanged):

// requires: using System.Security.Authentication; (for SslProtocols)
options.Password = password;
options.Ssl = true;
options.SslProtocols = SslProtocols.Tls12;
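For context, these options feed a normal connect call; the cluster endpoint below is a placeholder:

// Sketch of the surrounding connect; the configuration endpoint is a placeholder.
options.EndPoints.Add("mycluster.xxxxxx.clustercfg.use1.cache.amazonaws.com:6379");
var muxer = ConnectionMultiplexer.Connect(options);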

Additionally, we replaced our clusters with ones using the Authentication and Encryption provided by AWS.

However, after we started using this setup, every few days we receive the error raised here: https://github.com/StackExchange/StackExchange.Redis/blob/8672ca23b453921cb26c821375bef38d87e97b63/src/StackExchange.Redis/ResultProcessor.cs#L215

Based on my understanding, when the Redis client receives a "MOVED" message like:

MOVED 9189 127.0.0.1:30004

It will redirect the request to the target node (127.0.0.1:30004) with CommandFlags.NoRedirect set, and if it then receives the "MOVED" message a second time, this error message is shown.
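For illustration, here is a rough sketch of that redirect-once pattern as an application might reproduce it; this is not the library's internal code, and it assumes a MOVED reply surfaces as a RedisServerException once CommandFlags.NoRedirect is set:

// Sketch only, NOT StackExchange.Redis internals; assumes a MOVED reply
// surfaces as RedisServerException when redirects are disabled.
static RedisValue GetFollowingOneMoved(IDatabase db, RedisKey key)
{
    try
    {
        return db.StringGet(key, CommandFlags.NoRedirect);
    }
    catch (RedisServerException ex) when (ex.Message.StartsWith("MOVED"))
    {
        // Reply format: "MOVED <hashslot> <host:port>", e.g. "MOVED 9189 127.0.0.1:30004".
        string target = ex.Message.Split(' ')[2];
        Console.WriteLine($"slot owner reported as {target}; retrying once");
        // A second MOVED on this retry is reported as the error linked above,
        // not followed again.
        return db.StringGet(key, CommandFlags.NoRedirect);
    }
}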

I am very confused about:

  1. Why does this only happen when Authentication + Encryption are turned on? Before we turned them on, the same flow worked properly.
  2. Under what conditions should we expect the "MOVED" message twice?
  3. How can we avoid this error? Are there any other settings we should apply to our Redis cluster?

Additionally, from our logs, it seems like data replication might be related: we observed that when data replication happens, the number of MOVED errors increases. Do you have any ideas about this?

shimingwu commented 4 years ago

Also, we opened a case to seek support from AWS. One unknown raised there is:

> I tried to check the client's online documentation for SSL support. SSL support exists, but I did not come across any documentation that clearly states it supports SSL+Auth for clustered Redis.

Could this be a key issue, given that we are using AUTH+SSL with clustered Redis?

shimingwu commented 4 years ago

One more question. We asked AWS:

  1. Is the endpoint randomly chosen? Should we expect that the node will return a master node endpoint with the MOVED command? Their answer: "Yes. During a MOVED error it always returns the master endpoint of the shard to which the hashslot/keyslot belongs."

Can I ask: if the "MOVED" points to a master node, and we are using CommandFlags.PreferSlave for the GET operation, will the client send the request to the slave node and thereby trigger a second redirect?

From the code below, am I right to understand that it will redirect to a slave? https://github.com/StackExchange/StackExchange.Redis/blob/efb98b8c54c7b9b33d43414cf314b33ef1203513/src/StackExchange.Redis/ServerSelectionStrategy.cs#L150

Is this the case that produces so many "MOVED" errors? But if this is the reason, it is hard to explain why we did not see this type of error before enabling AUTH+SSL.
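As background on the hash-slot routing being discussed, here is a sketch of the slot computation from the Redis Cluster specification (CRC16/XMODEM mod 16384; hash tags are ignored here for brevity):

// Hash slot per the Redis Cluster spec: HASH_SLOT = CRC16(key) mod 16384.
// CRC16 variant is CCITT/XMODEM (poly 0x1021, init 0x0000); hash tags omitted.
static int HashSlot(byte[] key)
{
    int crc = 0;
    foreach (byte b in key)
    {
        crc ^= b << 8;
        for (int i = 0; i < 8; i++)
            crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) & 0xFFFF : (crc << 1) & 0xFFFF;
    }
    return crc % 16384;
}

Whichever master owns that slot's shard is what the MOVED reply points at.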

mgravell commented 4 years ago

From the description, it sounds like there may be some kind of timing thing happening during the initial connect and node-list exchange. TLS and AUTH should work fine, but: I'll have to see whether I can repro a problem (and it sounds like there might indeed be one)

shimingwu commented 4 years ago

> From the description, it sounds like there may be some kind of timing thing happening during the initial connect and node-list exchange. TLS and AUTH should work fine, but: I'll have to see whether I can repro a problem (and it sounds like there might indeed be one)

Hi @mgravell Thank you so much! Please let me know if you need more information! We are still trying to investigate the root cause on our side.

Additionally, I personally have one theory:

The master node endpoint is returned in the MOVED message. If it is confirmed that the client will always try to read from the slave node endpoint, then during data replication (asynchronous replication does not guarantee we can read the data from replicas) the slave node might still not have the data and return MOVED again.

I am not sure whether this theory holds, but I will bring it up with the AWS support team and ask how likely this is once AUTH + SSL is turned on for clustered Redis. I am trying to figure out why this happens more during data replication (the diagram can be seen in the AWS console).

mgravell commented 4 years ago

the library should know whether a node is a primary or secondary, and it should include the right modifiers to query a secondary; however - when initially performing the handshake, it might not know either the shard-map or the server config yet; it sounds like there is some race scenario, but I will have to investigate to see exactly what.

Does this happen only initially? or does it remain glitchy afterwards?

shimingwu commented 4 years ago

> the library should know whether a node is a primary or secondary, and it should include the right modifiers to query a secondary; however - when initially performing the handshake, it might not know either the shard-map or the server config yet; it sounds like there is some race scenario, but I will have to investigate to see exactly what.
>
> Does this happen only initially? or does it remain glitchy afterwards?

It happens fairly often, but not always (and apparently not only initially). For certain periods the frequency is high and stays high for a few days; over the next few days the number of errors drops, if I remember correctly. I will provide the distribution of the errors later.

shimingwu commented 4 years ago

Hi @mgravell , I am sorry for the late reply. Here is the data:

For certain periods we see many GET/SET errors, and most of them (80%) are MOVED errors; the other 20% are timeouts.

[image: GET/SET error counts]

And here is our AWS console diagram showing ReplicationBytes: [image]

For the same time period, here is the number of MOVED errors (30-minute intervals): [image]

Hope this will help the investigation.

shimingwu commented 4 years ago

Hi @mgravell . Good morning! May I ask whether there are any findings regarding this? Also, may I ask why we assume MOVED will only happen once and set NoRedirect on the resent request?

shimingwu commented 4 years ago

Besides the MOVED error, there is one more issue with PING under Auth+SSL.

At one point one of our slave nodes was not functional, and many ping errors occurred after the slave node recovered. It seems that all of our instances started before the failure hit the "NOAUTH Authentication required" failure message, while all instances started after the failure did not have this issue.

Could it be that the server snapshot of the Redis shard is not updated, so the connection fails due to reuse of an old connection, since for the same connection we only need to authenticate once?

I want to know whether re-initializing the multiplexer helps to handle this NOAUTH ping issue after a failure, as a workaround. (It is also strange to see that only PING fails while writes/reads work...)
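For what it's worth, the re-init workaround asked about above would look roughly like this (a sketch; `options` stands for the same ConfigurationOptions shown earlier in the thread):

// Sketch of the re-init workaround; `options` is the same ConfigurationOptions as above.
oldMuxer.Dispose();                                      // drop all existing, possibly stale, connections
var freshMuxer = ConnectionMultiplexer.Connect(options); // fresh handshake, which re-runs AUTH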

mgravell commented 4 years ago

"Can i ask for the reason why we assume the MOVED will only happen once"

The only time we should see MOVED is when the client doesn't yet know something (perhaps something new, perhaps initially) about the shard topology. When a server replies with MOVED, it tells us where to find it, so ... it should be there. If the server is wrong about the shard topology, something is very very wrong with the cluster.

Now; I could understand maybe making this "allow 2 redirects, not just 1", but we certainly shouldn't allow unbounded.

Make sense?

shimingwu commented 4 years ago

Hi @mgravell , thank you very much. I think it makes sense. However, I am still unsure about the root cause. Can it be confirmed, in theory, that 2 redirects will be enough?

If yes, allowing 2 redirects is definitely a good idea.

If not, I would like to suggest exposing a parameter like "maxMovedTime" with a default value of 1. The reason I feel a parameter would be helpful is that, as users, we could configure it based on our needs. Additionally, if the root cause really is data replication delay, 2 redirects might not be enough, and we might still face the same issue we have now. If we assume the hash slot is correct and the redirect happens within a shard, would the maximum number of redirects be the number of nodes inside the shard minus 1?

Overall, allowing 2 redirects would help mitigate the problem, and it would buy us more time to find the root cause.
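To sketch what such a knob could look like if done at the application level (everything below is hypothetical, not a StackExchange.Redis API; the maxMoved parameter mirrors the proposed "maxMovedTime", and it assumes a surfaced MOVED is visible in the exception message):

// Hypothetical application-level retry budget; NOT a library API.
// Assumes a surfaced MOVED error is visible in the exception message.
static RedisValue GetWithMovedBudget(IDatabase db, RedisKey key, int maxMoved = 1)
{
    for (int attempt = 0; ; attempt++)
    {
        try
        {
            return db.StringGet(key, CommandFlags.PreferSlave);
        }
        catch (RedisServerException ex)
            when (ex.Message.Contains("MOVED") && attempt < maxMoved)
        {
            // Swallow the MOVED and retry; the multiplexer may have refreshed
            // its slot map in the meantime.
        }
    }
}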

mgravell commented 4 years ago

No, nothing is confirmed yet. And this isn't a data replication latency issue. This scenario should only happen unexpectedly during a shard change, which doesn't happen automatically or often, and when it does happen: it still shouldn't necessitate more than one redirect.


shimingwu commented 4 years ago

Hi @mgravell , thank you for the patient explanations. I think I had a wrong understanding before, as I never realized that a second redirect should only happen on a re-shard or re-configuration. As I learned from our DevOps team, no re-shard happened in our case. Additionally, as we can see from the logs, the error actually happens quite often and lasts for quite a long period.

Another thing I noticed: during one of our experiments in our QA environment, which has only 1 shard with 3 nodes, the MOVED errors also occurred for around 1 hour after I deleted a node and then added it back. I cannot explain this properly. I read the Redis Cluster specification and found one relevant passage:

> When the connection is in readonly mode, the cluster will send a redirection to the client only if the operation involves keys not served by the slave's master node.

Can I confirm with you: do we send the READONLY command to the slave nodes?

If we do not send it, is it possible to get two redirects as follows:

  1. We send to the wrong shard and receive one MOVED error.
  2. We send to the correct shard based on the MOVED message; however, due to PreferSlave, we send to the slave node and receive a MOVED error redirecting to the master.

All these assumptions rest on the client sending to the slave node instead of the master node after receiving the MOVED error. Can I confirm with you: do we really redirect the request to the slave node for MOVED + PreferSlave?

If we already set READONLY and send to the master node when MOVED is received, I cannot think of any other case where so many MOVED errors would happen. I may need to double-confirm with the AWS support team again...

NickCraver commented 4 years ago

Good news: this was found and fixed in #1367. Look out for the 2.1.x release soon, including this fix!

shimingwu commented 4 years ago

@NickCraver I am glad to hear this! Thanks!