Alachisoft / NCache

NCache: Highly Scalable Distributed Cache for .NET
http://www.alachisoft.com
Apache License 2.0
647 stars 123 forks

NCache Client "No server is available to process the request" after connectivity outage #31

Closed juanmanuelrios closed 5 years ago

juanmanuelrios commented 5 years ago

Hi

During some connectivity tests the cache client becomes unusable, throwing exceptions on every attempt to make a request on the client instance. We tried to find a way to overcome this problem, but without success: once a client is disconnected from the network for a while, it never recovers.

Using one of the provided samples (BasicOperation, with minor changes), the following picture shows where the exceptions arise in the code.

[image: ncachenetworkerror]

Our aim is to use the Pub/Sub mechanism, but the tests were pared down to the basic samples in order to reduce complexity and speed up the connectivity tests. In any case, the Pub/Sub sample didn't work either.

Could someone please give us a hint on how to overcome temporary connectivity outages using the cache client? Do we need to dispose of the current client instance when connection exceptions are thrown? In the context of Pub/Sub, how can we do the same with the subscriber, given that this object doesn't provide a way to know when connectivity errors, or other exceptions, arise?

Many thanks.

Kal-Alachisoft commented 5 years ago

Hey Juan, thanks for the detailed question. We had our team try to reproduce your connectivity breakage scenario with NCache OpenSource 4.9 SP1 edition but we were unable to do so. Therefore, please share some more details regarding your test so I can have my team replicate your exact scenario at our end and see if we can reproduce the same result here.

  1. Please share the exact version and edition of NCache that you are using.
  2. Please also confirm whether you are using the .NET Core client or the .NET Framework client, and share the framework version details as well.
  3. Please specify how exactly you are disconnecting the network for your test. We used firewalls to restrict network access on our end, please let us know if we are missing something here.

Moreover, based on your findings regarding the RetryConnectionDelay values being delayed within the NCache code, our QA engineers have suggested a workaround that might help with the reconnection issue: explicitly set the "RetryConnectionDelay" value to 1 (the minimum value) using CacheInitParams. This is expected to fix the reconnection issue. Please try this workaround in your environment and let me know if your application is now able to connect back to the cache.

Once you share the above information with us, we will work on reproducing this issue. If it is a bug, a fix will be released in an upcoming release of NCache.

Please let me know if you have any questions.

juanmanuelrios commented 5 years ago

Hi Kal, thanks for your response and my apologies for not providing information about our testing environment.

We are using NCache Client 4.9 SP1 and the Pub/Sub sample included in the source code. The .NET Framework target remains unchanged from the sample, so it targets v4.5.2.

A little bit of info about the scenario in which we are facing problems: we have a remote NCache Server on a LAN (with no firewall), connected via TCP on the default port.

A side note about the Pub/Sub sample: in order to be "fault-tolerant" and keep trying to send messages, we made two changes to the publisher. First, we protected the publish method with a try-catch block; second, we used an endless while(true) loop instead of the original for loop, as shown here:

[image: publisher_program]
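The modified publisher itself is only visible in the screenshot, so what follows is a rough sketch of the pattern described above, written in Python rather than the sample's C#. `ConnectionLost`, `publish_with_retry`, and the optional `max_attempts` bound are hypothetical names invented for illustration and are not part of the NCache API.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for the client's connectivity error."""


def publish_with_retry(publish, message, retry_delay=0.0,
                       max_attempts=None, sleep=time.sleep):
    """Keep calling publish(message) until it succeeds.

    Returns the number of attempts that were needed. With
    max_attempts=None the loop runs until success, like while(true).
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            publish(message)
            return attempt
        except ConnectionLost:
            # Give up only when the caller asked for a bound.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(retry_delay)
```

A loop like this only keeps the publisher process alive while the connection is down. It cannot recover by itself if the underlying client never reconnects, which is the behavior reported in this thread.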

Once the changes were made, we set RetryConnectionDelay=5 in the NCache Client config file and started the Pub/Sub sample on a client computer, different from the server. After checking that it was working fine, we unplugged the LAN cable until we saw the "No server is available to process the request" error. Then we plugged it in again, but the publisher never recovered.

Regarding the suggested workaround, repeating the same test scenario gives the same result, as you can see in the following screenshots:

This is the publisher startup configuration, where we set RetryConnectionDelay=1:

[image: publisher config]

We debugged the NCache Client to pinpoint where the problem arises, as you can see in the next picture. CacheInitParams has a value of 1000 (1 second, as configured at Publisher startup), which is OK, but then it is cloned and the value becomes much higher (1000000). Both are shown in the picture as watched values during the debug session.

[image: clientconfiguration]

A final note: if we use RetryConnectionDelay=0, the cache client attempts to reconnect immediately, as we expected, but as soon as we use a value greater than zero the reconnection delay is not what we expected.
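To put the watched values in perspective, here is a small arithmetic sketch in Python. It assumes the delay is held in milliseconds (1000 for a configured 1 second) and that the jump from 1000 to 1000000 is a constant extra factor of 1000. The function name `effective_delay_ms` and the extrapolation to a configured value of 5 are assumptions for illustration, not observed values.

```python
def effective_delay_ms(configured_seconds, extra_factor=1000):
    """Delay the client would wait, assuming an unintended extra factor."""
    intended_ms = configured_seconds * 1000   # seconds -> milliseconds
    return intended_ms * extra_factor         # inflated value after cloning


for seconds in (0, 1, 5):
    ms = effective_delay_ms(seconds)
    print(f"configured {seconds} s -> {ms} ms (~{ms / 60000:.1f} min)")
```

Under these assumptions a configured delay of 1 second becomes 1000000 ms, or about 16.7 minutes, which would look like a client that never reconnects during a short test. A configured 0 stays 0, which is consistent with the immediate reconnection seen with RetryConnectionDelay=0.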

I hope this helps you reproduce it. If it doesn't, please let me know.

Many thanks in advance.

Kal-Alachisoft commented 5 years ago

Update: Juan, we have successfully reproduced the issue in our environment and your findings regarding the issue were on point. The engineering team has identified and confirmed this issue as a bug. We are currently working on a fix, which will be included in our next major NCache release.

For the time being, the workaround is to set the "RetryConnectionDelay" value to "0", as you suggested, and the client will be able to successfully reconnect to the server. I made a typo in my previous comment; the value that you need to set is 0 and not 1.

Thank you for reporting the bug to us. We highly appreciate your valuable feedback.

juanmanuelrios commented 5 years ago

Great!

We look forward to the fix in the next major release.

Thanks for your support.

waleed-anjum commented 5 years ago

Bug fixed in commit https://github.com/Alachisoft/NCache/commit/5e4ec047ead0554415e67a4ed8eeb0129741c7d1. Please verify, and if there is any confusion, please ask. You identified the bug correctly. The Clone method was multiplying the RetryConnectionDelay value by a thousand, which was causing this issue; the Clone method has been corrected accordingly. Thank you for reporting the bug to us. We highly appreciate your valuable feedback.
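For readers who do not want to open the commit, the kind of defect described here can be modeled in a few lines of Python. This is not the NCache code: `ClientSettings`, `clone_buggy`, and `clone_fixed` are hypothetical names, and the sketch assumes the stored delay is already in milliseconds.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ClientSettings:
    """Hypothetical settings object; the delay is assumed to be in ms."""
    retry_connection_delay_ms: int

    def clone_buggy(self):
        # Multiplies by 1000 although the stored value is already
        # in milliseconds, so every clone inflates the delay.
        return ClientSettings(self.retry_connection_delay_ms * 1000)

    def clone_fixed(self):
        # Copies the stored value unchanged into a new instance.
        return replace(self)


settings = ClientSettings(retry_connection_delay_ms=1000)
assert settings.clone_buggy().retry_connection_delay_ms == 1_000_000
assert settings.clone_fixed() == settings
```

A check that a clone compares equal to its original, like the last assertion, is one simple way to catch this class of bug in a regression test.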