IBMStreams / streamsx.dps

The IBMStreams/streamsx.dps GitHub repository is home to the Streams toolkit named DPS (Distributed Process Store)
http://ibmstreams.github.io/streamsx.dps
Other
4 stars 9 forks source link

Redis-Cluster problem #64

Closed a-haroun closed 7 years ago

a-haroun commented 7 years ago

Hello,

We have an issue when using the redis-cluster mode, when the node on which we are connected disconnects, the PE goes unhealthy! it does not try to connect to other nodes of the cluster. I am wondering what is the advantage of using a cluster in this case?

Thanks in advance, Amir.

nysenthil commented 7 years ago

Hi Amir, There was a C++ pointer bug in the Redis-Cluster wrapper client library that caused the PE to crash on a cluster master node outage. I fixed it in the Redis-Cluster wrapper code that we ship in the DPS toolkit. I also made changes in the DPS toolkit to gracefully handle a master node failure in a Redis cluster. Users can do the following in their Streams application code for continued operation when a Redis cluster auto-reconfigures itself during cluster master node outages.

1) In the no-sql-kv-store-servers.cfg file, users can now provide more than one Redis cluster master node information. (It is simply a good idea to specify all the master nodes and then all the slave nodes so that there are more possibilities for the DPS toolkit code to reconnect when a Redis cluster undergoes auto reconfiguration during a master node outage.)

Machine1:7001 Machine2:7002 Machine3:7003 Machine4:7004

2) In the Streams application code, whenever a DPS API is called, its return error code can be checked to see if there is a sudden connection failure in the Redis cluster. If a connection failure is detected, application code should call a local user written function to reconnect to the Redis cluster as shown below.

mutable uint64 err = 0ul; rstring myKey = “TestKey”; rstring myValue = “TestValue”; dpsPut(storeHandle, myKey, myValue, err);

// DPS_CONNECTION_ERROR = 102, DL_CONNECTION_ERROR = 501 if(err == 102ul || err == 501ul) { // When this function returns true, you are reconnected to the // Redis cluster and you can retry your failed API. // When it returns false, it means you are not connected to the Redis cluster. reconnectWithRedisCluster(100); }

public boolean reconnectWithRedisCluster(int32 maxConnectionAttemptCnt) { mutable int32 attemptCnt = 0;

while(dpsReconnect() == false) { if (++attemptCnt >= maxConnectionAttemptCnt) { // Giving up after the specified number of reconnection attempts. return(false); }

  // Wait before trying to connect again.
  block(2.0);

}

// Successfully reconnected to the Redis cluster. return(true); }

3) We will do additional testing in our lab and make a new release in the master branch in another 3 weeks. In the meantime, a development version of the DPS toolkit with this fix can be obtained from here for your local testing.

https://github.com/IBMStreams/streamsx.dps/tree/develop

4) That open source Redis cluster client C++ wrapper pointer bug will certainly haunt other Redis cluster users outside the context of Streams as well. So, I alerted the author of that open source code to apply that simple fix in his original code.

https://github.com/shinberg/cpp-hiredis-cluster/issues/25#issuecomment-326489160

a-haroun commented 7 years ago

Perfect, thank you very much @nysenthil, you have done a greate job! I will test within streams and i'll let you know

Regards, Amir

a-haroun commented 7 years ago

Update: @nysenthil, During the make, i saw that the toolkit uses an external directory (impl/ext/lib and include). However streams sab file does not take this directory into account so it does not copy it into the bundle. So during execution we have the error saying that the job does not find some libraries situated in the ext directory. Have you done some tests on streams, otherwise can we use just the standard libraries lib directory?

Thank you in advance, Amir.

nysenthil commented 7 years ago

Hi Amir, We committed that fix and made a new DPS toolkit release in the master branch as version 3.2.1 https://github.com/IBMStreams/streamsx.dps.

I just now downloaded that DPS toolkit version and built the toolkit first and then built the samples/advanced/004_XXXX project present inside the toolkit directory. I ran it against a 10 node Redis cluster and it worked fine. I used Streams 4.2.1 running on RHEL7. Could you please try with the DPS toolkit version 3.2.1?

brandtol commented 7 years ago

@chantalokh If the problem persists, can you send the output of "spl-app-info --files sabfilename" command for the sab in question.

a-haroun commented 7 years ago

Dear all, Thank you for your response, During compilation we have the following error (we had it since the last version but i forgot to tell you about it). By the way, we are using SLES11 os. In file included from ext/include/libmemcached-1.0/memcached.h:85, from ./include/MemcachedDBLayer.h:55, from src/MemcachedDBLayer.cpp:84: ext/include/libmemcached-1.0/struct/sasl.h:39:23: error: sasl/sasl.h: No such file or directory In file included from ext/include/libmemcached-1.0/memcached.h:85, from ./include/MemcachedDBLayer.h:55, from src/MemcachedDBLayer.cpp:84: ext/include/libmemcached-1.0/struct/sasl.h:47: error: expected ‘;’ before ‘’ token In file included from ext/include/libmemcached-1.0/memcached.h:120, from ./include/MemcachedDBLayer.h:55, from src/MemcachedDBLayer.cpp:84: ext/include/libmemcached-1.0/sasl.h:52: error: expected ‘,’ or ‘...’ before ‘’ token ext/include/libmemcached-1.0/sasl.h:64: error: expected constructor, destructor, or type conversion before ‘*’ token make[1]: [MemcachedDBLayer.o] Error 1 make[1]: Leaving directory `/users/bgd00/libext/toolkits/toto/streamsx.dps-master/com.ibm.streamsx.dps/impl' make: [com.ibm.streamsx.dps/impl.all] Error 2

But since we are not using memcached, i removed it from the makefile. We will test the new versions and we'll let you know.

Amir

brandtol commented 7 years ago

I checked the output of our build and did not see an error for SLES11. But it is likely that your environment differs from our build machines environment. Can you attach a log file for the failed build ? make clean all >/tmp/buildlog.txt 2>&1

a-haroun commented 7 years ago

Please find attached the log file buildlog.txt

Thank you

brandtol commented 7 years ago

Hi Amir, it seems there is a package missing on your system. Can you try to install "cyrus-sasl-devel" ? On our build system, this is the package that includes the missing header files.

I think we need to document this dependency in the wiki page, or precompile the memcached package without SASL.

a-haroun commented 7 years ago

Hi, Update: We just tested the new toolkit and it works perfectly, the dpsReconnect() now takes into account the whole list of servers of the cluster.

For the memcached problem, i think that if we add the missing library, it will work, but for now we are not using memcached so its ok.

Thank you all for your support, we can close the issue if everybody is OK. Amir.

nysenthil commented 7 years ago

Hi Amir, It is great to know that your redis-cluster HA problem with the DPS toolkit is now taken care of with our fix. Before we can close the issue, could you please tell us which Streams customer you are part of and in which country so that we can keep it in our internal records about which customer faced this problem, when and how the fix resolved the issue etc.

Thank you.

a-haroun commented 7 years ago

Hi, well i am using my personal github account, the customer is PSA Group from France.

Regards, Amir

nysenthil commented 7 years ago

Hi Amir, Thank you for leading us to debug this issue and fix it. This is a good one now for handling Redis cluster outages all at the level of DPS configuration and inside the user application code. I'm closing this issue now.