bossadvisors / memcached-session-manager

Automatically exported from code.google.com/p/memcached-session-manager

node failure handling fails due to spymemcached 2.7.3 to 2.8.12 update #167

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. start tomcat
2. shutdown couchbase cluster
3. enter tomcat page
4. start couchbase cluster
5. enter tomcat page

Version: 1.6.5

The update from spymemcached 2.7.3 to 2.8.12 split the library into separate 
couchbase and memcached clients. Along with that split, the connection factory 
that is responsible for redistributing requests addressed to a dead node was 
changed without this being noticed (before: BinaryConnectionFactory, now: 
CouchbaseConnectionFactoryBuilder). The attribute that determines the failure 
handling is now CouchbaseConnectionFactory.DEFAULT_FAILURE_MODE = 
FailureMode.Retry (before: DefaultConnectionFactory.DEFAULT_FAILURE_MODE = 
FailureMode.Redistribute).

This results in continuous requests to a dead node that are already canceled and 
should be switched back.

Redistribute: In this failure mode, the failure of a node will cause its 
current queue and future requests to move to the next logical node in the 
cluster for a given key.
Retry: This failure mode is appropriate when you have a rare short downtime of 
a memcached node that will be back quickly, and your app is written to not wait 
very long for async command completion.
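
To make the difference between the modes concrete, here is a minimal, 
self-contained sketch. It is a toy model only, not spymemcached's actual 
implementation: the class FailureModeSketch, the method route, the node names 
and the hash-modulo node selection are all assumptions made for illustration. 
It also models Cancel, the third mode discussed in the comments of this issue.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Toy model of client failure modes (hypothetical, not spymemcached code). */
final class FailureModeSketch {

    enum FailureMode { REDISTRIBUTE, RETRY, CANCEL }

    /**
     * Picks the node that receives an operation for the given key.
     * Returns an empty Optional when the operation is cancelled.
     * Assumption: the primary node is chosen by hash modulo node count.
     */
    static Optional<String> route(String key, List<String> nodes,
                                  Set<String> deadNodes, FailureMode mode) {
        int primaryIndex = Math.floorMod(key.hashCode(), nodes.size());
        String primary = nodes.get(primaryIndex);
        if (!deadNodes.contains(primary)) {
            return Optional.of(primary);
        }
        if (mode == FailureMode.RETRY) {
            // Keep addressing the dead node and wait for it to come back.
            return Optional.of(primary);
        }
        if (mode == FailureMode.CANCEL) {
            // Give up immediately so the caller can handle the failure itself.
            return Optional.empty();
        }
        // REDISTRIBUTE: walk to the next logical node that is still alive.
        for (int i = 1; i < nodes.size(); i++) {
            String candidate = nodes.get((primaryIndex + i) % nodes.size());
            if (!deadNodes.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // every node is down
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-0", "node-1");
        Set<String> dead = Set.of("node-1");
        for (FailureMode mode : FailureMode.values()) {
            System.out.println(mode + " -> " + route("a", nodes, dead, mode));
        }
    }
}
```

In this model, Retry keeps returning the dead node for as long as it stays 
down, which corresponds to the repeated requests visible in the log below, 
while Redistribute moves the operation to a live node.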

Reconnecting {QA sa=localhost/127.0.0.1:11210, #Rops=0, #Wops=6, #iq=0, 
topRop=null, topWop=Cmd: 0 Op
Connection state changed for sun.nio.ch.SelectionKeyImpl@16ab5303
Reconnecting due to exception on {QA sa=localhost/127.0.0.1:11210, #Rops=7, 
#Wops=0, #iq=0, topRop=Cm
java.io.IOException: Disconnected unexpected, will reconnect.
        at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:526)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:430)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:247)
        at com.couchbase.client.CouchbaseConnection.run(CouchbaseConnection.java:265)
Closing, and reopening {QA sa=localhost/127.0.0.1:11210, #Rops=7, #Wops=0, 
#iq=0, topRop=Cmd: 0 Opaqu
Discarding partially completed op: Cmd: 0 Opaque: 238 Key: 
1B3A927693D62D89DFD1595A5
Discarding partially completed op: Cmd: 0 Opaque: 239 Key: 
1B3A927693D62D89DFD1595A5
Could not load session with id 1B3A927693D62D89DFD1595A521E6F61 from memcached.
java.util.concurrent.CancellationException: Cancelled
        at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:170)
        at net.spy.memcached.internal.GetFuture.get(GetFuture.java:62)
        at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1026)
        at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1051)
        at de.javakaffee.web.msm.MemcachedSessionService.loadFromMemcached(MemcachedSessionService.java:1112)

pull request: https://github.com/magro/memcached-session-manager/pull/27

Original issue reported on code.google.com by hajo.kli...@gmail.com on 1 Jul 2013 at 3:17

GoogleCodeExporter commented 9 years ago
Ouch, good catch!

For couchbase I agree that FailureMode.Redistribute is the correct choice.
For memcached though I'd say FailureMode.Cancel should be used, so that msm can 
kick in with its redistribution (I wonder why this was not covered before). 
What do you think?

Original comment by martin.grotzke on 1 Jul 2013 at 3:40

GoogleCodeExporter commented 9 years ago
I'm not using memcached, but I think it is not a good idea to cancel the whole 
operation. If you have multiple memcached instances, the session is stored on 
each instance. If one instance goes down, I think it is a better idea to 
redistribute the operation to a running one.

Original comment by hajo.kli...@gmail.com on 2 Jul 2013 at 3:16

GoogleCodeExporter commented 9 years ago
Fixed, pull request is merged.

I also checked the failure mode for memcached connections and it's in fact set 
to Cancel already (in msm specific connection factories).

Original comment by martin.grotzke on 24 Aug 2013 at 10:22

GoogleCodeExporter commented 9 years ago

Original comment by martin.grotzke on 20 Dec 2013 at 10:20