ZoroXing / xmemcached

Automatically exported from code.google.com/p/xmemcached
Apache License 2.0

Spontaneous batches of TimeoutExceptions after a few hours of running #194

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
There is a platform which uses xmemcached under a small load (about 1000-2000 
transactions per second). Everything works fine for some time, then after a few 
hours a batch (~100-200) of TimeoutExceptions appears in the log:

2012-04-27 04:40:18,447 [ER-3] ERROR mem-api.cache      : 
XMemcachedClientConnector.get() key:DRT:pcef1.Gx;1317897826;24601921, 
getBeforeCount:49206377, getAfterCount:49206368, getDeltaCount:-9, 
getLatencyMax:126, getLatencyAvg:0.21611751552156827
2012-04-27 04:40:18,448 [ER-3] WARN  sm.generated.TestApplication : #### 
Unexpected exception caught: java.util.concurrent.TimeoutException: Timed 
out(1000) waiting for operation
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timed 
out(1000) waiting for operation
    at com.test.memcache.impl.XMemcachedClientConnector.get(XMemcachedClientConnector.java:91)
    at com.test.memcache.impl.MemCacheImpl.get(MemCacheImpl.java:223)
    at com.test.app.d2d.MemCacheManager.loadSessionState(MemCacheManager.java:55)

Then everything works fine for another few hours until the next batch of 
TimeoutExceptions.
I took a tcpdump during the TimeoutExceptions, and according to it there are no 
packets towards repcached corresponding to the timed-out "get" operations. This 
means that xmemcached retains the application's get commands and doesn't send 
them to repcached.

I also took a thread dump during the TimeoutExceptions. All application threads 
(ER-[number]) are waiting in latchWait():

"ER-8" prio=10 tid=0x00002aaabd0f5000 nid=0x43cb waiting on condition 
[0x00000000446d6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000077f570b98> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1011)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2526)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:574)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:971)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:929)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:940)
    at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:962)
    at com.test.memcache.impl.XMemcachedClientConnector.get(XMemcachedClientConnector.java:61)

GC pauses before and during the TimeoutExceptions are less than 1 second.
Please find attached the gc.log, the application log, TCP dumps from before and 
after the TimeoutExceptions, and thread dumps from before and after the 
TimeoutExceptions.

version of the product:
-----------------------
memcached 1.4.10
http://mdounin.ru/files/repcached-2.3.1-1.4.10.patch.gz
xmemcached-1.3.6
OS: RHEL 5.5
Java 1.7u3
2 CPU E5462 @ 2.80GHz, 8 HW threads in total

Original issue reported on code.google.com by anton.mi...@gmail.com on 28 Apr 2012 at 3:32


GoogleCodeExporter commented 8 years ago
I am sorry, I am on holiday right now.

Are you sure that repcached was running well when this exception was thrown? Or 
might it have been blocked at that moment?

From your description, I can't give you a reasonable explanation.

Original comment by killme2...@gmail.com on 6 May 2012 at 2:56

GoogleCodeExporter commented 8 years ago
No, repcached is not blocked. If repcached were the problem, I would see 
outgoing GET requests in the TCP dump and no responses.
Instead, I see neither outgoing GET requests nor responses in the TCP dump.

Also, we replaced xmemcached with spymemcached, and everything works just fine.
The issue is in xmemcached.

Original comment by anton.mi...@gmail.com on 6 May 2012 at 5:46

GoogleCodeExporter commented 8 years ago
When you used spymemcached, did you still set the operation timeout to one 
second via future.get?
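
For context, this is how a per-operation timeout is usually applied with spymemcached (a minimal sketch; the client setup and key are placeholders):

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import net.spy.memcached.MemcachedClient;

    class CacheReader {
        // asyncGet() queues the operation and returns a future immediately;
        // get(1, TimeUnit.SECONDS) bounds the wait and throws TimeoutException
        // if no response arrives within one second.
        static Object boundedGet(MemcachedClient client, String key)
                throws TimeoutException, InterruptedException, ExecutionException {
            return client.asyncGet(key).get(1, TimeUnit.SECONDS);
        }
    }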

Original comment by killme2...@gmail.com on 6 May 2012 at 6:53

GoogleCodeExporter commented 8 years ago
You saw neither outgoing GET requests nor responses in the TCP dump, but that 
doesn't mean repcached had no problem. There is a situation where xmemcached 
cannot send requests to repcached for some reason.

Also, have you checked the JVM heap at that moment?

Original comment by killme2...@gmail.com on 6 May 2012 at 7:14

GoogleCodeExporter commented 8 years ago
What's your xmemcached configuration? I will try to reproduce it on my 
machine. Thanks.

Original comment by killme2...@gmail.com on 6 May 2012 at 7:32

GoogleCodeExporter commented 8 years ago
And did you use noreply operations, or some different methods that you don't 
use with spymemcached?

Original comment by killme2...@gmail.com on 6 May 2012 at 7:33

GoogleCodeExporter commented 8 years ago
"When you used spymemcached,you set the operation timeout still to be one 
second by future.get?"
Yes, timeout is the same 1 second. We have a wrapper component on top of 
memcached client, which allows us to choose either xmemcached or spymemcached. 
Everything (call flow, configuration, ...) is the same.

"You didn't see in TCP dump neither outgoing GET requests, no responses,but it 
didn't mean repcached was no problem.There is a situation that xmemcached could 
not send requests to repcached for some reason."
In this case why spymemcached is able to send requests, when xmemcached is not 
able?

"I don't know have you checked the JVM heap at that moment?"
There is a gc.log in the attachment above. All stop-the-world pauses are much 
less than 1 second.

"And did you use noreply operation,or some difference methods that you don't 
use in spymemcached?"
there is no difference from application point of view.
In both xmemcached and spymemcached configurations the application is doing 
"GET" and "SET noreply" operations.

Original comment by anton.mi...@gmail.com on 6 May 2012 at 8:38

GoogleCodeExporter commented 8 years ago
What's the meaning of this:
"2012-04-27 04:40:34,715 [ER-2] ERROR mem-api.cache      : XMemcachedClientConnector.get() key:DRT:pcef1.Gx;1317897826;24602895, getBeforeCount:49208437, getAfterCount:49208149, getDeltaCount:-288, getLatencyMax:126, getLatencyAvg:0.21612556489373336"

Does it mean that get returned an error value?

And are you using the memcached text protocol or binary?

What's your key and value, for example?

Original comment by killme2...@gmail.com on 6 May 2012 at 8:50

GoogleCodeExporter commented 8 years ago
"What's the meaning of this"
This is the code, it increments counters before and after "get" operation, also 
latency is mesaured:

    // getBeforeCount, getAfterCount, getLatencyMax and getLatencySum are
    // java.util.concurrent.atomic.AtomicLong fields; client is the memcached client.
    public Object get(String key) {
        try {
            // Count the call and timestamp it before the cache round-trip.
            getBeforeCount.incrementAndGet();
            long before = System.currentTimeMillis();

            Object result = client.get(key);

            // Record latency statistics after a successful get.
            long latency = System.currentTimeMillis() - before;
            if (getLatencyMax.get() < latency) {
                getLatencyMax.set(latency);
            }
            getLatencySum.addAndGet(latency);
            getAfterCount.incrementAndGet();
            return result;
        } catch (Exception e) {
            // On failure (e.g. TimeoutException), log the counters and rethrow.
            long lGetBeforeCount = getBeforeCount.get();
            long lGetAfterCount = getAfterCount.get();
            log.error("XMemcachedClientConnector.get() key:" + key + ", getBeforeCount:" + lGetBeforeCount
                    + ", getAfterCount:" + lGetAfterCount + ", getDeltaCount:" + (lGetAfterCount - lGetBeforeCount)
                    + ", getLatencyMax:" + getLatencyMax.get()
                    + ", getLatencyAvg:" + ((double) getLatencySum.get() / getAfterCount.get()));
            throw new RuntimeException(e);
        }
    }

"And are you using the memcached text protocol or binary?"
The text protocol, but the issue also reproduces with the binary protocol.

"What's your key and value for example?"
key: DRT:pcef1.Gx;1317897826;5360
value: 1%server;18050538;2187704472|2%server;18050538;2187704473|

Original comment by anton.mi...@gmail.com on 6 May 2012 at 11:08

GoogleCodeExporter commented 8 years ago
Did you have any periodic tasks? If you did, were these tasks executing when 
the timeout exceptions happened?

Original comment by killme2...@gmail.com on 6 May 2012 at 11:26

GoogleCodeExporter commented 8 years ago
No periodic tasks. 

Original comment by anton.mi...@gmail.com on 6 May 2012 at 1:00

GoogleCodeExporter commented 8 years ago
As far as I know, spymemcached doesn't have a noreply operation API. How do 
you use it?

Original comment by killme2...@gmail.com on 6 May 2012 at 1:15

GoogleCodeExporter commented 8 years ago
I found many "omf.overload-manager : System OverloadDetected event:" logs in 
system.log.

Does that mean your system is overloaded? What was the TPS of your set-noreply 
operations at that moment?

Xmemcached's noreply operations just put set commands into a queue in the 
connection; if there are many set-noreply requests at one time, the queue may 
become full, and get operations cannot be sent to memcached.
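
To illustrate the mechanism described above (a toy sketch, not xmemcached's actual internals): a bounded per-connection write queue filled by noreply sets delays everything queued behind it, including gets.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class WriteQueueDemo {
        public static void main(String[] args) throws InterruptedException {
            // A deliberately tiny bound; the real queue is much larger.
            BlockingQueue<String> writeQueue = new ArrayBlockingQueue<>(2);
            writeQueue.put("set-noreply key1");
            writeQueue.put("set-noreply key2");
            // The queue is now full: a get command cannot even be enqueued
            // until the writer thread drains it, so the caller waits and may
            // eventually hit its operation timeout.
            boolean accepted = writeQueue.offer("get key3");
            System.out.println("get enqueued? " + accepted); // prints: get enqueued? false
        }
    }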

Original comment by killme2...@gmail.com on 6 May 2012 at 1:29

GoogleCodeExporter commented 8 years ago
"As i know that spymemcached doesn't have noreply operation API,how do you use 
it?"
There is a method 
net.spy.memcached.MemcachedClient: public Future<Boolean> set(String key, int 
exp, Object o)
We call it and don't use "Future" object, so it is almost like "noreply" 
operation.
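
In other words (a minimal sketch of the pattern just described; the key, expiry, and value are placeholders):

    import java.util.concurrent.Future;
    import net.spy.memcached.MemcachedClient;

    // set() queues the operation and returns a Future immediately; by never
    // calling future.get(), the caller does not wait for the server's reply,
    // which approximates memcached's "noreply" behaviour.
    void fireAndForgetSet(MemcachedClient client, String key, Object value) {
        Future<Boolean> ignored = client.set(key, 3600, value); // 3600s expiry is an example
    }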

"I found that many " omf.overload-manager : System OverloadDetected event:"  
logs in system.log.
It that means your system is over load?"  

It means that incoming queue size for working thread pool reached 5000 
threshold. And  that happended after first TimeoutExceptions, that happened 
because of TimeoutExceptions (and not vice versa), because xmemcached was 
holding working threads for 1 second.

"What's the TPS of your set-noreply operations at that moment?"

it was 200 TPS, the same as TPS of "get" operations.

Original comment by anton.mi...@gmail.com on 6 May 2012 at 6:15

GoogleCodeExporter commented 8 years ago
"What's your xmemcached configuration,i will try to reproduce it on my 
machine.Thanks."

        //singleAdresList corresponds to "127.0.0.1:8091"
        XMemcachedClientBuilder builder = new XMemcachedClientBuilder(singleAddressList);
        builder.setSocketOption(StandardSocketOption.TCP_NODELAY, true);
        builder.setConnectionPoolSize(8);
        builder.setFailureMode(true);
        MemcachedClient client = builder.build();

Original comment by anton.mi...@gmail.com on 7 May 2012 at 12:58

GoogleCodeExporter commented 8 years ago
Thank you.

I will try my best to reproduce this issue.
If you can, please use set instead of setWithNoReply in your application and 
run it again to check whether the issue still happens.

I think the setWithNoReply method may be the root cause.
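
For reference, the suggested swap is a one-line change (a sketch; key, expiry, and value are placeholders, and both methods are part of xmemcached's MemcachedClient interface):

    import java.util.concurrent.TimeoutException;
    import net.rubyeye.xmemcached.MemcachedClient;
    import net.rubyeye.xmemcached.exception.MemcachedException;

    void storeExamples(MemcachedClient client, String key, Object value)
            throws TimeoutException, InterruptedException, MemcachedException {
        // Fire-and-forget: the command is only queued locally; no response is awaited.
        client.setWithNoReply(key, 3600, value);
        // Synchronous: blocks until the server acknowledges (or the operation
        // timeout elapses), which naturally throttles the sender.
        boolean stored = client.set(key, 3600, value);
    }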

Original comment by killme2...@gmail.com on 7 May 2012 at 1:11

GoogleCodeExporter commented 8 years ago
Hi, it seems that this issue is caused by too many noreply operations. Because 
noreply operations don't wait for a response from the server, they are just 
put into a queue in application memory and wait to be sent to memcached.
If there are too many noreply operations in the queue, they will block the 
other operations and make their response times too long.

In xmemcached 1.3.8 I've added a flow-control mechanism for noreply 
operations: use the setMaxQueuedNoReplyOperations method on the builder or 
factory bean to set the maximum number of queued noreply operations in memory. 
The default value is based on your JVM's maximum memory (the -Xmx option).
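
For example (the address and the limit of 10000 are illustrative values, not recommendations):

    import net.rubyeye.xmemcached.MemcachedClient;
    import net.rubyeye.xmemcached.XMemcachedClientBuilder;
    import net.rubyeye.xmemcached.utils.AddrUtil;

    XMemcachedClientBuilder builder =
            new XMemcachedClientBuilder(AddrUtil.getAddresses("127.0.0.1:8091"));
    // Cap the in-memory noreply queue so bursts of setWithNoReply calls
    // cannot crowd out the other operations waiting on the connection.
    builder.setMaxQueuedNoReplyOperations(10000);
    MemcachedClient client = builder.build();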

If you still want to try xmemcached, please try 1.3.8. Thanks.

Original comment by killme2...@gmail.com on 17 Jul 2012 at 5:33

GoogleCodeExporter commented 8 years ago
We worked around this issue by replacing "setWithNoReply" with "set".

"Hi,it seems that this issue is caused by too many noreply operations.Because 
noreply opertions doesn't wait response from server,so it just be put into a 
queue in application memory and waited to be sent to memcached.
If there are too many noreply operations in queue,it will block the other 
operations and make their response time too large."

That is really strange, because in my use case there is the same number of 
"get" and "setWithNoReply" operations. "get" operation is waiting for response, 
while "setWithNoReply" operation doesn't wait, so "setWithNoReply" operation is 
expected to be faster.
Instead we see that "setWithNoReply" operations stuck and "get" operations 
don't stuck.

"In xmemcached 1.3.8, i've added a flow control mechanism for noreply 
operations, use setMaxQueuedNoReplyOperations method in builder or factory bean 
to set the maximum queued noreply operations in memory,the default value is 
based on your jvm maximum memory by -Xmx option."

I think, that the root cause of the issue is a bug in concurrency in 
xmemcached, and flow control mechanism doesn't fix this bug.

Original comment by anton.mi...@gmail.com on 3 Aug 2012 at 8:05

GoogleCodeExporter commented 8 years ago
We see this problem too, after a few hours of operation.
We were using 1.3.8 and upgraded yesterday to 1.4.1, but the problem still exists.

java.util.concurrent.TimeoutException: Timed out(5000) waiting for operation
       at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2565)
       at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2016)
       at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2029)

We use this to create the connection(s):

    MemcachedClientBuilder builder = new XMemcachedClientBuilder(AddrUtil.getAddresses(cacheConnection));
    builder.setCommandFactory(new BinaryCommandFactory()); // use the binary protocol
    _memcachedClient = builder.build();

and then use it this way:

    if (_hasGetAndTouch)
    {
        // Unfortunately getAndTouch is not implemented on the server side
        retVal = _mClient.getAndTouch(finalKey, MAX_CACHE_LIFETIME);
    }
    else
    {
        retVal = _mClient.get(finalKey);
    }

So as you can see, we have the problem with the getAndTouch command, not with 
the set command.

Could it be that the connection is idle for too long and is therefore 
half-closed in some way by either the server or the client?
The errors start to appear ~18 hours after we first make contact with the 
cache, and the connections we have established are held open for reuse.

Original comment by an...@schild.ws on 15 Mar 2013 at 7:30

GoogleCodeExporter commented 8 years ago
Here is a URL with more details on timeouts in memcached:

http://code.google.com/p/memcached/wiki/Timeouts

Original comment by an...@schild.ws on 15 Mar 2013 at 9:34

GoogleCodeExporter commented 8 years ago
@andre

Have you tried all the solutions in this document: 
http://code.google.com/p/memcached/wiki/Timeouts ?

If so, maybe there is an issue in xmemcached; I will have to find it.

If the connection is idle, xmemcached sends a 'version' command to the server 
to make sure the connection is alive. If the heartbeat fails, it closes the 
connection and tries to heal it.
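
Heartbeating can be toggled on the built client; a minimal sketch (the builder setup is a placeholder, as configured elsewhere in this thread):

    // Heartbeating is enabled by default; disabling it stops the periodic
    // 'version' probe on idle connections (useful when isolating this bug).
    MemcachedClient client = builder.build();
    client.setEnableHeartBeat(false);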

Original comment by killme2...@gmail.com on 15 Mar 2013 at 10:09

GoogleCodeExporter commented 8 years ago
Yes, but with no obvious errors found:

- listen_disabled_num is not a problem
- the servers are all 64-bit, are not using swap, and have low load
- dropping packets is not an issue either, since one of the test servers has 
no (local) firewall active
- TIME_WAIT is also not a problem

Original comment by an...@schild.ws on 15 Mar 2013 at 10:50

GoogleCodeExporter commented 8 years ago
@andre

Did you disable heartbeating?

And can you find any text like "heartbeat fail" in your log?

Lastly, please turn on your Java GC log to check whether a full GC was running 
when the timeout happened.

Original comment by killme2...@gmail.com on 15 Mar 2013 at 11:24

GoogleCodeExporter commented 8 years ago
No, I neither enabled nor disabled heartbeating.

No, I don't see any log entries containing "heartbeat".

Here is the Tomcat error log:

2013-03-20 03:24:28,837 [webseite2] [Thread-412] ERROR com.aarboard.aarcatcore.cache.xmemcached.XMemcacheCache - Error in cache for key: all
java.util.concurrent.TimeoutException: Timed out(5000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2565)
    at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2016)
    at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2029)
    at com.aarboard.aarcatcore.cache.xmemcached.XMemcacheCache.getEntry(XMemcacheCache.java:73)

And the corresponding GC logs:

2013-03-20T03:24:00.628+0100: 69631.200: [GC 69631.200: [ParNew: 26240K->674K(29504K), 0.0366990 secs] 1563479K->1537913K(1569600K) icms_dc=100 , 0.0373240 secs
2013-03-20T03:24:00.907+0100: 69631.478: [GC 69631.479: [ParNew: 26914K->442K(29504K), 0.0321190 secs] 1564153K->1537681K(1569600K) icms_dc=100 , 0.0325630 secs
2013-03-20T03:24:01.134+0100: 69631.705: [GC 69631.706: [ParNew: 26682K->605K(29504K), 0.0299910 secs] 1563921K->1537844K(1569600K) icms_dc=100 , 0.0304360 secs
2013-03-20T03:24:01.345+0100: 69631.917: [GC 69631.917: [ParNew: 26845K->794K(29504K), 0.0300840 secs] 1564084K->1538034K(1569600K) icms_dc=100 , 0.0305380 secs
2013-03-20T03:24:01.565+0100: 69632.137: [GC 69632.137: [ParNew: 27034K->743K(29504K), 0.0355330 secs] 1564274K->1538187K(1569600K) icms_dc=100 , 0.0361180 secs
2013-03-20T03:24:02.230+0100: 69632.802: [GC 69632.802: [ParNew: 26983K->713K(29504K), 0.0311850 secs] 1564427K->1538274K(1569600K) icms_dc=100 , 0.0316440 secs
2013-03-20T03:24:03.049+0100: 69633.621: [GC 69633.621: [ParNew: 26953K->937K(29504K), 0.0320370 secs] 1564514K->1538604K(1569600K) icms_dc=100 , 0.0325100 secs
2013-03-20T03:24:03.857+0100: 69634.429: [GC 69634.429: [ParNew: 27177K->782K(29504K), 0.0304030 secs] 1564844K->1538547K(1569600K) icms_dc=100 , 0.0308760 secs
2013-03-20T03:24:04.719+0100: 69635.290: [GC 69635.301: [ParNew: 27022K->1058K(29504K), 0.0350630 secs] 1564787K->1538913K(1569600K) icms_dc=100 , 0.0356500 sec
2013-03-20T03:24:05.602+0100: 69636.173: [GC 69636.174: [ParNew: 27298K->612K(29504K), 0.0350860 secs] 1565153K->1538577K(1569600K) icms_dc=100 , 0.0355900 secs
2013-03-20T03:24:06.443+0100: 69637.015: [GC 69637.015: [ParNew: 26852K->806K(29504K), 0.0329730 secs] 1564817K->1538881K(1569600K) icms_dc=100 , 0.0334430 secs
2013-03-20T03:24:07.279+0100: 69637.851: [GC 69637.851: [ParNew: 27046K->940K(29504K), 0.0326340 secs] 1565121K->1539122K(1569600K) icms_dc=100 , 0.0331870 secs
2013-03-20T03:24:08.102+0100: 69638.673: [GC 69638.674: [ParNew: 27180K->633K(29504K), 0.0307200 secs] 1565362K->1538902K(1569600K) icms_dc=100 , 0.0313650 secs
2013-03-20T03:24:08.951+0100: 69639.522: [GC 69639.523: [ParNew: 26873K->700K(29504K), 0.0307750 secs] 1565142K->1539057K(1569600K) icms_dc=100 , 0.0312460 secs
2013-03-20T03:24:09.197+0100: 69639.768: [CMS-concurrent-mark: 8.422/8.953 secs] [Times: user=12.14 sys=0.86, real=8.95 secs]
2013-03-20T03:24:09.197+0100: 69639.769: [CMS-concurrent-preclean-start]
2013-03-20T03:24:09.209+0100: 69639.781: [CMS-concurrent-preclean: 0.011/0.012 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2013-03-20T03:24:09.210+0100: 69639.782: [CMS-concurrent-abortable-preclean-start]
2013-03-20T03:24:09.803+0100: 69640.375: [GC 69640.375: [ParNew: 26940K->884K(29504K), 0.0306980 secs] 1565297K->1539325K(1569600K) icms_dc=100 , 0.0413230 secs
2013-03-20T03:24:10.297+0100: 69640.868: [CMS-concurrent-abortable-preclean: 0.097/1.086 secs] [Times: user=0.50 sys=0.08, real=1.09 secs]
2013-03-20T03:24:10.298+0100: 69640.870: [GC[YG occupancy: 15521 K (29504 K)]69640.870: [Rescan (parallel) , 0.0418540 secs]69640.912: [weak refs processing, 0.
2013-03-20T03:24:10.342+0100: 69640.913: [CMS-concurrent-sweep-start]
2013-03-20T03:24:10.720+0100: 69641.292: [GC 69641.292: [ParNew: 27124K->1554K(29504K), 0.0325720 secs] 1565564K->1540080K(1569600K) icms_dc=100 , 0.0330720 sec
2013-03-20T03:24:11.148+0100: 69641.720: [GC 69641.720: [ParNew (promotion failed): 27794K->29504K(29504K), 0.0895560 secs]69641.810: [CMS2013-03-20T03:24:12.98 (concurrent mode failure): 1539703K->1539063K(1540096K), 14.0319360 secs] 1566320K->1539063K(1569600K), [CMS Perm : 141329K->141254K(266856K)] icms_dc=100 , 14
2013-03-20T03:24:25.344+0100: 69655.916: [GC [1 CMS-initial-mark: 1539063K(1540096K)] 1542126K(1569600K), 0.0301270 secs] [Times: user=0.03 sys=0.00, real=0.03
2013-03-20T03:24:25.377+0100: 69655.948: [CMS-concurrent-mark-start]
2013-03-20T03:24:30.158+0100: 69660.730: [GC 69660.730: [ParNew: 26240K->26240K(29504K), 0.0000450 secs]69660.730: [CMS2013-03-20T03:24:33.221+0100: 69663.792: (concurrent mode failure): 1539063K->1538864K(1540096K), 15.2470910 secs] 1565303K->1538864K(1569600K), [CMS Perm : 141299K->141296K(266856K)] icms_dc=100 , 15
2013-03-20T03:24:45.436+0100: 69676.007: [GC [1 CMS-initial-mark: 1538864K(1540096K)] 1539159K(1569600K), 0.0233200 secs] [Times: user=0.03 sys=0.00, real=0.03
2013-03-20T03:24:45.460+0100: 69676.032: [CMS-concurrent-mark-start]
2013-03-20T03:24:45.868+0100: 69676.440: [GC 69676.440: [ParNew: 26240K->2725K(29504K), 0.0297190 secs] 1565104K->1541589K(1569600K) icms_dc=100 , 0.0301970 sec
2013-03-20T03:24:46.153+0100: 69676.725: [GC 69676.725: [ParNew: 28965K->709K(29504K), 0.0317680 secs] 1567829K->1539573K(1569600K) icms_dc=100 , 0.0322110 secs
2013-03-20T03:24:46.427+0100: 69676.999: [GC 69676.999: [ParNew: 26949K->486K(29504K), 0.0298070 secs] 1565813K->1539351K(1569600K) icms_dc=100 , 0.0302500 secs
2013-03-20T03:24:46.693+0100: 69677.264: [GC 69677.265: [ParNew: 26726K->682K(29504K), 0.0302160 secs] 1565591K->1539547K(1569600K) icms_dc=100 , 0.0306790 secs
2013-03-20T03:24:46.934+0100: 69677.506: [GC 69677.506: [ParNew: 26922K->648K(29504K), 0.0307740 secs] 1565787K->1539645K(1569600K) icms_dc=100 , 0.0312550 secs

I noticed that the first "(concurrent mode failure)" shows up at 
2013-03-20T03:16:59.060+0100, and then it shows up more and more often, until 
it appears on every GC run.

http://pbrd.co/YqjeeJ

Unfortunately I'm not sure what this means... :(

Original comment by an...@schild.ws on 20 Mar 2013 at 7:35

GoogleCodeExporter commented 8 years ago
Hi, from your GC log it seems that your application's full GC pauses are too 
long:

[CMS-concurrent-mark: 8.422/8.953 secs]

The mark phase took 8 seconds and stopped all application threads (stop-the-
world), so at that point there would be many timeout exceptions in xmemcached.

And "concurrent mode failure" means your JVM arguments have some problems; 
when it happens, the JVM cannot collect in CMS mode and instead stops the 
whole world. For more information, see this blog:

https://blogs.oracle.com/jonthecollector/entry/the_unspoken_cms_and_printgcdetails

I recommend you tune the GC arguments for your application.
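
As a starting point, CMS tuning for this JVM generation usually revolves around flags like the following (illustrative values only, to be sized against the application's live set):

    # Give CMS headroom and start it early enough to avoid concurrent mode failure
    -Xms2g -Xmx2g
    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=70
    -XX:+UseCMSInitiatingOccupancyOnly
    # Keep GC logging on to verify the effect
    -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps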

Original comment by killme2...@gmail.com on 20 Mar 2013 at 10:38

GoogleCodeExporter commented 8 years ago
Thanks. So in my case it's "normal" (or rather, caused by the GC) that we see 
the timeout exceptions.

Original comment by an...@schild.ws on 22 Mar 2013 at 8:15

GoogleCodeExporter commented 8 years ago
Hi.
I think I'm facing the same timeout problem.
I'm using xmemcached 1.4.1 and Java 6 on a Tomcat 7 app server. My test runs 
throughout the night and performs some get, add, and delete operations against 
memcached. The process performs many I/O operations, but there is only a single 
client user, the CPU consumption is always under 30%, and the maximum memory 
heap (1 GB) is never reached (max < 400 MB). (So full GC alone is not enough 
to explain the issue.)
After several hours of testing, when the final operations begin, the CPU 
increases from 8% to 25%.
In the log file I begin to see some memcached operation timeout exceptions:

java.util.concurrent.TimeoutException: Timed out(5000 milliseconds) waiting for operation while connected to (myIpMemcached:port)
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2617)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:617)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:1030)
    at net.rubyeye.xmemcached.XMemcachedClient.gets(XMemcachedClient.java:1043)
    at net.rubyeye.xmemcached.XMemcachedClient.gets(XMemcachedClient.java:1078)

As you can see, I'm using the default operation timeout of 5000 milliseconds, 
which is already very high compared with the typical turnaround time of 
memcached (<2 ms).
I also monitored the memcached server with a dedicated JVM and a Java client 
that makes periodic get operations. I did not find any errors during this 
monitoring, so I don't think the problem comes from the server; it probably 
comes from the xmemcached library and its behavior in the JVM during this test.

Original comment by t.too...@gmail.com on 18 Nov 2013 at 9:39

AlexSylka commented 8 years ago

Where was this page moved: http://code.google.com/p/memcached/wiki/Timeouts ? 
It's not available right now. We have received similar exceptions:

java.util.concurrent.ExecutionException: net.spy.memcached.internal.CheckedOperationTimeoutException: Operation timed out. - failing node: ***
    at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:180)
    at net.spy.memcached.internal.GetFuture.get(GetFuture.java:69)

The CPU load was critical on the machine that sends requests to memcached; 
memcached itself runs on a separate machine where the CPU load was not critical.