I am sorry, I am on holiday right now.
Are you sure that repcached was running well when this exception was thrown? Or could it have been blocked at that moment?
From your description, I can't give you a reasonable explanation.
Original comment by killme2...@gmail.com
on 6 May 2012 at 2:56
No, repcached is not blocked. If repcached were the problem, I would see outgoing GET requests in the TCP dump but no responses.
Instead, I see neither outgoing GET requests nor responses in the TCP dump.
Also, we replaced xmemcached with spymemcached, and everything works just fine.
The issue is in xmemcached.
Original comment by anton.mi...@gmail.com
on 6 May 2012 at 5:46
When you used spymemcached, did you still set the operation timeout to one second via future.get()?
Original comment by killme2...@gmail.com
on 6 May 2012 at 6:53
You saw neither outgoing GET requests nor responses in the TCP dump, but that doesn't mean repcached had no problem. It is possible that xmemcached could not send requests to repcached for some reason.
Have you checked the JVM heap at that moment?
Original comment by killme2...@gmail.com
on 6 May 2012 at 7:14
What is your xmemcached configuration? I will try to reproduce the issue on my machine. Thanks.
Original comment by killme2...@gmail.com
on 6 May 2012 at 7:32
And did you use noreply operations, or any other methods that you don't use with spymemcached?
Original comment by killme2...@gmail.com
on 6 May 2012 at 7:33
"When you used spymemcached,you set the operation timeout still to be one
second by future.get?"
Yes, timeout is the same 1 second. We have a wrapper component on top of
memcached client, which allows us to choose either xmemcached or spymemcached.
Everything (call flow, configuration, ...) is the same.
"You didn't see in TCP dump neither outgoing GET requests, no responses,but it
didn't mean repcached was no problem.There is a situation that xmemcached could
not send requests to repcached for some reason."
In this case why spymemcached is able to send requests, when xmemcached is not
able?
"I don't know have you checked the JVM heap at that moment?"
There is a gc.log in the attachment above. All stop-the-world pauses are much
less than 1 second.
"And did you use noreply operation,or some difference methods that you don't
use in spymemcached?"
there is no difference from application point of view.
In both xmemcached and spymemcached configurations the application is doing
"GET" and "SET noreply" operations.
Original comment by anton.mi...@gmail.com
on 6 May 2012 at 8:38
What is the meaning of this log line?
"2012-04-27 04:40:34,715 [ER-2] ERROR mem-api.cache :
XMemcachedClientConnector.get() key:DRT:pcef1.Gx;1317897826;24602895,
getBeforeCount:49208437, getAfterCount:49208149, getDeltaCount:-288,
getLatencyMax:126, getLatencyAvg:0.21612556489373336"
Does it mean that get returned an error value?
And are you using the memcached text protocol or the binary protocol?
What are your key and value, for example?
Original comment by killme2...@gmail.com
on 6 May 2012 at 8:50
"What's the meaning of this"
This is the code, it increments counters before and after "get" operation, also
latency is mesaured:
public Object get(String key) {
    try {
        getBeforeCount.incrementAndGet();
        long before = System.currentTimeMillis();
        Object result = client.get(key);
        long latency = System.currentTimeMillis() - before;
        if (getLatencyMax.get() < latency) {
            getLatencyMax.set(latency);
        }
        getLatencySum.addAndGet(latency);
        getAfterCount.incrementAndGet();
        return result;
    } catch (Exception e) {
        long lGetBeforeCount = getBeforeCount.get();
        long lGetAfterCount = getAfterCount.get();
        log.error("XMemcachedClientConnector.get() key:" + key + ", getBeforeCount:" + lGetBeforeCount
                + ", getAfterCount:" + lGetAfterCount + ", getDeltaCount:" + (lGetAfterCount - lGetBeforeCount)
                + ", getLatencyMax:" + getLatencyMax.get()
                + ", getLatencyAvg:" + ((double) getLatencySum.get() / getAfterCount.get()));
        throw new RuntimeException(e);
    }
}
"And are u using memcached text protocol or binary?"
Text procotol, but the issue reproduces also with binary protocol.
"What's your key and value for example?"
key: DRT:pcef1.Gx;1317897826;5360
value: 1%server;18050538;2187704472|2%server;18050538;2187704473|
Original comment by anton.mi...@gmail.com
on 6 May 2012 at 11:08
Do you have any periodic tasks? If so, were these tasks executing when the timeout exception happened?
Original comment by killme2...@gmail.com
on 6 May 2012 at 11:26
No periodic tasks.
Original comment by anton.mi...@gmail.com
on 6 May 2012 at 1:00
As far as I know, spymemcached doesn't have a noreply operation API. How do you use it?
Original comment by killme2...@gmail.com
on 6 May 2012 at 1:15
I found many "omf.overload-manager : System OverloadDetected event:" logs in system.log.
Does that mean your system was overloaded? What was the TPS of your set-noreply operations at that moment?
Xmemcached's noreply operations just put set commands into a queue in the connection; if there are many set-noreply requests at one time, the queue may become full and get operations cannot be sent to memcached.
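As an illustration only (this is not xmemcached's actual internal code), a bounded queue applies back-pressure to the producer instead of letting fire-and-forget commands pile up ahead of other traffic:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedCommandQueue {
    // Hypothetical capacity; an unbounded queue would grow without limit instead.
    private final BlockingQueue<byte[]> pending = new ArrayBlockingQueue<byte[]>(10000);

    // offer() returns false when the queue is full, so the producer is slowed down
    // rather than starving the operations that do wait for a reply.
    public boolean enqueue(byte[] encodedCommand) {
        return pending.offer(encodedCommand);
    }
}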
Original comment by killme2...@gmail.com
on 6 May 2012 at 1:29
"As i know that spymemcached doesn't have noreply operation API,how do you use
it?"
There is a method
net.spy.memcached.MemcachedClient: public Future<Boolean> set(String key, int
exp, Object o)
We call it and don't use "Future" object, so it is almost like "noreply"
operation.
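For illustration, a minimal sketch of this fire-and-forget usage (the wrapper class name and the zero expiry are assumptions, not our actual code):

import net.spy.memcached.MemcachedClient;

public class NoReplyLikeSet {
    public static void set(MemcachedClient client, String key, Object value) {
        // The returned Future<Boolean> is deliberately ignored, so the caller
        // does not block waiting for the server's acknowledgement.
        client.set(key, 0, value);
    }
}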
"I found that many " omf.overload-manager : System OverloadDetected event:"
logs in system.log.
It that means your system is over load?"
It means that incoming queue size for working thread pool reached 5000
threshold. And that happended after first TimeoutExceptions, that happened
because of TimeoutExceptions (and not vice versa), because xmemcached was
holding working threads for 1 second.
"What's the TPS of your set-noreply operations at that moment?"
it was 200 TPS, the same as TPS of "get" operations.
Original comment by anton.mi...@gmail.com
on 6 May 2012 at 6:15
"What's your xmemcached configuration,i will try to reproduce it on my
machine.Thanks."
//singleAdresList corresponds to "127.0.0.1:8091"
XMemcachedClientBuilder builder = new XMemcachedClientBuilder(singleAddressList);
builder.setSocketOption(StandardSocketOption.TCP_NODELAY, true);
builder.setConnectionPoolSize(8);
builder.setFailureMode(true);
MemcachedClient client = builder.build();
Original comment by anton.mi...@gmail.com
on 7 May 2012 at 12:58
Thank you.
I will try my best to reproduce this issue.
If you can, please use set instead of setWithNoReply in your application and run it again to check whether the issue still happens.
I think the setWithNoReply method may be the root cause.
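A minimal sketch of the suggested change (the wrapper class, key, value and zero expiry are placeholders, not your actual code):

import java.util.concurrent.TimeoutException;
import net.rubyeye.xmemcached.MemcachedClient;
import net.rubyeye.xmemcached.exception.MemcachedException;

public class SetInsteadOfNoReply {
    // Synchronous set: blocks until the server acknowledges (or the operation
    // times out), instead of queueing the command and returning immediately.
    public static void store(MemcachedClient client, String key, String value)
            throws TimeoutException, InterruptedException, MemcachedException {
        // Previously: client.setWithNoReply(key, 0, value);
        boolean ok = client.set(key, 0, value);
        if (!ok) {
            throw new IllegalStateException("set failed for key " + key);
        }
    }
}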
Original comment by killme2...@gmail.com
on 7 May 2012 at 1:11
Hi, it seems that this issue is caused by too many noreply operations. Because a noreply operation doesn't wait for a response from the server, it is just put into a queue in application memory and waits to be sent to memcached.
If there are too many noreply operations in the queue, they will block the other operations and make their response time too large.
In xmemcached 1.3.8 I've added a flow control mechanism for noreply operations: use the setMaxQueuedNoReplyOperations method on the builder or factory bean to set the maximum number of queued noreply operations in memory. The default value is based on your JVM's maximum memory (the -Xmx option).
If you still want to try xmemcached, please try 1.3.8. Thanks.
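A minimal sketch of how this might be configured (the capacity value and the address are placeholders; the method's exact signature should be checked against the 1.3.8 API docs):

import net.rubyeye.xmemcached.MemcachedClient;
import net.rubyeye.xmemcached.XMemcachedClientBuilder;
import net.rubyeye.xmemcached.utils.AddrUtil;

public class NoReplyFlowControlConfig {
    public static MemcachedClient build() throws Exception {
        XMemcachedClientBuilder builder =
                new XMemcachedClientBuilder(AddrUtil.getAddresses("127.0.0.1:8091"));
        // Cap the number of noreply operations that may wait in memory;
        // beyond this limit, noreply writers are throttled instead of piling up.
        builder.setMaxQueuedNoReplyOperations(10000);
        return builder.build();
    }
}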
Original comment by killme2...@gmail.com
on 17 Jul 2012 at 5:33
We worked around this issue by replacing "setWithNoReply" with "set".
"Hi, it seems that this issue is caused by too many noreply operations. Because a noreply operation doesn't wait for a response from the server, it is just put into a queue in application memory and waits to be sent to memcached.
If there are too many noreply operations in the queue, they will block the other operations and make their response time too large."
That is really strange, because in my use case there is the same number of "get" and "setWithNoReply" operations. A "get" operation waits for the response, while a "setWithNoReply" operation doesn't, so "setWithNoReply" is expected to be faster.
Instead we see that "setWithNoReply" operations get stuck while "get" operations don't.
"In xmemcached 1.3.8 I've added a flow control mechanism for noreply operations: use the setMaxQueuedNoReplyOperations method on the builder or factory bean to set the maximum number of queued noreply operations in memory. The default value is based on your JVM's maximum memory (the -Xmx option)."
I think the root cause of the issue is a concurrency bug in xmemcached, and the flow control mechanism doesn't fix that bug.
Original comment by anton.mi...@gmail.com
on 3 Aug 2012 at 8:05
We see this problem too, after a few hours of operation.
We used 1.3.8 and upgraded to 1.4.1 yesterday, but the problem still exists.
java.util.concurrent.TimeoutException: Timed out(5000) waiting for operation
at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2565)
at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2016)
at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2029)
We use this to create the connection(s):
MemcachedClientBuilder builder = new XMemcachedClientBuilder(AddrUtil.getAddresses(cacheConnection));
builder.setCommandFactory(new BinaryCommandFactory()); // use binary protocol
_memcachedClient = builder.build();
and then use it this way:
if (_hasGetAndTouch)
{
    // Unfortunately getAndTouch is not implemented on the server side
    retVal = _mClient.getAndTouch(finalKey, MAX_CACHE_LIFETIME);
}
else
{
    retVal = _mClient.get(finalKey);
}
So as you can see, we have the problem on the getAndTouch command, not on the set command.
Could it be that the connection is idle for too long and is therefore half-closed in some way by either the server or the client?
The errors start to appear ~18 hours after we first make contact with the cache.
The connections we have established are held open for reuse.
Original comment by an...@schild.ws
on 15 Mar 2013 at 7:30
Here is a URL with more details on timeouts in memcached:
http://code.google.com/p/memcached/wiki/Timeouts
Original comment by an...@schild.ws
on 15 Mar 2013 at 9:34
@andre
Have you tried all the solutions in this document: http://code.google.com/p/memcached/wiki/Timeouts ?
If so, maybe there is an issue in xmemcached that I have to find.
If the connection is idle, xmemcached will send a 'version' command to the server to make sure the connection is alive. If heartbeating fails, it will close the connection and try to heal it.
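For reference, a minimal sketch of toggling that heartbeat on the client (the wrapper class is an assumption; please check that setEnableHeartBeat is available in your xmemcached version):

import net.rubyeye.xmemcached.MemcachedClient;

public class HeartbeatSettings {
    // Keep the idle-connection heartbeat (the 'version' command) enabled so that
    // dead connections are detected and healed rather than left half-closed.
    public static void keepHeartbeatEnabled(MemcachedClient client) {
        client.setEnableHeartBeat(true);
    }
}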
Original comment by killme2...@gmail.com
on 15 Mar 2013 at 10:09
Yes, but with no obvious errors found:
- listen_disabled_num is not a problem
- the servers are all 64-bit, are not using swap, and have low load
- dropping packets is not an issue either, since on one of the test servers we have no (local) firewall active
- TIME_WAIT is also not a problem
Original comment by an...@schild.ws
on 15 Mar 2013 at 10:50
@andre
Did you disable heartbeating?
And can you find any text like "heartbeat fail" in your log?
Finally, please turn on your Java GC log to check whether a full GC was running when the timeout happened.
Original comment by killme2...@gmail.com
on 15 Mar 2013 at 11:24
No, I neither enabled nor disabled heartbeating.
No, I don't see any logs containing "heartbeat".
Here is the error log from Tomcat:
2013-03-20 03:24:28,837 [webseite2] [Thread-412] ERROR
com.aarboard.aarcatcore.cache.xmemcached.XMemcacheCache - Error in cache for
key: all
java.util.concurrent.TimeoutException: Timed out(5000) waiting for operation
at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2565)
at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2016)
at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2029)
at com.aarboard.aarcatcore.cache.xmemcached.XMemcacheCache.getEntry(XMemcacheCache.java:73)
at
And the corresponding GC logs
2013-03-20T03:24:00.628+0100: 69631.200: [GC 69631.200: [ParNew:
26240K->674K(29504K), 0.0366990 secs] 1563479K->1537913K(1569600K) icms_dc=100
, 0.0373240 secs
2013-03-20T03:24:00.907+0100: 69631.478: [GC 69631.479: [ParNew:
26914K->442K(29504K), 0.0321190 secs] 1564153K->1537681K(1569600K) icms_dc=100
, 0.0325630 secs
2013-03-20T03:24:01.134+0100: 69631.705: [GC 69631.706: [ParNew:
26682K->605K(29504K), 0.0299910 secs] 1563921K->1537844K(1569600K) icms_dc=100
, 0.0304360 secs
2013-03-20T03:24:01.345+0100: 69631.917: [GC 69631.917: [ParNew:
26845K->794K(29504K), 0.0300840 secs] 1564084K->1538034K(1569600K) icms_dc=100
, 0.0305380 secs
2013-03-20T03:24:01.565+0100: 69632.137: [GC 69632.137: [ParNew:
27034K->743K(29504K), 0.0355330 secs] 1564274K->1538187K(1569600K) icms_dc=100
, 0.0361180 secs
2013-03-20T03:24:02.230+0100: 69632.802: [GC 69632.802: [ParNew:
26983K->713K(29504K), 0.0311850 secs] 1564427K->1538274K(1569600K) icms_dc=100
, 0.0316440 secs
2013-03-20T03:24:03.049+0100: 69633.621: [GC 69633.621: [ParNew:
26953K->937K(29504K), 0.0320370 secs] 1564514K->1538604K(1569600K) icms_dc=100
, 0.0325100 secs
2013-03-20T03:24:03.857+0100: 69634.429: [GC 69634.429: [ParNew:
27177K->782K(29504K), 0.0304030 secs] 1564844K->1538547K(1569600K) icms_dc=100
, 0.0308760 secs
2013-03-20T03:24:04.719+0100: 69635.290: [GC 69635.301: [ParNew:
27022K->1058K(29504K), 0.0350630 secs] 1564787K->1538913K(1569600K) icms_dc=100
, 0.0356500 sec
2013-03-20T03:24:05.602+0100: 69636.173: [GC 69636.174: [ParNew:
27298K->612K(29504K), 0.0350860 secs] 1565153K->1538577K(1569600K) icms_dc=100
, 0.0355900 secs
2013-03-20T03:24:06.443+0100: 69637.015: [GC 69637.015: [ParNew:
26852K->806K(29504K), 0.0329730 secs] 1564817K->1538881K(1569600K) icms_dc=100
, 0.0334430 secs
2013-03-20T03:24:07.279+0100: 69637.851: [GC 69637.851: [ParNew:
27046K->940K(29504K), 0.0326340 secs] 1565121K->1539122K(1569600K) icms_dc=100
, 0.0331870 secs
2013-03-20T03:24:08.102+0100: 69638.673: [GC 69638.674: [ParNew:
27180K->633K(29504K), 0.0307200 secs] 1565362K->1538902K(1569600K) icms_dc=100
, 0.0313650 secs
2013-03-20T03:24:08.951+0100: 69639.522: [GC 69639.523: [ParNew:
26873K->700K(29504K), 0.0307750 secs] 1565142K->1539057K(1569600K) icms_dc=100
, 0.0312460 secs
2013-03-20T03:24:09.197+0100: 69639.768: [CMS-concurrent-mark: 8.422/8.953
secs] [Times: user=12.14 sys=0.86, real=8.95 secs]
2013-03-20T03:24:09.197+0100: 69639.769: [CMS-concurrent-preclean-start]
2013-03-20T03:24:09.209+0100: 69639.781: [CMS-concurrent-preclean: 0.011/0.012
secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2013-03-20T03:24:09.210+0100: 69639.782:
[CMS-concurrent-abortable-preclean-start]
2013-03-20T03:24:09.803+0100: 69640.375: [GC 69640.375: [ParNew:
26940K->884K(29504K), 0.0306980 secs] 1565297K->1539325K(1569600K) icms_dc=100
, 0.0413230 secs
2013-03-20T03:24:10.297+0100: 69640.868: [CMS-concurrent-abortable-preclean:
0.097/1.086 secs] [Times: user=0.50 sys=0.08, real=1.09 secs]
2013-03-20T03:24:10.298+0100: 69640.870: [GC[YG occupancy: 15521 K (29504
K)]69640.870: [Rescan (parallel) , 0.0418540 secs]69640.912: [weak refs
processing, 0.
2013-03-20T03:24:10.342+0100: 69640.913: [CMS-concurrent-sweep-start]
2013-03-20T03:24:10.720+0100: 69641.292: [GC 69641.292: [ParNew:
27124K->1554K(29504K), 0.0325720 secs] 1565564K->1540080K(1569600K) icms_dc=100
, 0.0330720 sec
2013-03-20T03:24:11.148+0100: 69641.720: [GC 69641.720: [ParNew (promotion
failed): 27794K->29504K(29504K), 0.0895560 secs]69641.810:
[CMS2013-03-20T03:24:12.98
(concurrent mode failure): 1539703K->1539063K(1540096K), 14.0319360 secs] 1566320K->1539063K(1569600K), [CMS Perm : 141329K->141254K(266856K)] icms_dc=100 , 14
2013-03-20T03:24:25.344+0100: 69655.916: [GC [1 CMS-initial-mark:
1539063K(1540096K)] 1542126K(1569600K), 0.0301270 secs] [Times: user=0.03
sys=0.00, real=0.03
2013-03-20T03:24:25.377+0100: 69655.948: [CMS-concurrent-mark-start]
2013-03-20T03:24:30.158+0100: 69660.730: [GC 69660.730: [ParNew:
26240K->26240K(29504K), 0.0000450 secs]69660.730:
[CMS2013-03-20T03:24:33.221+0100: 69663.792:
(concurrent mode failure): 1539063K->1538864K(1540096K), 15.2470910 secs] 1565303K->1538864K(1569600K), [CMS Perm : 141299K->141296K(266856K)] icms_dc=100 , 15
2013-03-20T03:24:45.436+0100: 69676.007: [GC [1 CMS-initial-mark:
1538864K(1540096K)] 1539159K(1569600K), 0.0233200 secs] [Times: user=0.03
sys=0.00, real=0.03
2013-03-20T03:24:45.460+0100: 69676.032: [CMS-concurrent-mark-start]
2013-03-20T03:24:45.868+0100: 69676.440: [GC 69676.440: [ParNew:
26240K->2725K(29504K), 0.0297190 secs] 1565104K->1541589K(1569600K) icms_dc=100
, 0.0301970 sec
2013-03-20T03:24:46.153+0100: 69676.725: [GC 69676.725: [ParNew:
28965K->709K(29504K), 0.0317680 secs] 1567829K->1539573K(1569600K) icms_dc=100
, 0.0322110 secs
2013-03-20T03:24:46.427+0100: 69676.999: [GC 69676.999: [ParNew:
26949K->486K(29504K), 0.0298070 secs] 1565813K->1539351K(1569600K) icms_dc=100
, 0.0302500 secs
2013-03-20T03:24:46.693+0100: 69677.264: [GC 69677.265: [ParNew:
26726K->682K(29504K), 0.0302160 secs] 1565591K->1539547K(1569600K) icms_dc=100
, 0.0306790 secs
2013-03-20T03:24:46.934+0100: 69677.506: [GC 69677.506: [ParNew:
26922K->648K(29504K), 0.0307740 secs] 1565787K->1539645K(1569600K) icms_dc=100
, 0.0312550 secs
I noticed that the first "(concurrent mode failure)" shows up at 2013-03-20T03:16:59.060+0100, and then it shows up more and more often, until it appears on every GC run.
http://pbrd.co/YqjeeJ
Unfortunately I'm not sure what this means... :(
Original comment by an...@schild.ws
on 20 Mar 2013 at 7:35
Hi, from your GC log it seems that your application's full GC pauses are too long:
[CMS-concurrent-mark: 8.422/8.953 secs]
The mark phase took about 8 seconds and stopped all application threads (stop-the-world), so at that point there would be many timeout exceptions in xmemcached.
And "concurrent mode failure" means your JVM arguments have some problems: when it happens, the JVM cannot collect in CMS mode and falls back to a stop-the-world collection. For more information see this blog:
https://blogs.oracle.com/jonthecollector/entry/the_unspoken_cms_and_printgcdetails
I recommend you tune the GC arguments for your application.
Original comment by killme2...@gmail.com
on 20 Mar 2013 at 10:38
Thanks,
so in my case it's "normal" (or rather, caused by the GC working) that we see the timeout exceptions.
Original comment by an...@schild.ws
on 22 Mar 2013 at 8:15
Hi.
I think I'm facing the same timeout problem.
I'm using xmemcached 1.4.1 and Java 6 on a Tomcat 7 app server. My test runs throughout the night and performs some get, add, and delete operations against memcached. The process makes many I/O operations, but there is only a single client user, CPU consumption is always under 30%, and the maximum heap (1 GB) is never reached (maximum observed < 400 MB). (So full GC collections are not enough to explain the issue.)
After several hours of testing, when the final operations begin, the CPU increases from 8% to 25%.
In the log file I begin to see some memcached operation timeout exceptions:
java.util.concurrent.TimeoutException: Timed out(5000 milliseconds) waiting for operation while connected to (myIpMemcached:portn°)
at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2617)
at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:617)
at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:1030)
at net.rubyeye.xmemcached.XMemcachedClient.gets(XMemcachedClient.java:1043)
at net.rubyeye.xmemcached.XMemcachedClient.gets(XMemcachedClient.java:1078
As you can see, I'm using the default operation timeout of 5000 milliseconds, which is already very high compared with the typical turnaround time of memcached (< 2 ms).
I also monitored the memcached server with a dedicated JVM and a Java client that performs periodic get operations, and I did not find any errors during this monitoring, so I don't think the problem comes from the server; it probably comes from the xmemcached library and its behavior in the JVM during this test.
Original comment by t.too...@gmail.com
on 18 Nov 2013 at 9:39
Original issue reported on code.google.com by
anton.mi...@gmail.com
on 28 Apr 2012 at 3:32
Attachments: