bigdata4u / spymemcached

Automatically exported from code.google.com/p/spymemcached

OperationTimeout exception when throughput is high #128

Closed. GoogleCodeExporter closed this 8 years ago

GoogleCodeExporter commented 8 years ago
What version of the product are you using? On what operating system?

I'm using the latest 2.4.2 build on Linux. It seems that when the write/read 
speed is very high, the client keeps reporting OperationTimeoutException and 
freezes the server. So how can I configure the number of threads for async 
get/put? And how can I make get/put synchronous?

Thanks.

-Weijun

Original issue reported on code.google.com by weiju...@gmail.com on 1 Apr 2010 at 7:22

GoogleCodeExporter commented 8 years ago
There's only one thread doing this work.  get always behaves synchronously 
from the point of view of the client.  set returns a Future which you can 
synchronize on.

Original comment by dsalli...@gmail.com on 1 Apr 2010 at 7:30

GoogleCodeExporter commented 8 years ago
Same issue for me; it's bringing down the server.

Original comment by vishnude...@gmail.com on 9 May 2010 at 9:58

GoogleCodeExporter commented 8 years ago
<<So how can I configure the number of threads for async get/put?

The only way to achieve this is to build a pool of MemcachedClient objects 
(there is one IO thread per client). But we've had the same problem since we 
transitioned from the whalin client to spy. Unfortunately (and surprisingly), 
the pool strategy did not solve it for us; we are still getting a similar 
number of timeouts no matter how many clients we add to the pool.
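
For reference, a minimal sketch of that pool strategy, assuming a single 
memcached node; the class name and the round-robin selection are 
illustrative, not the exact code we ran:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import net.spy.memcached.MemcachedClient;

// Each MemcachedClient owns one IO thread, so a pool of N clients
// gives N IO threads; callers pick a client round-robin.
public class SpyClientPool {
    private final List<MemcachedClient> clients = new ArrayList<MemcachedClient>();
    private final AtomicInteger next = new AtomicInteger();

    public SpyClientPool(int size, InetSocketAddress addr) throws IOException {
        for (int i = 0; i < size; i++) {
            clients.add(new MemcachedClient(addr));
        }
    }

    public MemcachedClient get() {
        return clients.get(Math.abs(next.getAndIncrement() % clients.size()));
    }

    public void shutdown() {
        for (MemcachedClient c : clients) {
            c.shutdown();
        }
    }
}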

Original comment by boris.pa...@gmail.com on 24 May 2010 at 12:55

GoogleCodeExporter commented 8 years ago
It seems like we could be hitting a similar issue on our prod linux 
environment, so I've kind of replicated it on my local machine:

- Install memcached:
https://wincent.com/wiki/Installing_memcached_1.4.4_on_Mac_OS_X_10.6.2_Snow_Leopard

- Simple webapp on tomcat that writes and gets from memcached

- Apache Benchmark concurrent load
(60 concurrent writes, 65 concurrent reads):

ab -r -n 10000 -c 60 http://127.0.0.1:45000/memcached/set/dom/kkk > mset.txt 2>&1 &
ab -r -n 10000 -c 65 http://127.0.0.1:45000/memcached/get/dom > mget.txt 2>&1 &

With 2.4.2 I see lots of exceptions during the first run of the above.
On subsequent runs I don't see the exceptions:

[2010.05.28 22:01:17] SpyMemcachedCache -  memcached timed out for
expression bbc-forge-domt-dom Timed out waiting for operation -
failing node: localhost/127.0.0.1:11212
2010-05-28 22:01:17.333 INFO
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl:  Removing
cancelled operation:
net.spy.memcached.protocol.ascii.GetOperationImpl@1bdb52c8
2010-05-28 22:01:17.334 INFO
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl:  Removing
cancelled operation:

With the latest 2.5, I see lots of exceptions during the first run of the 
above, and the client disconnects and will not reconnect:

2010-05-28 22:16:16.934 WARN net.spy.memcached.MemcachedConnection:
Closing, and reopening {QA sa=localhost/127.0.0.1:11212, #Rops=0,
#Wops=27760, #iq=0, topRop=null,
topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@598a15ca,
toWrite=0, interested=8}, attempt 11.
2010-05-28 22:16:46.937 INFO net.spy.memcached.MemcachedConnection:
Reconnecting {QA sa=localhost/127.0.0.1:11212, #Rops=0, #Wops=27760,
#iq=0, topRop=null,

2010-05-28 22:17:16.940 WARN net.spy.memcached.MemcachedConnection:
Closing, and reopening {QA sa=localhost/127.0.0.1:11212, #Rops=0,
#Wops=27760, #iq=0, topRop=null,
topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@598a15ca,
toWrite=0, interested=8}, attempt 13.

Sanity check after the load test that memcached is up and running, but the 
client isn't:

- curl the webapp
$ curl http://localhost:45000/memcached/get/dom/kkk
Key: bbc-forge-domt-dom==does not exist

- telnet memcached
telnet localhost 11212

Trying ::1...
Connected to localhost.
Escape character is '^]'.
get bbc-forge-domt-dom
VALUE bbc-forge-domt-dom 0 5
dom

END
quit

Currently it's looking like the 2.3.1 client does not have this concurrency 
issue. It also only seems to occur when there are concurrent SETs and GETs. 
Going to try to do a bit more testing this weekend.

/dom

Original comment by dominic....@gmail.com on 29 May 2010 at 9:52

GoogleCodeExporter commented 8 years ago
I re-ran my apachebench tests on a patched version of 2.4.2

- got the source from GitHub:
git clone git://github.com/dustin/java-memcached-client.git

- created a local branch from 2.4.2 tag:

git branch my2.4.2 2.4.2
git checkout my2.4.2

- reverted the "Handle operations that are writing and reading at the same 
time" commit:
http://github.com/dustin/java-memcached-client/commit/32762f8b7908d91de10fb74d905398818b1552e7

-- that is, reverted net.spy.memcached.protocol.TCPMemcachedNodeImpl from:
http://github.com/dustin/java-memcached-client/raw/32762f8b7908d91de10fb74d905398818b1552e7/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
-- back to the previous (2.4.1) version of 
src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java here:
http://github.com/dustin/java-memcached-client/raw/2.4.1/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java

- compiled a local jar (using the pom.xml here:
http://code.google.com/p/spymemcached/wiki/HowtoBuild)

- re-ran apachebench, and the tests look good. I see only a couple of 
exceptions, nothing like the previous number.

I am yet to run load testing on our production linux environment with this 
patched version. I will hopefully be able to do that early next week (Monday 
or Tuesday) perhaps. I'm going to attach the patched version I've made to 
this comment; maybe it helps you guys. Worth a shot I suppose (if you get a 
chance to try it, let me know how you get on). I can't tell you 100% if this 
addresses the issues you are having. All the unit tests in 2.4.2 passed with 
this patched version when I built it (btw).

I've not looked at any patch to 2.5, as I'm not sure the patch I applied to 
2.4.2 will be enough to fix the client-not-reconnecting issue in 2.5 (if I 
find time I'll have a look... not sure though).

hope it helps you.
/dom

Original comment by dominic....@gmail.com on 30 May 2010 at 12:21

Attachments:

GoogleCodeExporter commented 8 years ago
Dominic,
Have you had a chance to load test the patched 2.4.2 build? We are running 
2.4.2 in a production env and occasionally running into the timeout exception. 
We have a high-throughput service. We have 3 app servers in the cluster, and 
only one at any given time gives this timeout exception. We have never had 
all the servers go into the same state.
Venu

Original comment by venu.a...@gmail.com on 9 Aug 2010 at 10:30

GoogleCodeExporter commented 8 years ago
Boris.partensky,
Curious why you moved from whalin to spy. We are considering switching from 
spy to whalin, as we occasionally get a TimeoutException which we were not 
able to recreate in a test env, nor were we able to pin down the cause. One 
thing that is consistent about this bug is that the client releases itself 
from this bottleneck after 15 minutes. This is always the case.

Original comment by venu.a...@gmail.com on 9 Aug 2010 at 10:33

GoogleCodeExporter commented 8 years ago
venu.alla, sorry, I just read your question. Whalin had its own set of 
scalability problems for us, which made me look into spy to begin with. 
Besides, it is no longer maintained and is way behind the server. I wanted to 
use features like CAS. We also use async sets/deletes a lot, which would have 
required additional coding on our part had we continued with whalin.
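
As an illustration, a minimal sketch of the CAS and async-delete usage I 
mean, assuming a local node (the key names are placeholders):

import java.net.InetSocketAddress;
import java.util.concurrent.Future;
import net.spy.memcached.CASResponse;
import net.spy.memcached.CASValue;
import net.spy.memcached.MemcachedClient;

public class CasAndAsyncDelete {
    public static void main(String[] args) throws Exception {
        MemcachedClient c = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // CAS: read the value together with its version, then
        // conditionally replace it.
        CASValue<Object> casv = c.gets("counter");
        if (casv != null) {
            CASResponse r = c.cas("counter", casv.getCas(), "updated");
            // r is OK if nobody changed the key in between, EXISTS otherwise.
        }

        // Async delete: returns a Future; wait on it only if you need to.
        Future<Boolean> deleted = c.delete("staleKey");

        c.shutdown();
    }
}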

Original comment by boris.pa...@gmail.com on 20 Sep 2010 at 1:26

GoogleCodeExporter commented 8 years ago
We are seeing this problem on a regular basis. Whenever the get load is 
increased above a certain threshold, we start seeing too many timeouts.

I ran it with trace logging on and noticed that it adds operations to the 
send queue for some time before actually sending them to the server. I could 
not find a configuration setting to change that behavior. This works like a 
multi-threaded client even though it is single-threaded. Due to this, some of 
the requests could be timing out while they wait for others to fill the batch.

Original comment by sanjeev....@adara.com on 8 Jul 2011 at 11:10

GoogleCodeExporter commented 8 years ago
After hunting this down to the root cause, our team has found it to be likely 
caused by the issue noted here:
http://code.google.com/p/spymemcached/issues/detail?id=186

Original comment by hypefi...@gmail.com on 19 Jul 2011 at 12:14

GoogleCodeExporter commented 8 years ago
I have started work on Issue 186, which is claimed to be the cause of this issue.

Original comment by mikewie...@gmail.com on 24 Aug 2011 at 12:13

GoogleCodeExporter commented 8 years ago
Issue 186 sounds like it should only affect FailureMode.Redistribute, but I am 
still observing this when using FailureMode.Retry.
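
For context, the failure mode is set on the connection factory; a minimal 
sketch of the Retry configuration I'm describing, assuming a single local node:

import java.net.InetSocketAddress;
import java.util.Collections;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.FailureMode;
import net.spy.memcached.MemcachedClient;

public class RetryModeClient {
    public static void main(String[] args) throws Exception {
        MemcachedClient c = new MemcachedClient(
            new ConnectionFactoryBuilder()
                // Retry keeps operations queued against the failing node
                // instead of redistributing them to other nodes.
                .setFailureMode(FailureMode.Retry)
                .build(),
            Collections.singletonList(new InetSocketAddress("localhost", 11211)));

        c.shutdown();
    }
}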

Original comment by mewmewb...@gmail.com on 25 Aug 2011 at 8:03

GoogleCodeExporter commented 8 years ago
I tried to step through as much code as possible, using my IDE to clear the 
full input queue so that I could get past node.addOp. It seems that at least 
my problem is that the selector is stuck in a state where it's no longer able 
to wake up. (The KQueueSelectorImpl has its interruptTriggered flag set to 
true, so no wakeup call can trigger any change anymore.) Not sure how much 
help this is, but I thought I'd share.

Original comment by mewmewb...@gmail.com on 26 Aug 2011 at 6:19

GoogleCodeExporter commented 8 years ago
We have suffered many timeouts in a heavily loaded environment (500+ threads). 
After much debugging, we discovered by analysing the tcpdump that the problem 
was an old version of memcached installed by Ubuntu Server.

Original comment by sissobr on 21 Sep 2011 at 5:42

GoogleCodeExporter commented 8 years ago
Can you tell us which version of memcached you had? We're on 1.4.5.

Original comment by mewmewb...@gmail.com on 21 Sep 2011 at 7:23

GoogleCodeExporter commented 8 years ago
Like mewmewb...@gmail.com, we are also observing timeouts under heavy load 
while using FailureMode.Retry (with membase and the vbucket locator).

Original comment by delf...@gmail.com on 21 Sep 2011 at 7:39

GoogleCodeExporter commented 8 years ago
We were using memcached 1.4.2 as provided by Ubuntu Lucid. After migrating 
Ubuntu to Natty, the problem was solved. It could be a problem in another 
library too.

Original comment by sissobr on 24 Sep 2011 at 6:38

GoogleCodeExporter commented 8 years ago
Can someone verify that this issue is still present in Spymemcached 1.7.1? I 
haven't been able to reproduce this issue yet. On another note, the fix for 
Spy 186 has been checked into the latest code branch. Since some of you don't 
think that this issue is related to 186, I will leave this issue open for now.

Original comment by mikewie...@gmail.com on 5 Oct 2011 at 2:30

GoogleCodeExporter commented 8 years ago
Yes, we are having the issue with Spymemcached 1.7.1

Original comment by delf...@gmail.com on 5 Oct 2011 at 2:55

GoogleCodeExporter commented 8 years ago
You may want to try some JVM tuning or multiple clients.  If you find one CPU 
core fully used (meaning constantly at 100% CPU; on Linux, check this with 
mpstat -P ALL), the use of multiple client objects may help.

Also, note this tuning has been helpful for me in the past:
http://www.couchbase.org/wiki/display/membase/Couchbase+Java+Client+Library

Original comment by ingen...@gmail.com on 14 Oct 2011 at 8:14


GoogleCodeExporter commented 8 years ago
We've been experiencing this issue at peak times, which caused several 
downtimes at a specific Facebook social game. We're (were) using 2.7.3. Sadly 
we have had to switch to xmemcached, which has proven to cope with load quite 
well. I'll be glad to try a patched/fixed version in our load test environment.

Original comment by pirat...@gmail.com on 20 Jan 2012 at 4:26

GoogleCodeExporter commented 8 years ago
Thanks for the update. How did the timeout cause the downtime? Was there a 
large batch of them or something?

Any thoughts on a test for this? We did invest significant time in trying to 
make this better, but at some level or another, when operations take too long 
for any reason, it's still correct to let the app know.

What was your timeout value set to, @piratiss?

Original comment by ingen...@gmail.com on 20 Jan 2012 at 4:37

GoogleCodeExporter commented 8 years ago
Hi, sorry for this very late reply :)
We were using a timeout of 1s, so the game couldn't get critical user 
sessions from memcached, and that caused some downtime.
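
For anyone tuning this: the operation timeout can be raised on the connection 
factory; a minimal sketch, assuming a single local node (the 2000 ms value is 
just an example):

import java.net.InetSocketAddress;
import java.util.Collections;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.MemcachedClient;

public class TimeoutTunedClient {
    public static void main(String[] args) throws Exception {
        MemcachedClient c = new MemcachedClient(
            new ConnectionFactoryBuilder()
                // Operation timeout in milliseconds (we had been using 1000).
                .setOpTimeout(2000)
                .build(),
            Collections.singletonList(new InetSocketAddress("localhost", 11211)));

        c.shutdown();
    }
}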
We're planning to move from our cloud provider, and now that I have a proper 
load testing env, I will try with the latest version (2.8.1) and let you know.
Thanks!

Original comment by pirat...@gmail.com on 7 May 2012 at 10:10

GoogleCodeExporter commented 8 years ago
Quick update: the load test env is stable with spymemcached (2s timeout) and 
performs way better than xmemcached (~50%+ more req/sec)... So I guess we'll 
switch back to spymemcached... Kudos for the great work!

Original comment by pirat...@gmail.com on 9 May 2012 at 1:45