Closed by GoogleCodeExporter 8 years ago
There's only one thread doing this work. get always behaves synchronously from the client's point of view. set returns a Future which you can synchronize on.
Original comment by dsalli...@gmail.com
on 1 Apr 2010 at 7:30
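The comment above describes the client's threading model: one IO thread, with set handing back a Future the caller may block on. A minimal sketch of that pattern in plain java.util.concurrent (class and method names here are illustrative, not the spymemcached API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch of the pattern described above: a single IO thread services
// async operations, "set" returns a Future, and the caller chooses
// whether (and how long) to block on it.
public class AsyncSetSketch {
    private final ExecutorService ioThread = Executors.newSingleThreadExecutor();

    Future<Boolean> set(String key, String value) {
        // stand-in for queueing the write on the client's single IO thread
        return ioThread.submit(() -> true);
    }

    // convenience wrapper: block with a bounded wait, false on any failure
    boolean setSync(String key, String value) {
        try {
            return set(key, value).get(1, TimeUnit.SECONDS);
        } catch (Exception e) {
            return false;
        }
    }

    void shutdown() {
        ioThread.shutdown();
    }

    public static void main(String[] args) {
        AsyncSetSketch client = new AsyncSetSketch();
        System.out.println("set succeeded: " + client.setSync("k", "v"));
        client.shutdown();
    }
}
```

With the real client the same shape applies: hold on to the Future from set and call get with a timeout rather than blocking indefinitely.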
same issue for me, it's bringing down the server
Original comment by vishnude...@gmail.com
on 9 May 2010 at 9:58
<<So how can I configure the number of threads for async get/put>>
The only way to achieve this is to build a pool of memcached client objects (there is one IO thread per client). But we've had the same problem since we transitioned from the whalin client to spy. Unfortunately (and surprisingly), the pool strategy did not solve it for us; we are still getting a similar number of timeouts, no matter how many clients we add to the pool.
Original comment by boris.pa...@gmail.com
on 24 May 2010 at 12:55
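The pool-of-clients workaround mentioned above can be sketched generically: a round-robin dispenser over N independent client objects, each with its own IO thread. This is stdlib-only and type-generic (it does not depend on the spymemcached jar; substitute MemcachedClient for T in practice):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Round-robin pool: spreads operations across N client objects so that
// no single IO thread becomes the bottleneck.
public class ClientPool<T> {
    private final List<T> clients;
    private final AtomicLong counter = new AtomicLong();

    public ClientPool(List<T> clients) {
        this.clients = clients;
    }

    // Pick the next client in round-robin order; thread-safe.
    public T next() {
        int i = (int) (counter.getAndIncrement() % clients.size());
        return clients.get(i);
    }

    public static void main(String[] args) {
        ClientPool<String> pool =
            new ClientPool<>(List.of("client-0", "client-1", "client-2"));
        for (int n = 0; n < 4; n++) {
            System.out.println(pool.next());  // cycles 0, 1, 2, 0
        }
    }
}
```

As boris.pa... notes, this only helps when the single IO thread is actually the bottleneck; it did not reduce timeouts in their case.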
It seems like we could be hitting a similar issue on our prod Linux environment. So I've kind of replicated it on my local machine:
- Install memcached:
https://wincent.com/wiki/Installing_memcached_1.4.4_on_Mac_OS_X_10.6.2_Snow_Leopard
- Simple webapp on Tomcat that writes to and gets from memcached
- Apache Benchmark concurrent load (60 concurrent writes, 65 concurrent reads):
ab -r -n 10000 -c 60 http://127.0.0.1:45000/memcached/set/dom/kkk > mset.txt 2>&1 &
ab -r -n 10000 -c 65 http://127.0.0.1:45000/memcached/get/dom > mget.txt 2>&1 &
With 2.4.2 I see lots of exceptions during the first run of the above. On subsequent runs I don't see the exceptions:
[2010.05.28 22:01:17] SpyMemcachedCache - memcached timed out for
expression bbc-forge-domt-dom Timed out waiting for operation -
failing node: localhost/127.0.0.1:11212
2010-05-28 22:01:17.333 INFO
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl: Removing
cancelled operation:
net.spy.memcached.protocol.ascii.GetOperationImpl@1bdb52c8
2010-05-28 22:01:17.334 INFO
net.spy.memcached.protocol.ascii.AsciiMemcachedNodeImpl: Removing
cancelled operation:
With the latest 2.5 I see lots of exceptions during the first run of the above, and the client disconnects and will not reconnect:
2010-05-28 22:16:16.934 WARN net.spy.memcached.MemcachedConnection:
Closing, and reopening {QA sa=localhost/127.0.0.1:11212, #Rops=0,
#Wops=27760, #iq=0, topRop=null,
topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@598a15ca,
toWrite=0, interested=8}, attempt 11.
2010-05-28 22:16:46.937 INFO net.spy.memcached.MemcachedConnection:
Reconnecting {QA sa=localhost/127.0.0.1:11212, #Rops=0, #Wops=27760,
#iq=0, topRop=null,
2010-05-28 22:17:16.940 WARN net.spy.memcached.MemcachedConnection:
Closing, and reopening {QA sa=localhost/127.0.0.1:11212, #Rops=0,
#Wops=27760, #iq=0, topRop=null,
topWop=net.spy.memcached.protocol.ascii.GetOperationImpl@598a15ca,
toWrite=0, interested=8}, attempt 13.
Sanity check after the load test: memcached is up and running, but the client isn't:
- curl the webapp
$ curl http://localhost:45000/memcached/get/dom/kkk
Key: bbc-forge-domt-dom==does not exist
- telnet memcached
telnet localhost 11212
Trying ::1...
Connected to localhost.
Escape character is '^]'.
get bbc-forge-domt-dom
VALUE bbc-forge-domt-dom 0 5
dom
END
quit
Currently it's looking like the 2.3.1 client does not have this concurrency issue. It also only seems to occur when there are concurrent SETs and GETs. Going to try to do a bit more testing this weekend.
/dom
Original comment by dominic....@gmail.com
on 29 May 2010 at 9:52
I re-ran my apachebench tests on a patched version of 2.4.2:
- got the source from GitHub:
git clone git://github.com/dustin/java-memcached-client.git
- created a local branch from the 2.4.2 tag:
git branch my2.4.2 2.4.2
git checkout my2.4.2
- reverted src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java from the "Handle operations that are writing and reading at the same time" commit:
http://github.com/dustin/java-memcached-client/commit/32762f8b7908d91de10fb74d905398818b1552e7
back to the previous (2.4.1) version:
-- reverted net.spy.memcached.protocol.TCPMemcachedNodeImpl from:
http://github.com/dustin/java-memcached-client/raw/32762f8b7908d91de10fb74d905398818b1552e7/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
-- to the version of net.spy.memcached.protocol.TCPMemcachedNodeImpl here:
http://github.com/dustin/java-memcached-client/raw/2.4.1/src/main/java/net/spy/memcached/protocol/TCPMemcachedNodeImpl.java
- compiled a local jar (using the pom.xml here:
http://code.google.com/p/spymemcached/wiki/HowtoBuild)
- re-ran apachebench, and the tests look good. I see only a couple of exceptions, nothing like the previous number.
I am yet to run load testing on our production Linux environment with this patched version. I will hopefully be able to do that early next week (Monday or Tuesday) perhaps. I'm going to attach the patched version I've made to this comment; maybe it helps you guys, worth a shot I suppose (if you get a chance to try it, let me know how you get on). I can't tell you 100% if this addresses the issues you are having. All the unit tests in 2.4.2 passed with this patched version when I built it, btw.
I've not looked at any patch to 2.5, as I'm not sure the patch I applied to 2.4.2 will be enough to fix the client-not-reconnecting issue in 2.5 (if I can find time I'll have a look, not sure though).
Hope it helps you.
/dom
Original comment by dominic....@gmail.com
on 30 May 2010 at 12:21
Attachments:
[deleted comment]
Dominic,
Have you had a chance to load test the patched 2.4.2 build? We are running 2.4.2 in a production env and occasionally running into the timeout exception. We have a high-throughput service. We have 3 app servers in the cluster and only one at any given time gives this timeout exception. Never had all the servers go into the same state.
Venu
Original comment by venu.a...@gmail.com
on 9 Aug 2010 at 10:30
boris.partensky,
Curious why you moved from whalin to spy. We are considering switching from spy to whalin, as we get a TimeoutException occasionally which we were not able to recreate in a test env, nor were we able to pin down the cause. One thing that is consistent about this bug is that the client releases itself from this bottleneck after 15 minutes. This is always the case.
Original comment by venu.a...@gmail.com
on 9 Aug 2010 at 10:33
venu.alla, sorry, I just read your question. Whalin had its own set of scalability problems for us, which made me look into spy to begin with. Besides, it is no longer maintained and is way behind the server. I wanted to use features like CAS. We also use async sets/deletes a lot, which would have required additional coding on our part had we continued with whalin.
Original comment by boris.pa...@gmail.com
on 20 Sep 2010 at 1:26
We are seeing this problem on a regular basis. Whenever the get load is increased above a certain threshold we start seeing too many timeouts.
I ran it with trace logging on and noticed that it adds operations to the send queue for some time before actually sending them to the server. I could not find a configuration setting to change that behavior. This works like a multi-threaded client even though it is single-threaded. Due to this, some of the requests could be timing out as they wait for others to fill the batch.
Original comment by sanjeev....@adara.com
on 8 Jul 2011 at 11:10
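A recurring theme in this thread is operations sitting in the send queue longer than the caller is willing to wait. The defensive pattern on the caller's side (bound the wait, then cancel the operation so it doesn't linger in the queue) can be sketched with plain java.util.concurrent; the class and method names are illustrative, not the spymemcached API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutHandlingDemo {
    // Bounded wait on an operation's Future; on timeout, cancel it so it
    // does not linger in the queue (the client logs "Removing cancelled
    // operation" when it drops such operations).
    static String getWithTimeout(Future<String> f, long millis) {
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);
            return "timed-out";
        } catch (Exception e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        // Simulate an operation stuck behind others in the send queue.
        Future<String> slow = io.submit(() -> {
            Thread.sleep(500);
            return "value";
        });
        System.out.println(getWithTimeout(slow, 50)); // prints "timed-out"
        io.shutdownNow();
    }
}
```

This only contains the damage per request; it does not address the underlying queue-buildup behavior described above.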
After hunting this down to the root cause, our team has found it is likely caused by the issue noted here:
http://code.google.com/p/spymemcached/issues/detail?id=186
Original comment by hypefi...@gmail.com
on 19 Jul 2011 at 12:14
I have started work on Issue 186, which is claimed to be the cause of this issue.
Original comment by mikewie...@gmail.com
on 24 Aug 2011 at 12:13
Issue 186 sounds like it should only affect FailureMode.Redistribute, but I am
still observing this when using FailureMode.Retry.
Original comment by mewmewb...@gmail.com
on 25 Aug 2011 at 8:03
I tried to step through as much code as possible, using my IDE to clear the full input queue so that I could get past node.addOp. It seems that, at least in my case, the problem is that the selector is stuck in a state where it's no longer able to wake up. (The KQueueSelectorImpl has its interruptTriggered flag set to true, so no wakeup call can trigger any change anymore.) Not sure how much help this is, but thought I'd share.
Original comment by mewmewb...@gmail.com
on 26 Aug 2011 at 6:19
We have suffered many timeouts in a heavily loaded environment (500+ threads). After much debugging, we discovered by analysing the tcpdump that the problem was an old version of memcached installed by Ubuntu Server.
Original comment by sissobr
on 21 Sep 2011 at 5:42
Can you tell us which version of memcached you had? We're on 1.4.5.
Original comment by mewmewb...@gmail.com
on 21 Sep 2011 at 7:23
Like mewmewb...@gmail.com, we are also observing timeouts under heavy load while using FailureMode.Retry (with membase and the vbucket locator).
Original comment by delf...@gmail.com
on 21 Sep 2011 at 7:39
[deleted comment]
We were using memcached 1.4.2 provided by Ubuntu Lucid. After migrating Ubuntu to Natty the problem was solved. It could be a problem in another library too.
Original comment by sissobr
on 24 Sep 2011 at 6:38
Can someone verify that this issue is still present in spymemcached 1.7.1? I haven't been able to reproduce it yet. On another note, the fix for Issue 186 has been checked into the latest code branch. Since some of you don't think this issue is related to 186, I will leave it open for now.
Original comment by mikewie...@gmail.com
on 5 Oct 2011 at 2:30
Yes, we are having the issue with Spymemcached 1.7.1
Original comment by delf...@gmail.com
on 5 Oct 2011 at 2:55
You may want to try some JVM tuning or multiple clients. If you find one CPU core fully used (meaning constantly at 100% CPU; on Linux, check this with mpstat -P ALL), the use of multiple client objects may help.
Also, note this tuning has been helpful for me in the past:
http://www.couchbase.org/wiki/display/membase/Couchbase+Java+Client+Library
Original comment by ingen...@gmail.com
on 14 Oct 2011 at 8:14
Original comment by ingen...@gmail.com
on 15 Oct 2011 at 3:07
We've been experiencing this issue at peak times, which caused several downtimes of a specific Facebook social game. We're (were) using 2.7.3. Sadly, we have had to switch to xmemcached, which has proven to cope with load quite well. I'll be glad to try a patched/fixed version in our load test environment.
Original comment by pirat...@gmail.com
on 20 Jan 2012 at 4:26
Thanks for the update. How did the timeout cause the downtime? Was there a large batch of them or something?
Any thoughts on a test for this? We did invest significant time into trying to make this better, but at some level or another, when operation times go too long for any reason, it's still correct to let the app know.
What was your timeout value set to, @piratiss?
Original comment by ingen...@gmail.com
on 20 Jan 2012 at 4:37
Hi, sorry for this late-late reply :)
We were using a timeout of 1s, so the game couldn't get critical user sessions from memcached, and that caused some downtime.
We're planning to move from our cloud provider, and now that I have a proper load testing env I will try with the latest version (2.8.1) and will let you know.
Thanks!
Original comment by pirat...@gmail.com
on 7 May 2012 at 10:10
Quick update: the load test env is stable with spymemcached (2s timeout) and performs way better than xmemcached (~50%+ req/sec)... So I guess we'll switch back to spymemcached... Kudos for the great work!
Original comment by pirat...@gmail.com
on 9 May 2012 at 1:45
Original issue reported on code.google.com by
weiju...@gmail.com
on 1 Apr 2010 at 7:22