Closed GoogleCodeExporter closed 9 years ago
A few more information on the topic:
Memcached version: 1.4.20
FreeBSD version: 10.0-STABLE #0 r269376
I could not yet reproduce the problem on FreeBSD 9 (which uses accept()).
Test to reproduce the problem on FreeBSD 10 using accept() instead of accept4()
are still running but did not show the problem, yet.
I also attached the test scripts.
My tests typically run with '/test_memcached <server> <port> 1000'.
You might need to increase the number of allowed processes for this though (at
least on OSX)
Original comment by arne.cl...@gmail.com
on 21 Aug 2014 at 1:52
Attachments:
Related FreeBSD ticket: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=192889
Original comment by arne.cl...@gmail.com
on 21 Aug 2014 at 2:40
Tests with accept() instead of accept4() showed the same Problem.
Original comment by arne.cl...@gmail.com
on 22 Aug 2014 at 6:01
Attachments:
Can you get the output of "stats conns" while it's hung up like this? pre-open
a connection to run stats from and such.
Original comment by dorma...@rydia.net
on 22 Aug 2014 at 6:20
and you say that accept() has the same problem, but only under freebsd 10, or
both 9 and 10?
Original comment by dorma...@rydia.net
on 22 Aug 2014 at 6:20
Looks something like this for a couple of rows:
STAT 7767:addr tcp:192.168.6.210:53765
STAT 7767:state conn_read
STAT 7767:secs_since_last_cmd 386
This time it's stuck in CLOSE_WAIT. But as the admins told me they observed
this, too with CLOSE_WAIT slowly converting to CLOSED. I'll wait some minutes
if a can see that happen here, too.
The other tests were run over night, so I could not directly observe that until
now.
Original comment by arne.cl...@gmail.com
on 22 Aug 2014 at 6:42
Attachments:
Our admins did not observe this behavior on FreeBSD 9, only on FreeBSD 10
machines.
However - I did not run my tests against a FreeBSD 9 machine, yet, but I will
today.
Original comment by arne.cl...@gmail.com
on 22 Aug 2014 at 6:49
FreeBSD 9 test is now running (memcached 1.4.13).
The netstat query shows a totally different picture here.
While on FreeBSD 10 the CLOSE_WAIT and CLOSED numbers go up and down all the
time during the test, the numbers stay pretty much stable on the FreeBSD 9
machine.
I do see single CLOSE_WAITS coming in but I don't see any changes in the CLOSED
number at all.
Original comment by arne.cl...@gmail.com
on 22 Aug 2014 at 7:08
I had a closer look a the "stats conns" output on the FreeBSD 10 machine:
7 Connections are in conn_new_cmd:
STAT 3845:addr tcp:192.168.6.210:57992
STAT 3845:state conn_new_cmd
STAT 3845:secs_since_last_cmd 2595
1 Connection is in this conn_listening:
STAT 16:addr tcp:192.168.2.254:11215
STAT 16:state conn_listening
STAT 16:secs_since_last_cmd 2595
Everything else sits idle at state conn_read.
Another interesting thing:
I just closed the telnet session which reduced CLOSE_WAIT by 3 but it is still
stuck.
I tried a reconnect but this as not possible. That's strange because a
connection should have been freed up ...?
The load test had long been killed before I closed the session so there were no
new connects.
Original comment by arne.cl...@gmail.com
on 22 Aug 2014 at 7:31
Two updates:
On the stuck FreeBSD 10 system all connections just went from CLOSE_WAIT to
CLOSED and are now stuck there.
We updated the FreeBSD 9 server to memcached 1.4.20 and are running the test
script there.
The new version shows the same behavior as before.
Original comment by arne.cl...@gmail.com
on 22 Aug 2014 at 8:43
The test on FreeBSD 9 caused a stuck, too but this one looks different.
- The log complains about "accept(): Too many open files" but it's far away
from the limits.
- There are "only" 192 connections permanently stuck in CLOSED.
- I cannot connect via telnet to this server anymore (a still open telnet
connection works though).
- The stats command shows a low number of active connections
I have to note that memcached 1.4.20 is linked against libevent 1.4.14b-stable
on this system which might be a source of this issue (the FreeBSD 10 memcached
is linked against libevent 2.0.21-stable).
We will run another test with an updated libevent on monday.
FreeBSD 9 values:
stats
STAT curr_connections 5
STAT total_connections 17626537
STAT connection_structures 8176
stats conns
STAT 16:addr tcp:192.168.1.53:11222
STAT 16:state conn_listening
STAT 16:secs_since_last_cmd 17825
STAT 17:addr udp:192.168.1.53:11222
STAT 17:state conn_read
STAT 17:secs_since_last_cmd 17825
STAT 18:addr udp:192.168.1.53:11222
STAT 18:state conn_read
STAT 18:secs_since_last_cmd 17825
STAT 19:addr udp:192.168.1.53:11222
STAT 19:state conn_read
STAT 19:secs_since_last_cmd 17825
STAT 20:addr udp:192.168.1.53:11222
STAT 20:state conn_read
STAT 20:secs_since_last_cmd 17825
STAT 22:addr tcp:192.168.6.210:54704
STAT 22:state conn_parse_cmd
STAT 22:secs_since_last_cmd 0
sysctl -a | grep files
kern.maxfiles: 12328
kern.maxfilesperproc: 11095
kern.openfiles: 813
fstat | grep memcached
[... skipped about 20 lines of pipes and regular files ...]
16* internet stream tcp fffffe006599a000
17* internet dgram udp fffffe021126b498
18* internet dgram udp fffffe021126b498
19* internet dgram udp fffffe021126b498
20* internet dgram udp fffffe021126b498
22* internet stream tcp fffffe02c796a3d
Original comment by arne.cl...@gmail.com
on 22 Aug 2014 at 1:56
Sorry if I'm misreading, but the total number of connections shown in "stats
conns" is how many?
Are you using -o maxconns_fast ?
Please attach text files rather than screenshots. I can't grep a picture
without OCR'ing it first, and getting a subset of data instead of the whole
thing isn't useful.
thanks.
Original comment by dorma...@rydia.net
on 23 Aug 2014 at 11:36
I'm not using a different configuration than shown in my initial post, so no -o
maxconns_fast.
Regarding the high connection numbers - the test tries to simulate heavy loads
with a lot of connections going on/off as well as occasionally reaching the
connection limit. This test normally runs for a long time (but not always)
that's why there are so many connections.
I attached a part of the log that got stuck - the last 1.000.000 lines. The
whole log from that test is 24 GB large.
I'll try to reproduce it once more so I can give you stats and stats conns as a
file, too.
Original comment by arne.cl...@gmail.com
on 25 Aug 2014 at 7:09
Attachments:
I mean I'm curious about the full "stats conns" output? I want to see if
memcached thinks those connections are actually still there, or if it has the
sockets open but has forgotten about them.
Original comment by dorma...@rydia.net
on 25 Aug 2014 at 7:50
The current test was still running when I went home.
I already reproduced it once today but the open telnet connection did not react
to any of my commands, so I had to start another test.
I will send you the output of stats and stats conns once I have a working
telnet connection with a stuck test.
Original comment by arne.cl...@gmail.com
on 25 Aug 2014 at 7:59
I tried the test with a smaller amout of max connections - this one got stuck
after just a few seconds:
/usr/local/bin/memcached -u nobody -c 2048 -m 512 -l 192.168.2.157 -p 11215
-vvv > & cache3.log
The connections are still stuck in CLOSE_WAIT at the moment.
I'll attach another logfile when the connections switched to CLOSED
Original comment by arne.cl...@gmail.com
on 26 Aug 2014 at 12:04
Attachments:
Here are the remaining stats commands + stats conns after all connections
switched to closed
Original comment by arne.cl...@gmail.com
on 26 Aug 2014 at 2:45
Attachments:
I see in the stats output that curr_connections is always 455, despite maxconns
being set to 2048.
I recall freebsd having more process limits than linux typically does. Can you
set -c to 256 and re-run your test? Just in case.
Otherwise, I'm not sure offhand. I'll take a look at this in more detail when I
get some time (could be a week or two though).
You said 1.4.17 definitionly doesn't do this? Even when built under freebsd 10?
Original comment by dorma...@rydia.net
on 26 Aug 2014 at 7:36
We tested it on 1.4.13 on FreeBSD 9 -> no problem.
Same system, 1.4.20 + lib event 1.x -> stuck
Same system, 1.4.20 + lib event 2.x -> still running after a couple of hours
What I did see today (FreeBSD 9, 1.4.20, libevent 2) is the following behavior:
- CLOSE_WAIT is 0, then suddenly rising to high numbers in a short time
- In that time the client shows a lot of timeouts
- The client pauses for a short time -> CLOSE_WAIT cleans up and everything is
running as before
On FreeBSD 10 the CLOSE_WAIT goes up and down all the time and the client never
stops for such a long time. So there IS a difference in the way the two systems
are handling connections.
I did not run the test on a -c 256 server as the FreeBSD 9 test is still
running.
Original comment by arne.cl...@gmail.com
on 27 Aug 2014 at 6:13
Stuck with -c 256, too (FreeBSD 10). Did take a lot longer though.
The logfile is 40 GB total so I added "just" the last 1 million rows.
The file c256.txt contains the result of all stats commands.
The file cache4_tail.log.gz contains the log file.
Original comment by arne.cl...@gmail.com
on 2 Sep 2014 at 12:15
Attachments:
I have same problem......
Freebsd 10
Memcached 1.4.20
libevent 1.x
Original comment by gamem...@gmail.com
on 28 Sep 2014 at 4:17
Same problem :(
When it stuck - in kernel log
kernel: sonewconn: pcb 0xfffff80328c68930: Listen queue overflow: 1 already in
queue awaiting acceptance (14114 occurrences)
and this line disappears from netstat -Lan
tcp4 0/0/1024 192.168.0.32.11211
Original comment by mlav...@gmail.com
on 23 Oct 2014 at 8:56
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=192889 some more info.
Original comment by mlav...@gmail.com
on 25 Oct 2014 at 7:23
Any new information on this?
I still haven't set up a freebsd VM to test with, but I'm pretty sure the code
isn't doing anything wrong :( If certain versions of fbsd are broken in some
way that's testable we can disable accept4 for them...
Original comment by dorma...@rydia.net
on 1 Jan 2015 at 6:18
I plan to do a test run on FreeBSD 10.1 this week - we'll see if this makes any
difference.
Original comment by arne.cl...@gmail.com
on 5 Jan 2015 at 8:17
The current test is running:
memcached -u nobody -c 2048 -m 512 -vvv [...]
FreeBSD 10.1-RELEASE (GENERIC) #0 r274401
memcached 1.4.22
libevent 2.0.21-stable
At the moment it behaves a lot more like FreeBSD 9 (CLOSE is low, CLOSE_WAIT
stays zero) but I'll let it run overnight to be sure.
Original comment by arne.cl...@gmail.com
on 5 Jan 2015 at 3:42
The test config mentioned above did not show any error (yet) on a VM and on a
physical server. So the situation has at least improved but as we don't have a
100% reproduction rule I will continue testing with different settings now.
As the bug showed many signs of a race condition maybe the thread related fixes
in 1.4.22 fixed the problem. I did not go through the FreeBSD 10.1 changelog,
yet, but maybe something happened there, too.
Original comment by arne.cl...@gmail.com
on 7 Jan 2015 at 7:55
Further test did not reproduce the error so the chance is high that either
FreeBSD 10.1 or Memcached 1.4.22 fixed it. As we don't know what caused the
problem it is not 100% sure though.
We've upgraded our servers showing the problem to see if the problem disappears
in production, too.
Original comment by arne.cl...@gmail.com
on 13 Jan 2015 at 8:39
You can't run .21 under your test on freebsd 10.1? I really highly doubt
anything we changed in .22 would cause a bug like this.
Original comment by dorma...@rydia.net
on 13 Jan 2015 at 5:58
I will try .20 as this was the version where we originally encountered the
problem.
As I already said this bug shows signs of a race condition and/or overflow
(hard to reproduce, no fixed reproduction rule, happens under heavy load).
The following fixes introduced in .22 may have had an impact:
- Fix potential corruption for incr/decr of 0b items
- slab re-balance is not thread-safe in function do_item_get
- Fix potential corruption in hash table expansion
- use item lock instead of global lock when hash expanding
That is of course if the bug is really on memcached's side.
On the FreeBSD side there were not many changes that are related (see
https://www.freebsd.org/releases/10.1R/relnotes.html):
- SA-14:19.tcp 16 September 2014 Denial of Service in TCP packet processing
- SA-14:08.tcp 30 April 2014 Fix TCP reassembly vulnerability
- A bug in sctp(4) that would allow two listening sockets bound to the same
port has been fixed.
These are all not directly related but changes in that area might have affected
the connection code.
Original comment by arne.cl...@gmail.com
on 14 Jan 2015 at 9:40
The tests with memcached 1.4.20 did not hang on FreeBSD 10.1 either.
So it looks like one of the FreeBSD fixes removed the problem but that's again
just guessing.
The live servers we updated to 10.1 / 1.4.22 are still running without a hitch
so it seems this setup is stable again.
Original comment by arne.cl...@gmail.com
on 15 Jan 2015 at 1:42
The FreeBSD team explained the issue:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=192889#c15, so I guess we can
set this to fixed now.
Original comment by arne.cl...@gmail.com
on 23 Feb 2015 at 8:47
fixed, even though it was an upstream problem :)
Original comment by dorma...@rydia.net
on 5 Jul 2015 at 1:08
Original issue reported on code.google.com by
arne.cl...@gmail.com
on 21 Aug 2014 at 10:19Attachments: