Memcached 1.4.2 server segmentation fault

GoogleCodeExporter commented 9 years ago

Hi guys,

This was an issue I had with version 1.4.0 and submitted a report in the
group message board.  Dustin responded to me:

"We've got a new release coming very soon (fixes over 1.4.0).  This looks
like the kind of thing that needs to be there."

I mistook "that needs to be there" as "bug will be fixed in the new
releases", but now I think he meant "submitted to the bug tracker".

Cutting and pasting from the original posting
(http://groups.google.com/group/memcached/browse_thread/thread/778dfe3968a7f74a)

[ post 1 ]

Hi guys,

I have been trying to diagnose a memcached segmentation fault.  It is
running on a 32 bit Linux 2.6.20 build (Gentoo).  The memcached server
is being connected through the standard PHP (5.2.6 module to Apache 2)
client.  After a few reloads of the provided script, the server will
crash (increasing the data size seems to accelerate it).

The server is started as follows:

/usr/bin/memcached -l localhost -vvv

It will crash with the following dumped to the terminal:

...
<30 new auto-negotiating client connection
30: going from conn_new_cmd to conn_waiting
30: going from conn_waiting to conn_read
30: going from conn_read to conn_parse_cmd
30: Client using the ascii protocol
<30 set test 0 10 50526
30: going from conn_parse_cmd to conn_nread
> FOUND KEY test
>30 STORED

30: going from conn_nread to conn_write
30: going from conn_write to conn_new_cmd
30: going from conn_new_cmd to conn_waiting
30: going from conn_waiting to conn_read
30: going from conn_read to conn_closing
<30 connection closed.
<30 new auto-negotiating client connection
30: going from conn_new_cmd to conn_waiting
30: going from conn_waiting to conn_read
30: going from conn_read to conn_parse_cmd
30: Client using the ascii protocol
<30 get test
> FOUND KEY test
>30 sending key test
>30 END

30: going from conn_parse_cmd to conn_mwrite
30: going from conn_mwrite to conn_new_cmd
30: going from conn_new_cmd to conn_waiting
30: going from conn_waiting to conn_read
30: going from conn_read to conn_closing
<30 connection closed.
<30 new auto-negotiating client connection
30: going from conn_new_cmd to conn_waiting
30: going from conn_waiting to conn_read
30: going from conn_read to conn_parse_cmd
30: Client using the ascii protocol
<30 set test 0 10 50526
30: going from conn_parse_cmd to conn_nread
> FOUND KEY test
>30 STORED

30: going from conn_nread to conn_write
30: going from conn_write to conn_new_cmd
30: going from conn_new_cmd to conn_waiting
30: going from conn_waiting to conn_read
<31 new auto-negotiating client connection
31: going from conn_new_cmd to conn_waiting
31: going from conn_waiting to conn_read
31: going from conn_read to conn_parse_cmd
31: Client using the ascii protocol
<31 get test
> FOUND KEY test
>31 sending key test
>31 END

31: going from conn_parse_cmd to conn_mwrite
Segmentation fault

The PHP code used to produce this is:

<?php
function _memcache_test($k, $d)
{
        echo "memcache_connect\n";
        $mc = memcache_connect('localhost', 11211, 3);

        echo "memcache_set\n";
        memcache_set($mc, $k, $d, 0, 10);

        echo "memcache_close\n";
        memcache_close($mc);

        echo "memcache_connect\n";
        $mc = memcache_connect('localhost', 11211, 3);

        echo "memcache_get\n";
        $t = memcache_get($mc, $k);

        echo 'strlen(t): ', strlen($t), "\n";

        echo "memcache_close\n";
        memcache_close($mc);

}

echo '<pre>';

$sz = 50526;

$s = str_pad('', $sz, '0');

echo 'strlen(s): ', strlen($s), "\n";

_memcache_test('test', $s);
?>

Has anyone else seen anything like this?

Thanks,
thom

[ post 2 ]

Hi again,

Okay, I think I may have found the problem.  When sendmsg() (in
transmit()) returns -1 (EAGAIN - write will block), the update_event()
function encounters a segmentation fault when calling event_add().
This is probably because the previous call of event_base_set() passed
in an invalid pointer.  The code assumes that event.ev_base will
contain a valid pointer to the event base assigned to the event by a
previous call to either event_base_set() or event_add().  Maybe the
behavior pertaining to this internal event member changed?  Here is
sample code compiled against libevent-1.4.12:

#include <sys/types.h>
#include <stdio.h>
#include <event.h>

static void event_handler(const int fd, const short which, void * arg)
{ }

int
main()
{
        struct event_base               *eb;
        struct event                    ev;

        eb = event_init();

        ev.ev_base = NULL;

        printf("event_base: %p\n", (void *) eb);
        printf("event.ev_base: %p\n", (void *) ev.ev_base);

        event_set(&ev, 1, EV_WRITE | EV_PERSIST, event_handler, NULL);

        event_base_set(eb, &ev);
        printf("event.ev_base (event_base_set): %p\n", (void *)
ev.ev_base);

        event_add(&ev, 0);
        printf("event.ev_base (event_add): %p\n", (void *)
ev.ev_base);

}

It generates the following output:

event_base: 0x804b008
event.ev_base: (nil)
event.ev_base (event_base_set): (nil)
event.ev_base (event_add): (nil)

(note how the pointer value never changes)

The problem seems to have been resolved after applying the following
patch:

*** memcached-1.4.0/memcached.c Thu Jul  9 13:16:24 2009
--- memcached-1.4.0-patch/memcached.c   Wed Jul 29 12:04:04 2009
***************
*** 329,334 ****
--- 329,336 ----
          }
          MEMCACHED_CONN_CREATE(c);

+         c->event_base = base;
+
          c->rbuf = c->wbuf = 0;
          c->ilist = 0;
          c->suffixlist = 0;
***************
*** 3027,3033 ****
  static bool update_event(conn *c, const int new_flags) {
      assert(c != NULL);

!     struct event_base *base = c->event.ev_base;
      if (c->ev_flags == new_flags)
          return true;
      if (event_del(&c->event) == -1) return false;
--- 3029,3035 ----
  static bool update_event(conn *c, const int new_flags) {
      assert(c != NULL);

!     struct event_base *base = c->event_base;
      if (c->ev_flags == new_flags)
          return true;
      if (event_del(&c->event) == -1) return false;
diff -crB memcached-1.4.0/memcached.h memcached-1.4.0-patch/
memcached.h
*** memcached-1.4.0/memcached.h Thu Jul  9 13:16:24 2009
--- memcached-1.4.0-patch/memcached.h   Wed Jul 29 12:04:36 2009
***************
*** 316,321 ****
--- 316,322 ----
      enum conn_states  state;
      enum bin_substates substate;
      struct event event;
+     struct event_base *event_base;
      short  ev_flags;
      short  which;   /** which events were just triggered */

Hope this thread helps someone else.  Is this the proper channel to
report a possible bug to the memcached team?  If not, can someone
point me to the right channel.

Thanks,
thom

[ post 4 ]

On Jul 29, 12:47 pm, Dustin <dsalli...@gmail.com> wrote:

>   I'm looking at this.  I've never heard of an issue here, and I'm not
> entirely sure how to test for it proactively at the moment, but you
> seem to have done quite a good job of investigating it.  Thanks a lot.

One really dumb way that I was able to reproduce the error every time
(in testing for the possible crash cause) was to update_event(c,
EV_WRITE | EV_PERSIST) in drive_machine() at the conn_waiting case
(before the EV_READ).  The current event flags in the connection data
type are set to EV_READ | EV_PERSIST, so this call gets past the
_redundant request_ check in update_event().

Thanks for looking at this and I am looking forward to hearing your
assessment.

thom

---

Thank you for looking at this (again).  I had to patch our build of 1.4.2
to stop it from periodically crashing.

thom

Original issue reported on code.google.com by thomc...@gmail.com on 20 Oct 2009 at 7:33

GoogleCodeExporter commented 9 years ago

It's been difficult to reproduce this issue. I'll be taking a more careful look 
into
it tonight and hopefully resolving a patch.

Original comment by dorma...@rydia.net on 31 Oct 2009 at 1:49

GoogleCodeExporter commented 9 years ago

I'm going to punt on this from 1.4.3-rc1 - if we can reproduce and find the bug
before 1.4.3-final we'll include it, however.

I'm staring at this really hard and while the proposed patch should fix it (and 
is
how we do things in some other libevent-based projects), the claim of where it
crashed is sounding impossible via logic errors.

event_del() relies on c->event.ev_base being *correct*. So in this case 
somewhere
between event_del() and event_base_set(), the '*base' pointer is getting 
removed or
corrupted.

It's also possible the segfault is really in event_del() but isn't manifesting 
until
later? Maybe? I'd like to work with the user with full stack traces and a core 
dump..

So again, I feel like the patch will fix the problem, but also that it would 
hide
something that's potentially way more serious. I'd like to put some extra 
scruitiny
on the bug just in case. Apologies for it taking so long to take a look, and 
thanks
for filing a great bug report :)

Original comment by dorma...@rydia.net on 2 Nov 2009 at 3:22

GoogleCodeExporter commented 9 years ago

Thanks for taking a look at this.  The sample code that I provided (testing how
libevent behaves in regards to event.ev_base) seems to indicate that 
event.ev_base is
never altered by event_base_set().

Let me know which version of memcached you want me to build and if attaching the
subsequent core file to this thread is the preferred means of getting the file 
to
you.  I had to reload the test script in rapid succession several times to 
cause it
to fault because it has to encounter a state where writing will block.  Here is 
a
quick snapshot of what happened:

<31 new auto-negotiating client connection                                      
31: going from conn_new_cmd to conn_waiting                                     
31: going from conn_waiting to conn_read                                        
31: going from conn_read to conn_parse_cmd                                      
31: Client using the ascii protocol                                             
<31 set test 0 10 50526                                                         
31: going from conn_parse_cmd to conn_nread                                     
> FOUND KEY test                                                                
>31 STORED                                                                      
31: going from conn_nread to conn_write                                         
31: going from conn_write to conn_new_cmd                                       
31: going from conn_new_cmd to conn_waiting                                     
31: going from conn_waiting to conn_read                                        
31: going from conn_read to conn_closing                                        
<31 connection closed.                                                          
<31 new auto-negotiating client connection                                      
31: going from conn_new_cmd to conn_waiting                                     
31: going from conn_waiting to conn_read                                        
31: going from conn_read to conn_parse_cmd                                      
31: Client using the ascii protocol                                             
<31 get test                                                                    
> FOUND KEY test                                                                
>31 sending key test                                                            
>31 END                                                                         
31: going from conn_parse_cmd to conn_mwrite                                    

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb65b9b90 (LWP 32183)]        
0xb7f8f436 in event_add () from /usr/lib/libevent-1.4.so.2
(gdb) backtrace                                           
#0  0xb7f8f436 in event_add () from /usr/lib/libevent-1.4.so.2
#1  0x5ff40000 in ?? ()                                       
#2  0x0000b7f8 in ?? ()                                       
#3  0x080e4b38 in ?? ()                                       
#4  0x080e9260 in ?? ()                                       
#5  0xb7f7843e in pthread_mutex_lock () from /lib/libpthread.so.0
#6  0xc35d5f5e in ?? ()                                          
#7  0xffffffb8 in ?? ()                                          
#8  0x1cc483ff in ?? ()                                          
#9  0x5d5f5e5b in ?? ()                                          
#10 0x85838dc3 in ?? ()
...
#265 0x02fe8344 in ?? ()
#266 0x07c7ba75 in ?? ()
#267 0x00000000 in ?? ()
(gdb) gcore
Saved corefile core.32177

gdb ./memcached-debug core.32177
GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Core was generated by `/home/thom/memcached-1.4.2/memcached-debug -l localhost 
-vvv'.
Program terminated with signal 11, Segmentation fault.
[New process 32184]
[New process 32183]
[New process 32182]
[New process 32181]
[New process 32180]
[New process 32177]
#0  0xffffe410 in __kernel_vsyscall ()
(gdb) backtrace
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7f7a8c5 in ?? ()
#2  0x00000000 in ?? ()
(gdb)

Hope this helps.

Original comment by thomc...@gmail.com on 2 Nov 2009 at 3:12

GoogleCodeExporter commented 9 years ago

Build against the latest -rc. Use the memcached-debug binary so the symbols 
aren't
stripped. That's the only difference... aside from that and assert's being on. 
Maybe
an assert will be triggered instead, who knows...

Can you contact me off-list for a few rounds of debugging? We'll post results 
back
into the ticket.

Original comment by dorma...@rydia.net on 4 Nov 2009 at 5:51

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I've hit the same problem now on 1.4.5. The patch above seemed to fix the 
problem.
Server was crushing after ~ 15 minutes. We have a script which starts to fill 
memcached actively. 
Debug output was:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb6baeb90 (LWP 29069)]
0x00365627 in event_add () from /usr/lib/libevent-1.4.so.2
#0  0x00365627 in event_add () from /usr/lib/libevent-1.4.so.2
#1  0x08049bb0 in update_event (c=0x80c31f0, new_flags=20) at memcached.c:3313
#2  0x080520c6 in drive_machine (fd=35, which=2, arg=0x80c31f0) at 
memcached.c:3394
#3  event_handler (fd=35, which=2, arg=0x80c31f0) at memcached.c:3732
#4  0x00365188 in event_base_loop () from /usr/lib/libevent-1.4.so.2
#5  0x080559f6 in worker_libevent (arg=0x8066368) at thread.c:245
#6  0x00b91832 in start_thread () from /lib/libpthread.so.0
#7  0x00b0fe0e in clone () from /lib/libc.so.6

I applied the patch and now it works for 1 hour.

Original comment by mrd...@gmail.com on 8 Nov 2010 at 4:29

GoogleCodeExporter commented 9 years ago

I confirm the issue and the value of the patch.  I replaced a few days ago a 
memcached 1.4 that was crashing every few hours with a patched 1.4.5 which has 
been running since then, more than 60h.

Original comment by trudea...@gmail.com on 6 Dec 2010 at 3:57

GoogleCodeExporter commented 9 years ago

I wonder how other people can use memcached without that patch.

Original comment by mrd...@gmail.com on 6 Dec 2010 at 4:01

GoogleCodeExporter commented 9 years ago

I have exactly the same problem since months, and we set up a cron job to 
restart memcached after every crash... I went from 1.4.0 to 1.4.5 and it still 
crashes (daily crash). Will it be in the next release ?

I feel confident on this patch but i never patch any program... If someone has 
a nice tutorial to help me, i will try on my own.

Thank you.

Original comment by rit...@gmail.com on 9 Dec 2010 at 12:38

GoogleCodeExporter commented 9 years ago

Hi, Rituel.

To apply the patch you need to copy it into a file. Lets say you called it 
memcached-issue99-suggested-patch
Copy everything exactly from the first line (*** memcached-1.4.0/memcached.c 
Thu Jul  9 13:16:24 2009) to the last (short  which;   /** which events were 
just triggered */
).

Put that file into a directory with unpacked memcached source and run:

patch -p1 <memcached-issue99-suggested-patch

It should show you something like that:

patching file memcached.c
Hunk #1 succeeded at 331 (offset 2 lines).
Hunk #2 succeeded at 3303 (offset 274 lines).
patching file memcached.h
Hunk #1 succeeded at 335 (offset 19 lines).

After that you can do like always
./configure <YOUR OPTIONS HERE>
make
make install

Original comment by mrd...@gmail.com on 9 Dec 2010 at 1:06

GoogleCodeExporter commented 9 years ago

I'm attaching thomchin's patch, but against memcached 1.4.5 as a file just for 
convenience.

Original comment by mrd...@gmail.com on 9 Dec 2010 at 1:13

Attachments:

memcached1.4.5-issue99-suggested-patch

GoogleCodeExporter commented 9 years ago

Applying this fix and running it on my OpenBSD[1] server seems to cause 
deadlocks. A simple getMulti[2] using about 50 items causes the server to 
stall. 

Without the patch everything seems to work fine (though it's not running with 
any real load).

[1] 4.8-current, memcached 1.4.5 package.
[2] running the memcached php extension. libmemcached 0.44.

Original comment by otto.br...@gmail.com on 10 Dec 2010 at 12:11

GoogleCodeExporter commented 9 years ago

Maybe dormando is right that the patch does not fix the main problem, it just 
helps to overcome it on affected configurations.
The patch works for me, memcached works fine now under quite high load for 
several months. The server is on RedHat.

Original comment by mrd...@gmail.com on 10 Dec 2010 at 2:11

GoogleCodeExporter commented 9 years ago

Hi,

Thank you for the help. We patch memcached with 
memcached1.4.5-issue99-suggested-patch and it still crashes :-(

Last crash Dec 23 00:33:10 srv13 kernel: [3110555.180650] memcached[25152]: 
segfault at e62fb0fc ip 0804a856 sp b7e9f2d0 error 7 in memcached[8048000+12000]

Original comment by newsdest...@gmail.com on 23 Dec 2010 at 10:35

GoogleCodeExporter commented 9 years ago

Anyone still watching this ticket?

Please try 1.4.6-rc1 (or final if it's out by the time you see this). If you 
can still get it to segfault and care about getting this fixed, please respond 
to the issue.

Please state the exact versions of your OS, libevent, gcc, kernel, memcached, 
etc.

Original comment by dorma...@rydia.net on 12 Jul 2011 at 11:08

GoogleCodeExporter commented 9 years ago

Hi,

Thank you for checking in, and yes, I am still watching this ticket as it still 
continues to be an issue for me.  I manually apply my patch on every upgrade to 
eliminate the "Segmentation fault".  The patched memcached runs indefinitely 
until it is manually restarted after a version upgrade (months at a time).

Here is the information you requested:

memcached: 1.4.6-rc1
OS: linux 32 bit
libevent: 2.0.12
gcc: 4.4.4-r2
kernel: 2.6.30-r8

I did a default build of memcached from the source downloaded from the 
memcached repository.  I was able to get a SIGSEGV from a single request using 
the same test script included with the original bug filing (using a new size of 
524288):

./memcached -l localhost -vvv
slab class   1: chunk size        80 perslab   13107
slab class   2: chunk size       104 perslab   10082
slab class   3: chunk size       136 perslab    7710
slab class   4: chunk size       176 perslab    5957
slab class   5: chunk size       224 perslab    4681
slab class   6: chunk size       280 perslab    3744
slab class   7: chunk size       352 perslab    2978
slab class   8: chunk size       440 perslab    2383
slab class   9: chunk size       552 perslab    1899
slab class  10: chunk size       696 perslab    1506
slab class  11: chunk size       872 perslab    1202
slab class  12: chunk size      1096 perslab     956
slab class  13: chunk size      1376 perslab     762
slab class  14: chunk size      1720 perslab     609
slab class  15: chunk size      2152 perslab     487
slab class  16: chunk size      2696 perslab     388
slab class  17: chunk size      3376 perslab     310
slab class  18: chunk size      4224 perslab     248
slab class  19: chunk size      5280 perslab     198
slab class  20: chunk size      6600 perslab     158
slab class  21: chunk size      8256 perslab     127
slab class  22: chunk size     10320 perslab     101
slab class  23: chunk size     12904 perslab      81
slab class  24: chunk size     16136 perslab      64
slab class  25: chunk size     20176 perslab      51
slab class  26: chunk size     25224 perslab      41
slab class  27: chunk size     31536 perslab      33
slab class  28: chunk size     39424 perslab      26
slab class  29: chunk size     49280 perslab      21
slab class  30: chunk size     61600 perslab      17
slab class  31: chunk size     77000 perslab      13
slab class  32: chunk size     96256 perslab      10
slab class  33: chunk size    120320 perslab       8
slab class  34: chunk size    150400 perslab       6
slab class  35: chunk size    188000 perslab       5
slab class  36: chunk size    235000 perslab       4
slab class  37: chunk size    293752 perslab       3
slab class  38: chunk size    367192 perslab       2
slab class  39: chunk size    458992 perslab       2
slab class  40: chunk size    573744 perslab       1
slab class  41: chunk size    717184 perslab       1
slab class  42: chunk size   1048576 perslab       1
<31 server listening (auto-negotiate)
<32 send buffer was 107520, now 268435456
<32 server listening (udp)
<32 server listening (udp)
<32 server listening (udp)
<32 server listening (udp)
<33 new auto-negotiating client connection
33: going from conn_new_cmd to conn_waiting
33: going from conn_waiting to conn_read
33: going from conn_read to conn_parse_cmd
33: Client using the ascii protocol
<33 set test 0 10 524288
33: going from conn_parse_cmd to conn_nread
> NOT FOUND test
>33 STORED
33: going from conn_nread to conn_write
33: going from conn_write to conn_new_cmd
33: going from conn_new_cmd to conn_waiting
33: going from conn_waiting to conn_read
<34 new auto-negotiating client connection
33: going from conn_read to conn_closing
<33 connection closed.
34: going from conn_new_cmd to conn_waiting
34: going from conn_waiting to conn_read
34: going from conn_read to conn_parse_cmd
34: Client using the ascii protocol
<34 get test
> FOUND KEY test
>34 sending key test
>34 END
34: going from conn_parse_cmd to conn_mwrite
Segmentation fault

---

After doing a clean build using the same patch file I provided you, the daemon 
seems to be stable again (100 successful requests using the same test script).  
Thanks for looking into this once more, and let me know if you need additional 
information.

thom

Original comment by thomc...@gmail.com on 13 Jul 2011 at 1:10

GoogleCodeExporter commented 9 years ago

Is this still on gentoo? Which php client are you using and at what version?

Any chance you could run the memcached-debug program under gdb and kill it a 
few times while collecting backtraces? Curious to see if it's always in the 
same spot.

Original comment by dorma...@rydia.net on 13 Jul 2011 at 6:52

GoogleCodeExporter commented 9 years ago

Also, can you try the 1.6 beta and see if that fails for you?

Original comment by dorma...@rydia.net on 13 Jul 2011 at 7:52

GoogleCodeExporter commented 9 years ago

Yes, I am still running this on gentoo, PHP 5.3.3 with a memcache 2.2.5 client 
(as reported by phpinfo()).  Here is a backtrace under gdb:

gdb --args ./memcached-debug -l localhost -vvv
GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
(gdb) run
Starting program: /home/thom/memcached-memcached-fe2fb1d/memcached-debug -l 
localhost -vvv
[Thread debugging using libthread_db enabled]
slab class   1: chunk size        80 perslab   13107
slab class   2: chunk size       104 perslab   10082
slab class   3: chunk size       136 perslab    7710
slab class   4: chunk size       176 perslab    5957
slab class   5: chunk size       224 perslab    4681
slab class   6: chunk size       280 perslab    3744
slab class   7: chunk size       352 perslab    2978
slab class   8: chunk size       440 perslab    2383
slab class   9: chunk size       552 perslab    1899
slab class  10: chunk size       696 perslab    1506
slab class  11: chunk size       872 perslab    1202
slab class  12: chunk size      1096 perslab     956
slab class  13: chunk size      1376 perslab     762
slab class  14: chunk size      1720 perslab     609
slab class  15: chunk size      2152 perslab     487
slab class  16: chunk size      2696 perslab     388
slab class  17: chunk size      3376 perslab     310
slab class  18: chunk size      4224 perslab     248
slab class  19: chunk size      5280 perslab     198
slab class  20: chunk size      6600 perslab     158
slab class  21: chunk size      8256 perslab     127
slab class  22: chunk size     10320 perslab     101
slab class  23: chunk size     12904 perslab      81
slab class  24: chunk size     16136 perslab      64
slab class  25: chunk size     20176 perslab      51
slab class  26: chunk size     25224 perslab      41
slab class  27: chunk size     31536 perslab      33
slab class  28: chunk size     39424 perslab      26
slab class  29: chunk size     49280 perslab      21
slab class  30: chunk size     61600 perslab      17
slab class  31: chunk size     77000 perslab      13
slab class  32: chunk size     96256 perslab      10
slab class  33: chunk size    120320 perslab       8
slab class  34: chunk size    150400 perslab       6
slab class  35: chunk size    188000 perslab       5
slab class  36: chunk size    235000 perslab       4
slab class  37: chunk size    293752 perslab       3
slab class  38: chunk size    367192 perslab       2
slab class  39: chunk size    458992 perslab       2
slab class  40: chunk size    573744 perslab       1
slab class  41: chunk size    717184 perslab       1
slab class  42: chunk size   1048576 perslab       1
[New Thread 0xb7eb66c0 (LWP 6457)]
[New Thread 0xb7e74b70 (LWP 6460)]
[New Thread 0xb7673b70 (LWP 6461)]
[New Thread 0xb6e72b70 (LWP 6462)]
[New Thread 0xb6671b70 (LWP 6463)]
[New Thread 0xb5e70b70 (LWP 6464)]
<34 server listening (auto-negotiate)
<35 send buffer was 107520, now 268435456
<35 server listening (udp)
<35 server listening (udp)
<35 server listening (udp)
<35 server listening (udp)
<36 new auto-negotiating client connection
36: going from conn_new_cmd to conn_waiting
36: going from conn_waiting to conn_read
36: going from conn_read to conn_parse_cmd
36: Client using the ascii protocol
<36 set test 0 10 524288
36: going from conn_parse_cmd to conn_nread
> NOT FOUND test
>36 STORED
36: going from conn_nread to conn_write
36: going from conn_write to conn_new_cmd
36: going from conn_new_cmd to conn_waiting
36: going from conn_waiting to conn_read
<37 new auto-negotiating client connection
36: going from conn_read to conn_closing
<36 connection closed.
37: going from conn_new_cmd to conn_waiting
37: going from conn_waiting to conn_read
37: going from conn_read to conn_parse_cmd
37: Client using the ascii protocol
<37 get test
> FOUND KEY test
>37 sending key test
>37 END
37: going from conn_parse_cmd to conn_mwrite

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7673b70 (LWP 6461)]
0xb802d1a1 in event_base_set () from /usr/lib/libevent-2.0.so.5
(gdb) backtrace
#0  0xb802d1a1 in event_base_set () from /usr/lib/libevent-2.0.so.5
#1  0x095f67c0 in ?? ()
#2  0x095f67c0 in ?? ()
#3  0x00000014 in ?? ()
#4  0x095f67d0 in ?? ()
#5  0x0804a26b in update_event (c=0xb802d0f7, new_flags=134579104)
    at memcached.c:3353
#6  0x08059d25 in event_handler (fd=37, which=2, arg=0x95f67c0)
    at memcached.c:3438
#7  0xb8030441 in event_base_loop () from /usr/lib/libevent-2.0.so.5

---

2 more additional backtraces only:

#0  0xb7ec41a1 in event_base_set () from /usr/lib/libevent-2.0.so.5
#1  0x0812dad0 in ?? ()
#2  0x0812dad0 in ?? ()
#3  0x00000014 in ?? ()
#4  0x0812dae0 in ?? ()
#5  0x0804a26b in update_event (c=0xb7ec40f7, new_flags=134579104)
    at memcached.c:3353
#6  0x08059d25 in event_handler (fd=36, which=2, arg=0x812dad0)
    at memcached.c:3438
#7  0xb7ec7441 in event_base_loop () from /usr/lib/libevent-2.0.so.5

---

#0  0xb7fcd1a1 in event_base_set () from /usr/lib/libevent-2.0.so.5
#1  0x0847e0a0 in ?? ()
#2  0x0847e0a0 in ?? ()
#3  0x00000014 in ?? ()
#4  0x0847e0b0 in ?? ()
#5  0x0804a26b in update_event (c=0xb7fcd0f7, new_flags=134579104)
    at memcached.c:3353
#6  0x08059d25 in event_handler (fd=36, which=2, arg=0x847e0a0)
    at memcached.c:3438
#7  0xb7fd0441 in event_base_loop () from /usr/lib/libevent-2.0.so.5

---

Seems to always be in the same spot, and I believe it is because the "base" 
variable does not contain a valid address.

Thanks.

Original comment by thomc...@gmail.com on 13 Jul 2011 at 8:09

GoogleCodeExporter commented 9 years ago

Hi,

As for testing 1.6.0-beta, I am having trouble manually generating "configure" 
since it is missing in the source (as it was in the release candidate).  
aclocal is reporting: "configure.ac:2: file `m4/version.m4' does not exist", 
and it is indeed missing in the source that I downloaded.  Is there something 
else I should be doing/running to get it ready to build?

Thanks,
thom

Original comment by thomc...@gmail.com on 13 Jul 2011 at 8:30

GoogleCodeExporter commented 9 years ago

http://memcached.googlecode.com/files/memcached-1.6.0_beta1.tar.gz - are you 
*sure* that you're using this file? The configure script is certainly not 
missing from this tarball.

Original comment by dorma...@rydia.net on 13 Jul 2011 at 8:38

GoogleCodeExporter commented 9 years ago

I was definitely *not* using that file :)  Downloaded it from the github source 
link.  With the provided tar link, I am getting the following warning which 
translates to an error:

gcc -std=gnu99 -DHAVE_CONFIG_H -I.   -I./include -I./libevent   
-fvisibility=hidden -pthread -g -O2 -Wall -Werror -pedantic 
-Wmissing-prototypes -Wmissing-declarations -Wredundant-decls 
-fno-strict-aliasing -MT mcstat.o -MD -MP -MF .deps/mcstat.Tpo -c -o mcstat.o 
`test -f 'programs/mcstat.c' || echo './'`programs/mcstat.c
cc1: warnings being treated as errors
programs/mcstat.c: In function 'print':
programs/mcstat.c:108: error: ignoring return value of 'fwrite', declared with 
attribute warn_unused_result
programs/mcstat.c:110: error: ignoring return value of 'fwrite', declared with 
attribute warn_unused_result
make[1]: *** [mcstat.o] Error 1

---

This was after the memcached daemon build succeeded, along with generation of 
the launching shell script, so I was able to test.  The daemon no longer 
appears to suffer from a "Segmentation fault", and I noticed that event.ev_base 
is now being manually set in memcached.c.  Did you consider this as a bug in 
libevent (not setting this value) and applied this temporary workaround until 
they fix it?

Thanks.

Original comment by thomc...@gmail.com on 13 Jul 2011 at 9:23

GoogleCodeExporter commented 9 years ago

I have relative little clue on what's been changed in 1.6 :P Was mostly curious 
if it'd been fixed there.

I'm still nervous about the change since I can't tell why it's failing in the 
first place, and have never been able to reproduce the failure on any machine 
that I have. Thinking we'll stare harder and maybe merge it anyway just after 
1.4.6 goes out, so it can sit in the tree for a while longer before 1.4.7.

Original comment by dorma...@rydia.net on 13 Jul 2011 at 9:36

GoogleCodeExporter commented 9 years ago

Staring at this stupid bug, again.

Can you point out where in memcached.c (from 1.6) you believe it to be manually 
setting the event.ev_base? The update_event() functions look essentially 
identical.

The only place it's being manually set in 1.6 (that I can see) is in the TAP 
code where it's adding an event to a different base.

I've replicated your -vvv output precisely, and have a script running which is 
attempting to set/get/delete a value of all possible sizes on a 32bit system. 
No crash.

Can you try the attached patch, and let me know which (if either) of the 
assert's your test hits? One thing we're doing wrong is not checking the output 
of event_base_set, but it doesn't look like that'd make a difference here.

Thanks! I don't want to give up on this, but I want to make sure it's done 
right.

Original comment by dorma...@rydia.net on 8 Aug 2011 at 2:02

GoogleCodeExporter commented 9 years ago

Here is another patch that I will likely stage in my for_147 tree for now. This 
is similar to yours, except it directly accesses that connection's thread's 
current base, for even more sanity.

After trying the assert patch, can you test this one against 1.4.6 and let me 
know if it has similarily good results?

Given that you're saying 1.6 works, but the backtraces in 1.4 show it going 
through some code which *wasn't* changed, I'm sorta even more terrified about 
what could possibly be causing this.

Original comment by dorma...@rydia.net on 8 Aug 2011 at 2:06

Attachments:

memcached_thread_base_fix.patch

GoogleCodeExporter commented 9 years ago

I believe the attached patch will correct the problem.  It addresses the issue 
of using a variable that has no value ever set to it (as my C example showed 
that the internal libevent event.ev_base member never gets assigned a value).  
Your patch does now use the conn.thread.base variable which gets set to the 
appropriated main_base through thread_init().

Your patch also mirrors a similar change in 1.6.0-beta1 in a less roundabout 
way.  If you look at line 5510 of daemon/memcache.c, you will notice:

        c->event.ev_base = tp->base;

It assigns event.ev_base to the thread.base, but I like your change better as 
you do not touch an internal variable in the libevent event structure, which 
could be subject to change in later libevent releases.

Hope this helps and gives you more confidence in your patch :)

Original comment by t...@genx.net on 8 Aug 2011 at 11:27

GoogleCodeExporter commented 9 years ago

That doesn't really help me no... As I said above, that particular case (line 
5510) is entirely new code that the test doesn't even run. That's for a 
specific case when a client using the TAP protocol connects and subscribes to a 
data stream.

The update_event() logic hasn't changed at all, which is where the backtrace 
would always fail.

I'm curious as to what the results of the first patch (with the asserts) are 
for anyone who's actually having this failure.

Original comment by dorma...@rydia.net on 8 Aug 2011 at 3:26

GoogleCodeExporter commented 9 years ago

Okay, sorry about that.  Because I am able to experience a "Segmentation Fault" 
immediately on the official 1.4.6 release, I only did a cursory test of 1.6 (50 
iterations) with no failure.  Did a quick grep to see if anything changed in 
relation to the event.ev_base variable and noticed it being set at line 5510, 
without backtracing everything to confirm that portion of the code is even 
being hit.

Upon further inspection of 1.6, I am noticing there is a local source build of 
libevent that has differing behavior from the standard builds I am installing 
via Gentoo.  I did some primitive debugging in conn_new() to inspect 
c->event.ev_base and base before and after the event_base_set() call.  In 1.6, 
c->event.ev_base is set to the value of base, but in 1.4.6, the value is left 
untouched (remains NULL).  This is the most likely reason why I am not 
experiencing the problem in 1.6.

Which "assert patch" are you referring to (not seeing it in this thread)?

I did a clean build of the 1.4.6 source and was able to crash it immediately.  
I then applied the memcached_thread_base_fix.patch and I have not been able to 
get it to crash.

Original comment by thomc...@gmail.com on 8 Aug 2011 at 4:37

GoogleCodeExporter commented 9 years ago

Duhr... apparently it failed to attach yesterday.

Try this? Without the other fix.

Interesting that the builtin libevent for 1.6 fixes it as well :/ That's just 
the libevent taken straight from the site (latest 1.4). Is Gentoo breaking the 
damn thing?

Original comment by dorma...@rydia.net on 8 Aug 2011 at 5:04

Attachments:

event_base_blah.patch

GoogleCodeExporter commented 9 years ago

(gentoo dev here)
I can't reproduce this on stock Gentoo, 32-bit or 64-bit. I'm wondering if your 
systems have some local changes or an old install of libevent lurking on them, 
maybe in /usr/local.

Can you run this and attach the output please?
# gcc libevent-test.c -o libevent-test.i -E && egrep 'event.h|/usr/local' 
libevent-test.i
(libevent-test.c is your C test case from the start of the bug)

Original comment by robbat2....@gmail.com on 8 Aug 2011 at 7:52

GoogleCodeExporter commented 9 years ago

the "stats" command will also return the libevent version in use...

Original comment by trond.no...@gmail.com on 8 Aug 2011 at 7:56

GoogleCodeExporter commented 9 years ago

I'm going to close out this bug; it turns out the user had an older version of 
event.h in /usr/local, which was getting pulled in before the actual installed 
version.

So I was right in figuring this was impossible. Thanks for the assist, robin!

The most bizarre thing is that even with the wrong structs being used between 
the library and memcached, that with a small patch it continued to function. It 
should've just immediately exploded so far as I'm concerned. The event.h was 
potentially four years older than the installed version of libevent.

Original comment by dorma...@rydia.net on 8 Aug 2011 at 8:05

Changed state: Fixed

Jdesk / memcached

Memcached 1.4.2 server segmentation fault #99