tcmalloc core dump in SLL_next

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Hard to say; I haven't hit it often.

What is the expected output? What do you see instead?
core dump

What version of the product are you using? On what operating system?
gperftools 2.1
Linux 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 
x86_64 x86_64 x86_64 GNU/Linux

Please provide any additional information below.

What I noticed is that the "ThreadCache::FreeList" structure in 
src/thread_cache.h is corrupted.

In gdb, printing "*this" gives the following output:
p *this
$12 = {list_ = 0x156, length_ = 4294967295, lowater_ = 0, max_length_ = 3, 
length_overages_ = 0}

Note that the "length_" data member is UINT_MAX (0xffffffff). The code had 
just decremented "length_" in the Pop() function before calling SLL_Pop(), 
which means "length_" was already zero when Pop() was called and the unsigned 
decrement wrapped around.
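
The wraparound can be illustrated with a minimal sketch. ToyFreeList and LengthAfterPop are hypothetical names, not gperftools code, and the sketch assumes a 32-bit unsigned counter, which is consistent with the 4294967295 printed by gdb:

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for ThreadCache::FreeList (hypothetical, not the real
// class). Only the two fields relevant to the observation are modelled.
struct ToyFreeList {
  void* list_ = nullptr;  // head of the singly linked free list
  uint32_t length_ = 0;   // number of cached objects (assumed 32-bit unsigned)

  // Models only the decrement that happens before the list head is popped.
  void DecrementLikePop() { length_--; }
};

// Returns the length_ value seen after one Pop-style decrement on a list
// whose length_ starts at `start`.
inline uint32_t LengthAfterPop(uint32_t start) {
  ToyFreeList fl;
  fl.length_ = start;
  fl.DecrementLikePop();  // unsigned arithmetic wraps modulo 2^32
  return fl.length_;
}
```

Under these assumptions, LengthAfterPop(0) yields 4294967295, matching the dump, while LengthAfterPop(3) yields 2.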

The stack is as follows (my code calls leveldb, which in turn calls tcmalloc):

#0  SLL_Next (t=0x156) at src/linked_list.h:44
#1  SLL_Pop (list=<optimized out>) at src/linked_list.h:58
#2  Pop (this=0x10e8220) at src/thread_cache.h:215
#3  Allocate (cl=25, size=<error reading variable: Cannot access memory at 
address 0xc8>, 
    this=<optimized out>) at src/thread_cache.h:367
#4  do_malloc_small (size=<error reading variable: Cannot access memory at 
address 0xc8>, 
    heap=<optimized out>) at src/tcmalloc.cc:1088
#5  do_malloc_no_errno (size=512) at src/tcmalloc.cc:1095
#6  cpp_alloc (nothrow=false, size=512) at src/tcmalloc.cc:1423
#7  tc_new (size=512) at src/tcmalloc.cc:1601
#8  0x00007fa1f136cc8f in leveldb::DBImpl::Write(leveldb::WriteOptions const&, 
leveldb::WriteBatch*) ()
   from /usr/local/lib/libleveldb.so.1
#9  0x00007fa1f13676b4 in leveldb::DB::Put(leveldb::WriteOptions const&, 
leveldb::Slice const&, leveldb::Slice const&) () from 
/usr/local/lib/libleveldb.so.1
#10 0x00007fa1f13676f9 in leveldb::DBImpl::Put(leveldb::WriteOptions const&, 
leveldb::Slice const&, leveldb::Slice const&) () from 
/usr/local/lib/libleveldb.so.1

Original issue reported on code.google.com by sanjos...@gmail.com on 14 Nov 2013 at 10:07

GoogleCodeExporter commented 9 years ago


More debugging information, in case you need it. Since this was a debug build, 
I am able to print the tcmalloc::ThreadCache objects for all the threads.

(gdb) p tcmalloc::ThreadCache::tsd_inited_
$30 = true
(gdb) p tcmalloc::ThreadCache::thread_heap_count_
$31 = 74
(gdb) p tcmalloc::ThreadCache::thread_heaps_
$32 = (tcmalloc::ThreadCache *) 0x110ef18
(gdb) p tcmalloc::ThreadCache::threadlocal_data_
$33 = {heap = 0x10e7f98, min_size_for_slow_path = 262145}
(gdb) p tcmalloc::ThreadCache::unclaimed_cache_space_ 
$34 = 0
(gdb) p tcmalloc::ThreadCache::next_memory_steal_ 
$35 = (tcmalloc::ThreadCache *) 0x10e2a98
(gdb) p tcmalloc::ThreadCache::per_thread_cache_size_ 
$36 = 4194304

Original comment by sanjos...@gmail.com on 14 Nov 2013 at 10:34

GoogleCodeExporter commented 9 years ago


I used gdb to traverse the entire "tcmalloc::ThreadCache::thread_heaps_" linked 
list.  I am attaching the output.

See line 5698 of the attachment:
{list_ = 0x156, length_ = 4294967295, lowater_ = 0, max_length_ = 3,

That is the FreeList that got corrupted.
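
A scan like this could also be automated once the structures are extracted from the core. The following is only a sketch: DumpedFreeList, DumpedHeap, LooksCorrupted and CountCorrupted are hypothetical names, the field names follow the gdb output, and the two heuristics (a non-null head inside the first 4096 bytes, a length above half the 32-bit range) are assumptions rather than checks taken from tcmalloc:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified mirrors of the dumped structures.
struct DumpedFreeList {
  uintptr_t list_;       // head pointer as printed by gdb
  uint32_t length_;
  uint32_t max_length_;
};

struct DumpedHeap {
  const DumpedFreeList* lists;  // per-size-class free lists of one thread
  size_t num_lists;
  const DumpedHeap* next;       // next heap in the thread_heaps_ chain
};

// Assumed heuristics: a non-null head below 4096 is treated as an invalid
// heap address, and a length above 0x7fffffff as a wrapped counter.
inline bool LooksCorrupted(const DumpedFreeList& fl) {
  const bool bad_head = fl.list_ != 0 && fl.list_ < 4096;
  const bool bad_length = fl.length_ > 0x7fffffffu;
  return bad_head || bad_length;
}

// Walks the chain of heaps and counts free lists that look corrupted.
inline size_t CountCorrupted(const DumpedHeap* heap) {
  size_t n = 0;
  for (; heap != nullptr; heap = heap->next) {
    for (size_t i = 0; i < heap->num_lists; ++i) {
      if (LooksCorrupted(heap->lists[i])) ++n;
    }
  }
  return n;
}
```

With the values above, the entry {list_ = 0x156, length_ = 4294967295, max_length_ = 3} is flagged by both heuristics.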

Original comment by sanjos...@gmail.com on 14 Nov 2013 at 11:38

Attachments:

tcmalloc_coredump.list

GoogleCodeExporter commented 9 years ago

Thanks for all the details. It's still going to be hard to help you, because 
it's impossible to say whether your app is causing this or whether it is a bug 
in tcmalloc.

Is there any chance you can give me a reasonably sized program so I can 
reproduce this bug myself?

Original comment by alkondratenko on 14 Nov 2013 at 6:45

GoogleCodeExporter commented 9 years ago


The application had been running stably before. This crash occurred after I 
changed leveldb options (disabled compression and enabled the bloom filter on 
my db).

Let me see if I can reproduce it on a smaller scale and also run valgrind.

Does anything obvious strike you in the data structure dump?

Original comment by sanjos...@gmail.com on 15 Nov 2013 at 2:09

GoogleCodeExporter commented 9 years ago

0x156 doesn't tell me anything, sadly. If there's a way to attach a test 
program, please do.

Consider also testing with tools like valgrind or AddressSanitizer.

Original comment by alkondratenko on 17 Nov 2013 at 4:00

GoogleCodeExporter commented 9 years ago

I got a similar problem (core dump) with gperftools 1.7.
Stack info:

(gdb) bt
#0  (anonymous namespace)::cpp_alloc (size=Variable "size" is not available.
) at src/linked_list.h:43
#1  0x00000000006293da in tc_new (size=31) at src/tcmalloc.cc:1521
#2  0x000000302d3901de in std::string::_Rep::_S_create () from 
/usr/lib64/libstdc++.so.6
#3  0x000000302d39259b in std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string$base () from /usr/lib64/libstdc++.so.6
#4  0x000000302d3926b3 in std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string () from /usr/lib64/libstdc++.so.6
#5  0x000000000052d16a in bgcc::TimeUtil::get_time () at time_util.cpp:237
#6  0x0000000000534f26 in bgcc::FileLogDevice::create_filename_suffix 
(this=0x1031d00) at log_device.cpp:200
#7  0x0000000000535126 in bgcc::FileLogDevice::exec_size_split_policy 
(this=0x1031d00, len=91) at log_device.cpp:210
#8  0x0000000000534c89 in bgcc::FileLogDevice::write (this=0x1031d00, 
log_message=@0x7fff78e73ab0) at log_device.cpp:162
#9  0x000000000051f2ff in bgcc::LogDeviceManager::write (this=0x9081f0, 
device_name=0x675c78 "bgcc", log_message=@0x7fff78e73ab0) at 
log_device_manager.cpp:355
#10 0x000000000051a2f3 in bgcc::EventCallback::DataCallback (el=0x3ee0d558, 
fd=379, arg=0x3ed4a000) at event_callback.cpp:134
#11 0x000000000051d438 in bgcc::EventLoop::loop (this=0x3ee0d558) at 
event_poll.cpp:169
#12 0x0000000000518768 in bgcc::EpollServer::serve (this=0x3ed4a000) at 
epoll_server.cpp:77
#13 0x0000000000419fc2 in main (argc=1, argv=0x7fff78e74a38) at 
ims_service.cpp:170

Original comment by daviddan...@gmail.com on 10 Feb 2014 at 11:25

GoogleCodeExporter commented 9 years ago

I got a similar problem when allocating memory with malloc.

Program terminated with signal 11, Segmentation fault.
#0  SLL_Pop (size=16) at ./thirdparty/gperftools-2.0/src/linked_list.h:58
58  ./thirdparty/gperftools-2.0/src/linked_list.h: No such file or directory.
    in ./thirdparty/gperftools-2.0/src/linked_list.h
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 
libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  SLL_Pop (size=16) at ./thirdparty/gperftools-2.0/src/linked_list.h:58
#1  Pop (size=16) at ./thirdparty/gperftools-2.0/src/thread_cache.h:204
#2  Allocate (size=16) at ./thirdparty/gperftools-2.0/src/thread_cache.h:344
#3  do_malloc (size=16) at thirdparty/gperftools-2.0/src/tcmalloc.cc:1068
#4  do_malloc_or_cpp_alloc (size=16) at 
thirdparty/gperftools-2.0/src/tcmalloc.cc:1005
#5  tc_malloc (size=16) at thirdparty/gperftools-2.0/src/tcmalloc.cc:1492
#6  0x00000000004d498a in crawl::LocalDNSCache::PutInCache (this=0x94a6180, 
host=<value optimized out>, ip_info=...) at crawl/base/dns_client.cpp:96
#7  0x00000000004d59c5 in crawl::DnsClientInternal::LookupIp (this=0x36de140, 
hosts=<value optimized out>, ip_infos=0x7fce1854c540)
    at crawl/base/dns_client.cpp:245

Original comment by lifangmi...@gmail.com on 17 Feb 2014 at 9:32

GoogleCodeExporter commented 9 years ago

My understanding from talking to Chrome folks is that this is usually a sign 
of a bug in the application.

Bugs like this were, in particular, the reason why they went for a doubly 
linked list.

Consider running your app under the debug malloc implementation, under 
valgrind, or with -fsanitize=address.
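
To illustrate how an application bug can produce exactly this kind of crash, here is a toy model of a singly linked free list in which each free block stores the address of the next free block in its first bytes. All names (ToyList, Push, PopBlock, HeadAfterStaleWrite) are hypothetical and this is not tcmalloc code; the value 0x156 is borrowed from the dump above purely for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy free list: the head points at the first free block, and every free
// block holds the address of the next one in its first bytes.
struct ToyList {
  void* head = nullptr;
};

inline void* LoadNext(void* block) {
  void* next;
  std::memcpy(&next, block, sizeof(next));
  return next;
}

inline void Push(ToyList* l, void* block) {
  std::memcpy(block, &l->head, sizeof(l->head));
  l->head = block;
}

inline void* PopBlock(ToyList* l) {
  void* block = l->head;
  l->head = LoadNext(block);  // trusts whatever the free block contains
  return block;
}

// Simulates an application that frees a block and then writes through a
// stale pointer to it. Returns the list head left after the next allocation.
inline uintptr_t HeadAfterStaleWrite() {
  static void* pool[2][8] = {};  // backing storage for two toy blocks
  ToyList l;
  Push(&l, pool[0]);
  Push(&l, pool[1]);  // pool[1] is now free and sits at the head

  const uintptr_t stale = 0x156;  // application bug: write after free
  std::memcpy(pool[1], &stale, sizeof(stale));

  PopBlock(&l);  // succeeds, but loads 0x156 as the new head
  return reinterpret_cast<uintptr_t>(l.head);
}
```

In this model the allocation right after the stale write still succeeds. Only the following one would dereference 0x156 and crash, far away from the code that caused the corruption, which is why tools that detect the bad write itself are suggested here.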

Original comment by alkondratenko on 17 Feb 2014 at 7:01

GoogleCodeExporter commented 9 years ago

@alkondratenko

I have two questions for you.

----Consider running your app under debug malloc implementation. Or valgrind. 
Or -fsanitize=address

How do I make the app run under the debug malloc implementation?

Is -fsanitize=address a compiler option?

Thanks.

Original comment by scaler...@gmail.com on 19 Jul 2014 at 4:57

GoogleCodeExporter commented 9 years ago

Yes, -fsanitize=address is a compiler option. Both clang and gcc support it in 
their latest versions.

Regarding debug malloc, it's just a matter of linking with -ltcmalloc_debug

Original comment by alkondratenko on 19 Jul 2014 at 6:21

GoogleCodeExporter commented 9 years ago

I'm going to assume for now that this is an application issue and not a malloc 
bug. If and when you obtain evidence that this is not the app's fault, please 
reopen with as much detail as possible.

Original comment by alkondratenko on 19 Jul 2014 at 6:22

Changed state: CannotReproduce
