Signal Raised in tcmalloc (fetchfromspans method)

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
We couldn't reproduce the problem in our environment either. It happens
randomly.

What is the expected output? What do you see instead?
No signal to be raised from tcmalloc.

What version of the product are you using? On what operating system?
1.2
SUSE Linux x86_64

Please provide any additional information below.
Our application is MULTI-THREADED.
We have signal handlers for each thread, and at random times the signal
handler gets called from inside tcmalloc.
Here is the stack trace.

#0  0x00002b577199c221 in nanosleep () from /lib64/libpthread.so.0
#1  0x0000000000a1b03f in SpinLock::SlowLock ()
#2  0x0000000000a0f81d in tcmalloc::CentralFreeList::RemoveRange ()
#3  0x0000000000a12356 in tcmalloc::ThreadCache::FetchFromCentralCache ()
#4  0x0000000000a1b866 in (anonymous namespace)::cpp_alloc ()
#5  0x0000000000a8a5d8 in operator new ()
#6  0x00002b5771f67ad1 in std::string::_Rep::_S_create ()
#7  0x00002b5771f685d5 in fmod () from /usr/lib64/libstdc++.so.6
#8  0x00002b5771f68782 in std::basic_string<char, std::char_traits<char>,
std::allocator<char> >::basic_string () from /usr/lib64/libstdc++.so.6
#9  0x000000000053ab52 in GLOBALSIGHANDLER()
#10 <signal handler called>
#11 0x0000000000a0f370 in tcmalloc::CentralFreeList::FetchFromSpans ()
#12 0x0000000000a0f767 in tcmalloc::CentralFreeList::RemoveRange ()
#13 0x0000000000a12356 in tcmalloc::ThreadCache::FetchFromCentralCache ()
#14 0x0000000000a1b866 in (anonymous namespace)::cpp_alloc ()
#15 0x0000000000a8a5d8 in operator new ()

We try to save the stack trace in the signal handler by executing 'gstack
<process-id>'. The application is hanging inside the signal handler. 

The hang is obvious to us: if a signal is raised inside TCMALLOC and the
signal handler then allocates even a little memory through TCMALLOC, it will
always hang.

Need help in troubleshooting the issue.

We are really impressed with the performance of TCMALLOC in our system and
it helped a lot in improving the performance other than this reliability
glitch.

Original issue reported on code.google.com by srikanth...@gmail.com on 6 Jan 2010 at 6:18

GoogleCodeExporter commented 9 years ago

We use glibc version 2.4

Original comment by srikanth...@gmail.com on 6 Jan 2010 at 6:43

GoogleCodeExporter commented 9 years ago

The problem is here:
} We try to save the stack trace in the signal handler by executing 'gstack
} <process-id>'. The application is hanging inside the signal handler. 

I don't know very much about 'gstack' -- is it a function or an executable?

The set of functions it's safe to call from a signal handler is extremely small;
you can use 'man 7 signal' to see.  If you're calling gstack as an executable, I
guess you're using fork + exec?  In any case, my guess is you're doing dangerous
stuff in the signal handler.

The proximate problem seems to be that you are ending up with a recursive call
to malloc (probably because the signal is delivered in the middle of a memory
allocation, and then your signal handler goes and tries to allocate memory),
hence the hang you're seeing.  (I think that's what you were describing in your
bug report.)

I don't know enough about your application to suggest a fix, but in general,
trying to do any substantive work in a signal handler is prone to peril.  I
think this is what you're experiencing here.  I'm not sure why you need stack
traces at signal time -- for your own CPU profiler? -- but if you want, you can
look at profiler.cc in perftools to see how we collect stack traces in a signal
handler in an async-signal-safe way.
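
In the stack trace above, frames #6 to #9 show GLOBALSIGHANDLER constructing a
std::string, which calls operator new while the interrupted thread is still
inside tcmalloc. A minimal sketch of a handler that avoids this, assuming it
only needs to report the signal number and exit, could look like the following.
The names FormatDecimal, SafeCrashHandler and InstallSafeCrashHandler are
hypothetical and are not part of perftools; the handler restricts itself to
write() and _exit(), which 'man 7 signal' lists as async-signal-safe.

```cpp
#include <signal.h>
#include <unistd.h>
#include <cstddef>

// Formats value as decimal digits into buf without allocating.
// Returns the number of characters written; buf is not NUL-terminated.
std::size_t FormatDecimal(unsigned long value, char* buf, std::size_t cap) {
  char tmp[24];  // enough for a 64-bit value (at most 20 digits)
  std::size_t n = 0;
  do {
    tmp[n++] = static_cast<char>('0' + value % 10);
    value /= 10;
  } while (value != 0 && n < sizeof(tmp));
  std::size_t len = n < cap ? n : cap;
  for (std::size_t i = 0; i < len; ++i) buf[i] = tmp[n - 1 - i];
  return len;
}

// Hypothetical handler: no std::string, no operator new, no stdio.
extern "C" void SafeCrashHandler(int signo) {
  static const char kPrefix[] = "fatal signal ";
  char num[24];
  std::size_t len =
      FormatDecimal(static_cast<unsigned long>(signo), num, sizeof(num));
  ssize_t ignored = write(STDERR_FILENO, kPrefix, sizeof(kPrefix) - 1);
  ignored = write(STDERR_FILENO, num, len);
  ignored = write(STDERR_FILENO, "\n", 1);
  (void)ignored;
  _exit(128 + signo);  // leave immediately; the heap may be inconsistent
}

// Installs the handler for SIGSEGV; returns true on success.
bool InstallSafeCrashHandler() {
  struct sigaction sa = {};
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  sa.sa_handler = SafeCrashHandler;
  return sigaction(SIGSEGV, &sa, nullptr) == 0;
}
```

Calling InstallSafeCrashHandler() once at startup would be enough for this
sketch. Collecting a full stack trace safely is more involved; profiler.cc is
the place to look for that.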

Original comment by csilv...@gmail.com on 7 Jan 2010 at 2:14

Changed state: Invalid
Added labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

We do agree that doing something inside a signal handler is not so safe.

But our only question is: WHY IS A SIGNAL BEING RAISED AT ALL? We installed the
signal handler to catch this.

We need your help in understanding WHAT CAUSES A SEGMENTATION-FAULT-LIKE SIGNAL
IN TCMALLOC?

After the signal is raised, the hang is due to a recursive call, which we will
fix anyhow.

Original comment by srikanth...@gmail.com on 7 Jan 2010 at 8:22

GoogleCodeExporter commented 9 years ago

I'm sorry, I misunderstood you the first time.  I think I understand better now.

What signal is being raised?  tcmalloc doesn't raise any signals itself, but of 
course if the signal is SIGSEGV or something, it could be raised due to a bug.

But my guess is that the issue is with the kernel.  From 'man 7 signal':
---
       A process-directed signal may be delivered to any one of the threads that
       does  not  currently  have the signal blocked.  If more than one of the
       threads has the signal unblocked, then the kernel chooses an  arbitrary
       thread to which to deliver the signal.
---

Most signals are process-directed; if yours is, then it's just a coincidence
that it gets delivered while a thread is inside tcmalloc.  But it's hard to say
more without knowing what the signal is.
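
One way to find out which signal arrives, and whether it comes from a real
fault or was sent by another process or thread, is to install the handler with
SA_SIGINFO and record si_code. The sketch below is only illustrative:
RecordSignal, LastSignalWasFault and InstallRecorder are hypothetical names,
and it assumes the Linux convention that si_code is positive for
kernel-generated faults and zero or negative for signals sent with kill() or
raise().

```cpp
#include <signal.h>

// Last signal observed by the handler (plain stores are async-signal-safe).
volatile sig_atomic_t g_last_signo = 0;
volatile sig_atomic_t g_last_code = 0;

// Hypothetical handler: records which signal arrived and its si_code.
extern "C" void RecordSignal(int signo, siginfo_t* info, void* /*context*/) {
  g_last_signo = signo;
  g_last_code = info->si_code;
  bool is_fault_signal = signo == SIGSEGV || signo == SIGBUS ||
                         signo == SIGFPE || signo == SIGILL;
  if (is_fault_signal && info->si_code > 0) {
    // Real fault: restore the default action and return, so the faulting
    // instruction runs again and the process dies with the default action.
    signal(signo, SIG_DFL);
  }
}

// True if the last recorded signal was a kernel-generated fault
// (assumption: Linux, where si_code > 0 for faults and <= 0 for kill/raise).
bool LastSignalWasFault() { return g_last_code > 0; }

// Installs RecordSignal for signo; returns true on success.
bool InstallRecorder(int signo) {
  struct sigaction sa = {};
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_SIGINFO;
  sa.sa_sigaction = RecordSignal;
  return sigaction(signo, &sa, nullptr) == 0;
}
```

A fault such as SIGSEGV is delivered to the thread that caused it, so a
positive si_code here would point at the thread's own memory access rather
than at an arbitrary choice by the kernel.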

Original comment by csilv...@gmail.com on 7 Jan 2010 at 4:43

Changed state: New

GoogleCodeExporter commented 9 years ago

We found the signal raised to be SIGSEGV.

Thanks for your comment.

----
if the signal is SIGSEGV or something, it could be raised due to a bug.
Do you mean a bug in tcmalloc?
----

We are sure there is no OUT OF MEMORY situation. Is there any other way a
SIGSEGV could occur in TCMALLOC (FetchFromSpans method)?

The reason for asking specifically about the FetchFromSpans method is that we
have about 3 stack trace files, all showing the last call to be the
FetchFromSpans() method. Interestingly, one stack trace yesterday showed the
ReleaseToCentralCache() method too.

To explain our setup better:

We have a Global std::map, which is read from ~6 threads (depending on the
cores). Each thread reads data from the Global Data, processes it, and stores
the processed data in a LOCAL THREAD MAP.

There is no mutex for reading from the map. If the Global data needs to be
modified, we just stop all the threads and spawn another thread to modify the
data (to make sure only one thread modifies the data at a time), and then
re-create the threads for reading again.

We are encountering the SIGSEGV in TCMALLOC (again, judging from stack traces
only) when a thread is ALLOCATING memory for its LOCAL data.

We will, however, re-check the 'man 7 signal' explanation on our Linux PC too.
If we get the same explanation, we will analyze the state of the other threads
thoroughly one more time to check whether any of them could have raised a
signal (which would at least confirm whether TCMALLOC is getting the signal
randomly or NOT).

Original comment by srikanth...@gmail.com on 8 Jan 2010 at 12:17

GoogleCodeExporter commented 9 years ago

Yes, as you said, the signal raised is SIGSEGV only. According to the man pages,
SIGSEGV and SIGFPE are specific to threads, which more or less confirms that a
SIGSEGV occurred in TCMALLOC. We are using version 1.2.

Any help in guiding us to resolve the issue is greatly appreciated.

Original comment by srikanth...@gmail.com on 8 Jan 2010 at 3:30

GoogleCodeExporter commented 9 years ago

Usually, a FetchFromSpans error means memory corruption somewhere else in your
application.  One particular situation where you might see a crash in one memory
allocator, but not another, is if you have a heap overflow.

In our experience, crashes in FetchFromSpans are not due to a bug in tcmalloc,
but rather a memory problem in the application.  Those can be tricky to track
down -- you may try valgrind or some similar tool to help.

I'll leave the bug open in case you find out any more info relating to the
crash.  But there's very little that can be done remotely -- you'll just have to
try to track down what is going wrong with your memory.

Original comment by csilv...@gmail.com on 8 Jan 2010 at 4:31

GoogleCodeExporter commented 9 years ago

(Sorry, I meant 'stack overflow', not 'heap overflow'.)

Original comment by csilv...@gmail.com on 8 Jan 2010 at 3:45

GoogleCodeExporter commented 9 years ago

Thanks for your advice.

We are looking into a couple of memory-related changes.

We have done the following before.

Memory held by STL vectors and maps is not returned to the operating system
when vector.clear() or map.clear() is called. It stays with the process, and
only when the vector or map goes out of scope is it returned to the OS.

To force the memory to be returned to the OS, we implemented the following kind
of code (swapping with an empty vector):

#include <vector>
typedef std::vector<int> vInt;
vInt m_vector;
// ... code that uses m_vector ...
vInt().swap(m_vector);  // swap with an empty temporary to drop the storage

Some time later, we use m_vector again.

Will this create any problem with respect to tcmalloc?
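
For reference, the swap idiom described above can be sketched as a small
self-contained function. ReleaseStorage is a hypothetical name used only for
illustration. The idiom is standard C++ and leaves the vector in a valid, empty
state, so reusing it afterwards is fine at the language level. Whether the
freed memory actually goes back to the OS depends on the allocator, which may
keep it cached for later allocations.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<int> vInt;

// Releases the vector's storage by swapping it with an empty temporary.
// Returns the capacity the vector had before the swap, for comparison.
std::size_t ReleaseStorage(vInt& v) {
  std::size_t before = v.capacity();
  // The temporary takes over the old buffer and frees it when it is
  // destroyed at the end of this statement.
  vInt().swap(v);
  return before;
}
```

After the call the vector is empty and can be filled again with push_back() as
usual; the first insertion simply allocates a fresh buffer.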

Original comment by srikanth...@gmail.com on 11 Jan 2010 at 11:44

GoogleCodeExporter commented 9 years ago

For the record, we too were seeing similar crashes, and it turned out to be bugs
in our code, such as double-freeing memory. It might help if you linked against
tcmalloc_debug and ran your test. At least it would detect these kinds of
errors, if they are there, and provide the necessary information to eliminate
them.

Original comment by andrey.s...@gmail.com on 16 Jan 2010 at 9:49

GoogleCodeExporter commented 9 years ago

Initially we couldn't reproduce the issue. But now, with a certain combination,
we are able to reproduce it.

In one of our scenarios, the program is crashing in TCMalloc (at least the stack
is showing that).

If we DO NOT link with TCMalloc, it is WORKING fine.

We wonder: are there any particular cases where MALLOC is LENIENT and TCMALLOC
is NOT? If you guys already know of such cases, please do let us know.

We will try linking to tcmalloc_debug to see if it helps to narrow down the
issue.
Original comment by srikanth...@gmail.com on 29 Jan 2010 at 3:58

GoogleCodeExporter commented 9 years ago

The issue seems to be resolved.

Linking to the debug binary helped us resolve the issue. Thank you all for your
support. tcmalloc is not as lenient as malloc during memory allocation. A
strcpy/delete combination resulted in corruption.

This corruption should happen with MALLOC as well, but we really have no idea
why malloc hides the issue and does not corrupt memory.
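
The thread does not show the offending code, so the following is only a guess
at the class of bug a "strcpy/delete combination" usually means: a buffer that
is one byte too small for the terminating NUL that strcpy writes, or a buffer
from new[] released with plain delete. A minimal sketch of the safe pattern,
using the hypothetical helpers DuplicateCString and FreeCString:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies a C string into a new[]-allocated buffer.
// The +1 reserves room for the terminating NUL, which strcpy also writes.
char* DuplicateCString(const char* src) {
  std::size_t len = std::strlen(src);
  char* copy = new char[len + 1];
  std::memcpy(copy, src, len + 1);  // copies the characters and the NUL
  return copy;
}

// Buffers obtained from new[] must be released with delete[],
// not with plain delete or free().
void FreeCString(char* s) { delete[] s; }
```

An overrun of even one byte is undefined behavior under any allocator. Whether
it shows up as a crash depends on what happens to sit next to the buffer, which
differs between malloc implementations.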

Original comment by srikanth...@gmail.com on 29 Jan 2010 at 8:54

GoogleCodeExporter commented 9 years ago

The issue is resolved and should be closed as NOT A BUG, as the problem is in
our code. Thanks once again for the support.

Original comment by srikanth...@gmail.com on 29 Jan 2010 at 4:05

GoogleCodeExporter commented 9 years ago

Great, thanks for the update.

Original comment by csilv...@gmail.com on 29 Jan 2010 at 5:28

Changed state: NotABug

caohaiwd / gperftools

Signal Raised in tcmalloc (fetchfromspans method) #203