Closed GoogleCodeExporter closed 9 years ago
We use glibc version 2.4
Original comment by srikanth...@gmail.com
on 6 Jan 2010 at 6:43
The problem is here:
} We try to save the stack trace in the signal handler by executing 'gstack
} <process-id>'. The application is hanging inside the signal handler.
I don't know very much about 'gstack' -- is it a function or an executable?
The set of functions it's safe to call from a signal handler is extremely
small; you can use 'man 7 signal' to see the list. If you're calling gstack as
an executable, I guess you're using fork + exec? In any case, my guess is
you're doing dangerous stuff in the signal handler.
The proximate problem seems to be that you are ending up with a recursive call
to malloc (probably because the signal is delivered in the middle of a memory
allocation, and then your signal handler goes and tries to allocate memory),
hence the hang you're seeing. (I think that's what you were describing in your
bug report.)
I don't know enough about your application to suggest a fix, but in general,
trying to do any substantive work in a signal handler is perilous. I think this
is what you're experiencing here. I'm not sure why you need stack traces at
signal time -- for your own CPU profiler? -- but if you want, you can look at
profiler.cc in perftools to see how we collect stack traces in a signal handler
in an async-signal-safe way.
Original comment by csilv...@gmail.com
on 7 Jan 2010 at 2:14
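To illustrate the advice above, here is a minimal sketch of a handler that limits itself to async-signal-safe calls. It is not taken from profiler.cc or from the reporter's application; the names OnSignal, InstallHandler and g_signal_seen are hypothetical. The handler only calls write(2) and records the signal number in a flag, so any heavier work (such as collecting a stack trace) can be done later from normal code, outside the handler.

```cpp
#include <cerrno>
#include <csignal>
#include <cstring>
#include <unistd.h>

// Hypothetical flag set by the handler. volatile sig_atomic_t can be written
// from a handler and read from normal code.
static volatile std::sig_atomic_t g_signal_seen = 0;

// Handler restricted to async-signal-safe calls: no malloc, no stdio.
// write(2) is on the async-signal-safe list in 'man 7 signal'.
extern "C" void OnSignal(int signo) {
  const int saved_errno = errno;  // write() may clobber errno
  static const char kMsg[] = "signal caught; deferring work\n";
  ssize_t ignored = write(STDERR_FILENO, kMsg, sizeof(kMsg) - 1);
  (void)ignored;
  g_signal_seen = signo;
  errno = saved_errno;
}

// Installs OnSignal for the given signal; returns true on success.
bool InstallHandler(int signo) {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = OnSignal;
  sigemptyset(&sa.sa_mask);
  return sigaction(signo, &sa, nullptr) == 0;
}
```

A caller would run InstallHandler(SIGUSR1) once at startup and check g_signal_seen from its main loop. The sketch is best tried with a harmless signal such as SIGUSR1: returning normally from a handler for a genuine SIGSEGV fault re-executes the faulting instruction, so a real crash handler would have to end the process instead.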
We do agree that doing something inside a signal handler is not so safe.
But our real question is: WHY IS A SIGNAL RAISED AT ALL? We installed the
signal handler to catch it.
We need your help in understanding WHAT CAUSES A SEGMENTATION-FAULT-LIKE SIGNAL
IN TCMALLOC?
After the signal is raised, the hang is due to a recursive call, which we will
fix anyhow.
Original comment by srikanth...@gmail.com
on 7 Jan 2010 at 8:22
I'm sorry, I misunderstood you the first time. I think I understand better now.
What signal is being raised? tcmalloc doesn't raise any signals itself, but of
course if the signal is SIGSEGV or something, it could be raised due to a bug.
But my guess is that the issue is with the kernel. From 'man 7 signal':
---
A process-directed signal may be delivered to any one of the threads that
does not currently have the signal blocked. If more than one of the
threads has the signal unblocked, then the kernel chooses an arbitrary
thread to which to deliver the signal.
---
Most signals are process-directed; if yours is, then it's just a coincidence
that it gets delivered to a thread that is inside tcmalloc. But it's hard to
say more without knowing what the signal is.
Original comment by csilv...@gmail.com
on 7 Jan 2010 at 4:43
We found the signal raised to be SIGSEGV.
Thanks for your comment.
----
if the signal is SIGSEGV or something, it could be raised due to a bug.
Do you mean a bug in tcmalloc?
----
We are sure there is no OUT OF MEMORY situation. Is there any other way a
SIGSEGV could occur in TCMALLOC (in the FetchFromSpans method)?
The reason for asking specifically about the FetchFromSpans method is that we
have about 3 stack trace files, and all of them show the last call to be the
FetchFromSpans() method. Interestingly, one stack trace yesterday showed the
ReleaseToCentralCache() method too.
To explain our setup better:
We have a global std::map, which is read by ~6 threads (depending on the
number of cores). When each thread reads data from the global map, it processes
the data and stores the processed data in a LOCAL THREAD MAP.
There is no mutex for reading from the map. If the global data needs to be
modified, we just stop all the threads and spawn another thread to modify the
data (to make sure only one thread modifies the data at a time), and then
re-create the threads for reading again.
We are encountering the SIGSEGV in TCMALLOC (again, judging from stack traces
only) when a thread is ALLOCATING memory for its LOCAL data.
We will, however, re-check the 'man 7 signal' explanation on our Linux PC too.
If we get the same explanation, we will analyze the state of the other threads
thoroughly one more time to check whether any of them could have caused the
signal (which would at least confirm whether TCMALLOC received the signal by
coincidence or NOT).
Original comment by srikanth...@gmail.com
on 8 Jan 2010 at 12:17
Yes, as you said, the signal raised is SIGSEGV only. According to the man
pages, SIGSEGV and SIGFPE are specific to threads, which kind of confirms that
the SIGSEGV occurred in TCMALLOC. We are using version 1.2.
Any help in guiding us to resolve the issue is greatly appreciated.
Original comment by srikanth...@gmail.com
on 8 Jan 2010 at 3:30
Usually, a FetchFromSpans error means memory corruption somewhere else in your
application. One particular situation where you might see a crash in one memory
allocator, but not another, is if you have a heap overflow.
In our experience, crashes in FetchFromSpans are not due to a bug in tcmalloc,
but rather a memory problem in the application. Those can be tricky to track
down -- you may try valgrind or some similar tool to help.
I'll leave the bug open in case you find out any more info relating to the
crash. But there's very little that can be done remotely -- you'll just have to
try to track down what is going wrong with your memory.
Original comment by csilv...@gmail.com
on 8 Jan 2010 at 4:31
(Sorry, I meant 'stack overflow', not 'heap overflow'.)
Original comment by csilv...@gmail.com
on 8 Jan 2010 at 3:45
Thanks for your advice.
We are looking into a couple of memory-related changes.
We have done the following before.
Memory held by an STL vector or map is not returned to the operating system
when vector.clear() or map.clear() is called. It stays with the process, and is
returned to the OS only when the vector or map goes out of scope.
To force the memory to be returned to the OS, we implemented the following kind
of code (swapping with an empty vector):
#include <vector>
typedef std::vector<int> vInt;
vInt m_vector;
// ... code that uses m_vector ...
vInt().swap(m_vector);  // swap with an empty temporary to release the storage
After some time, we use m_vector again.
Will this create any problem with respect to tcmalloc?
Original comment by srikanth...@gmail.com
on 11 Jan 2010 at 11:44
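For reference, the swap idiom described above can be sketched as follows. ReleaseStorage is a hypothetical helper name, not something from the reporter's code. The empty temporary takes over the old buffer and frees it when the temporary is destroyed at the end of the statement; the vector itself stays valid and can be refilled afterwards. Note that the buffer is handed back to the allocator (malloc or tcmalloc); whether the allocator then returns the pages to the operating system is up to the allocator.

```cpp
#include <cstddef>
#include <vector>

// Releases the vector's storage by swapping it with an empty temporary.
// Returns the capacity the vector held before the swap.
std::size_t ReleaseStorage(std::vector<int>& v) {
  const std::size_t old_capacity = v.capacity();
  std::vector<int>().swap(v);  // the temporary now owns the old buffer
  return old_capacity;         // the old buffer has already been freed here
}
```

After the call the vector is empty, and reusing it (for example with push_back) simply allocates a fresh buffer, so the idiom by itself is well-defined with any allocator.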
For the record, we too were seeing similar crashes, and they turned out to be
bugs in our code, such as freeing memory twice. It might help if you linked
against tcmalloc_debug and ran your test. At least it would detect these kinds
of errors, if they are there, and provide the information needed to eliminate
them.
Original comment by andrey.s...@gmail.com
on 16 Jan 2010 at 9:49
Initially we couldn't reproduce the issue. But now, with a certain combination,
we are able to reproduce it.
In one of our scenarios, the program is crashing in TCMalloc (at least the
stack shows that).
If we DO NOT link with TCMalloc, it is WORKING fine.
We wonder: are there any particular cases where MALLOC is LENIENT and TCMALLOC
is NOT?
If you guys already know of such cases, please do let us know.
We will try linking to tcmalloc_debug to see if it helps to narrow down the
issue.
Original comment by srikanth...@gmail.com
on 29 Jan 2010 at 3:58
This seems to have resolved the issue.
Linking to the debug binary helped us resolve the issue. Thank you all for your
support.
tcmalloc is not as lenient as malloc during memory allocation. A strcpy/delete
combination resulted in the corruption.
This corruption should happen with MALLOC also, but we really have no idea why
malloc hides the issue and does not show the corruption.
Original comment by srikanth...@gmail.com
on 29 Jan 2010 at 8:54
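The thread does not show the offending code, so the following is only a hypothetical sketch of how a strcpy/delete pair can be written safely. Two common mistakes with that pair are copying into a buffer that has no room for the terminating '\0' and releasing a new[] allocation with plain delete. CopyCString is an assumed helper name used purely for illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Copies a C string into a freshly allocated buffer.
// The buffer is sized strlen(src) + 1 so the terminating '\0' fits, and
// std::unique_ptr<char[]> releases it with delete[], matching new[].
std::unique_ptr<char[]> CopyCString(const char* src) {
  const std::size_t len = std::strlen(src);
  std::unique_ptr<char[]> dst(new char[len + 1]);
  std::memcpy(dst.get(), src, len + 1);  // copies the terminator as well
  return dst;
}
```

An overflow of even one byte past a heap block may land in padding under one allocator and in live allocator bookkeeping under another, which is one plausible reason the same bug can stay hidden with malloc and crash with tcmalloc.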
The issue is resolved and can be closed as NOT A BUG, as the issue is with our
code. Thanks once again for the support.
Original comment by srikanth...@gmail.com
on 29 Jan 2010 at 4:05
Great, thanks for the update.
Original comment by csilv...@gmail.com
on 29 Jan 2010 at 5:28
Original issue reported on code.google.com by
srikanth...@gmail.com
on 6 Jan 2010 at 6:18