casseopea2 / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License

bus error in OS X when used with multiple shared .bundle's #84

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Just an FYI, tcmalloc doesn't work with Ruby on OS X.

What steps will reproduce the problem?
1. download ruby "enterprise edition"
2. apply the patches in [1]
3. run irb from the built version.

What is the expected output?
on OS X running irb results in an immediate crash.

Thanks!
-=R

[1]
http://groups.google.com/group/emm-ruby/browse_thread/thread/a06a2e7450747b3d

Original issue reported on code.google.com by rogerpack2005 on 22 Oct 2008 at 10:45

GoogleCodeExporter commented 9 years ago
Hmm, I'll see if I can try to reproduce this, but I don't have root access on
any of my test Macs, so it may be difficult.  Almost certainly, what is
happening is that readline() is trying to use tcmalloc free on something
allocated with libc malloc.  The most common reason for this is that tcmalloc
is not specified last on the linkline.  Can you verify if it is in the ruby
build file?  If readline comes after tcmalloc, it would explain the behavior
seen here.

Original comment by csilv...@gmail.com on 22 Oct 2008 at 11:03

GoogleCodeExporter commented 9 years ago
I've downloaded ruby and looked at install.rb, and it seems to me that is
indeed what is happening.  While I don't know ruby, it looks like this code is
adding tcmalloc to PRELIBS, and putting PRELIBS before LIBS on the make
commandline:

  if tcmalloc_supported?
    makefile = File.read('Makefile')
    if makefile !~ /\$\(PRELIBS\)/
      makefile.sub!(/^LIBS = (.*)$/, 'LIBS = $(PRELIBS) \1')
      File.open('Makefile', 'w') do |f|
        f.write(makefile)
      end
    end
    return sh("make PRELIBS='-Wl,-rpath,#{@prefix}/lib -L#{@destdir}#{@prefix}/lib -ltcmalloc_minimal'")

tcmalloc needs to go last on the link line.  I think the install script will
either need to append -ltcmalloc_minimal to LIBS, or else add a POSTLIBS
variable that goes after LIBS on the commandline.

I'm closing this bug.  If it turns out that you see the problem even after
putting tcmalloc last on the link line, feel free to reopen it.

Original comment by csilv...@gmail.com on 29 Oct 2008 at 11:29

GoogleCodeExporter commented 9 years ago
Thanks for your work on this.

Original comment by rogerpack2005 on 7 Nov 2008 at 9:47

GoogleCodeExporter commented 9 years ago
Hi csilvers. I'm one of the Ruby Enterprise Edition developers.

I've tried changing the linking order as suggested, but it doesn't seem to
help. The binary is linked as follows:

  gcc -g -O2 -pipe -fno-common  -DRUBY_EXPORT  -L.    main.o dmydln.o libruby-static.a -Wl,-rpath,/tmp/r8ee/lib -L/tmp/r8ee/lib -ldl -lobjc  -ltcmalloc_minimal -o miniruby

The resulting binary crashes with a Bus Error during startup. This is on MacOS X
10.5.5 server. Even though the machine is 64-bit, the compiler generates 32-bit
binaries by default.

What else could be wrong?

Original comment by honglilai@gmail.com on 14 Nov 2008 at 1:17

GoogleCodeExporter commented 9 years ago
Some additional information:

On this machine, specifying -ltcmalloc_minimal as the first argument results in
a partially working binary. irb (a moderately complex Ruby script) runs fine,
until one tries to load the readline library, which causes Ruby to crash.

On 32-bit Ubuntu Linux, everything works fine.

Original comment by honglilai@gmail.com on 14 Nov 2008 at 1:21

GoogleCodeExporter commented 9 years ago
Thanks for the report.  This behavior is all very interesting.

Is it possible to run the binary in gdb, so you can get a stack trace on the bus
error during startup?  My guess is it's caused by some routine in either
libobjc or, more likely, libdl.  Perhaps they are making assumptions about
using the libc malloc, or perhaps we are making assumptions that they are
violating.

You say loading the readline library causes a ruby crash.  Does ruby do a
dlopen on libreadline.so when it loads the readline library?  I'm guessing the
problem in that case is that libdl is in a weird state, because the stuff that
runs at global-construct time was using the libc malloc, while the stuff
running at dlopen time is using the tcmalloc malloc.

I think the right course of action is to figure out what's causing the crash
when tcmalloc is last on the linkline, and go from there.  I know this is kind
of a side issue for you guys, and I understand if you don't want to put that
much time into it, but any info you can get from poking around with a debugger
would be very helpful.

Original comment by csilv...@gmail.com on 14 Nov 2008 at 2:11

GoogleCodeExporter commented 9 years ago
I think I've figured out what the problem is. It's because OS X uses a two-level
namespace scheme for symbols, as opposed to the flat namespace scheme on most
Unix systems.

The readline library does not seem to be linked with -flat_namespace, and so it
probably uses the two-level namespace scheme. The readline library was not
linked to tcmalloc at compile time. So even though the Ruby interpreter is
linked to tcmalloc, the readline library still calls the system's malloc() and
free() instead of tcmalloc's.

The Ruby readline extension *is* compiled with -flat_namespace. It calls
readline(), which returns a string allocated by the system allocator. It then
calls tcmalloc's free() on that data, and things go boom.

So this is not strictly a tcmalloc problem, and you can close this issue.
Thanks for your assistance though. :)

I've tried setting DYLD_FORCE_FLAT_NAMESPACE=1, which forces the entire process
to use a flat namespace for symbols, but the app would crash at startup,
probably because OS X system libraries don't expect a different memory
allocator being used. I guess I'll have to find a different way to solve this.

Original comment by honglilai@gmail.com on 14 Nov 2008 at 11:00

GoogleCodeExporter commented 9 years ago
Interesting.  It should be possible, in glibc-based systems, to have tcmalloc
fall back on calling glibc routines when it doesn't recognize where the pointer
came from.  This would make tcmalloc correctly free pointers allocated by
glibc, which I guess would fix the problem here.

I'm toying with having this happen for the next release of perftools.  Do you
think this is a good idea?  I admit it seems pretty hacky to me, and can hide
real bugs.  But would it solve the problem you're seeing?
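
A minimal sketch of that fallback idea, under stated assumptions: `ToyArena` is a hypothetical stand-in for tcmalloc (a fixed-buffer bump allocator, nothing like the real implementation), and `FreeWithFallback` is an invented name. It only shows the shape of the check: recognize your own pointers, and hand everything else to the system allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for tcmalloc: a fixed-buffer bump allocator that can
// tell whether a pointer came from its own buffer.
class ToyArena {
 public:
  void* Allocate(std::size_t size) {
    constexpr std::size_t kAlign = alignof(std::max_align_t);
    if (size == 0 || size > sizeof(buffer_)) return nullptr;
    // Round up so every returned pointer stays suitably aligned.
    const std::size_t rounded = (size + kAlign - 1) / kAlign * kAlign;
    if (rounded > sizeof(buffer_) - used_) return nullptr;
    void* ptr = buffer_ + used_;
    used_ += rounded;
    return ptr;
  }

  // True if ptr points into this arena's buffer.
  bool Owns(const void* ptr) const {
    const auto p = reinterpret_cast<std::uintptr_t>(ptr);
    const auto base = reinterpret_cast<std::uintptr_t>(buffer_);
    return p >= base && p < base + sizeof(buffer_);
  }

  void Free(void* ptr) {
    assert(Owns(ptr));  // callers must route foreign pointers elsewhere
    (void)ptr;          // a bump allocator has nothing to reclaim
  }

 private:
  alignas(std::max_align_t) unsigned char buffer_[4096];
  std::size_t used_ = 0;
};

enum class FreedBy { kArena, kSystem };

// Frees ptr with the arena if the arena recognizes it; otherwise assumes the
// system allocator produced it and forwards to std::free.
FreedBy FreeWithFallback(ToyArena& arena, void* ptr) {
  if (arena.Owns(ptr)) {
    arena.Free(ptr);
    return FreedBy::kArena;
  }
  std::free(ptr);
  return FreedBy::kSystem;
}
```

This covers only one direction: our free() being handed a foreign pointer. It does nothing for the opposite case, where the system free() is handed a pointer that came from the replacement allocator.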

Original comment by csilv...@gmail.com on 15 Nov 2008 at 2:28

GoogleCodeExporter commented 9 years ago
Oh yes, it would be a great idea. I'm currently solving it by linking Ruby to an
external library which isn't compiled with -flat_namespace. This library
provides wrapper functions for the system allocator's malloc() and free(). Ruby
extensions call these wrapper functions to release memory that they know is
allocated with the system allocator instead of tcmalloc.
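
Roughly, the wrapper library looks like the sketch below. The function names are made up for illustration and are not the ones used in Ruby Enterprise Edition. The code itself is trivial; what matters is how the library is linked: it is built as its own shared library without -flat_namespace, so that its malloc() and free() references stay bound to the system allocator rather than to tcmalloc.

```cpp
#include <cstddef>
#include <cstdlib>

extern "C" {

// Hypothetical wrapper names. When this file is built into a separate
// library that uses the two-level namespace, these calls are expected to
// resolve to the system allocator even if the main binary links tcmalloc.
void* system_malloc_wrapper(std::size_t size) {
  return std::malloc(size);
}

void system_free_wrapper(void* ptr) {
  std::free(ptr);
}

}  // extern "C"

// Example of how an extension would release a string that it knows was
// allocated by the system allocator (for instance, one returned by
// readline()).
inline void ReleaseSystemString(char* line) {
  system_free_wrapper(line);
}
```

Compiled into a single translation unit as shown, the wrappers simply call whatever malloc() the program is linked against; the separation only exists once they live in their own library.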

However, this is MacOS X, which as far as I know does not use glibc.

Original comment by honglilai@gmail.com on 15 Nov 2008 at 9:02

GoogleCodeExporter commented 9 years ago
Ok, I'm going to close this bug WontFix, since I don't know how to do the
fallback-call on non-glibc systems.  If it turns out OS X libc also declares
malloc using a weak-symbol alias to another function (like glibc does with
malloc and __libc_malloc), then I can use that to call the system malloc when
the glibc malloc fails.  If you know of such a technique, feel free to reopen
this bug!

Original comment by csilv...@gmail.com on 13 Dec 2008 at 1:47

GoogleCodeExporter commented 9 years ago
On Mac OS X you should be able to fall back to the malloc zone APIs to free
memory that was allocated from any of the system allocators.  Something like:

malloc_zone_t* zone = malloc_zone_from_ptr(ptr);
if (zone) {
    malloc_zone_free(zone, ptr);
} else {
    // ptr wasn't allocated by any of the system allocators.
}

Original comment by mark.r...@gmail.com on 30 Jan 2009 at 11:51

GoogleCodeExporter commented 9 years ago
Hongli, do you want to try Mark's technique and see if it works for you?  I'm
still not sure I want to support it by default, since it may hide real bugs,
but if it fixes problems like yours that will definitely weigh in my decision.

It should be easy to try out:
1) Add #include <malloc/malloc.h> to tcmalloc.cc
2) Add the following to the top of InvalidFree(), in tcmalloc.cc:
   malloc_zone_t* zone = malloc_zone_from_ptr(ptr);
   if (zone) {
       malloc_zone_free(zone, ptr);
       return;
   }
3) Do the equivalent, using malloc_zone_realloc, in InvalidRealloc if you need to.

Let me know how it works!

craig

Original comment by csilv...@gmail.com on 30 Jan 2009 at 11:09

GoogleCodeExporter commented 9 years ago
Hongli, did you have any chance to try this out?

Original comment by csilv...@gmail.com on 18 Apr 2009 at 12:16

GoogleCodeExporter commented 9 years ago
Sorry, I forgot about this issue but I just gave it a try.

I modified InvalidFree, but there's no InvalidRealloc function. There is
InvalidGetSizeForRealloc, but I can't call malloc_zone_realloc there because I
need a 'size' argument.

Modifying just InvalidFree doesn't seem to work. I get this:

  malloc: *** error for object 0x535330: Non-aligned pointer being freed (2)

This is in fact an error printed by OS X's memory allocator. The problem is
that OS X is trying to free memory by tcmalloc, so I don't think there's
anything tcmalloc can do about this.

Original comment by honglilai@gmail.com on 18 Apr 2009 at 7:59

GoogleCodeExporter commented 9 years ago
Sorry, I forgot to mention, InvalidRealloc went away once I realized it didn't
work as written.  In the latest perftools release, you would indeed have to
call InvalidGetSizeForRealloc, which would need to call malloc_zone_<msize>,
where <msize> is whatever function exists in OS X to get the memory-size
associated with a pointer.  There may not be one...
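
For reference, a hedged sketch of what such a size lookup could look like. To the best of my knowledge, OS X's <malloc/malloc.h> declares malloc_size() for this, and glibc's <malloc.h> has the analogous malloc_usable_size(). `SystemAllocationSize` is a made-up helper name, not anything in perftools, and this has not been tried inside InvalidGetSizeForRealloc.

```cpp
#include <cstddef>
#include <cstdlib>

#if defined(__APPLE__)
#include <malloc/malloc.h>  // malloc_size
#elif defined(__linux__)
#include <malloc.h>         // malloc_usable_size
#endif

// Hypothetical helper: returns the usable size of a block that came from the
// system allocator, or 0 when ptr is null or the platform offers no such
// query. The result may be larger than the size originally requested.
std::size_t SystemAllocationSize(void* ptr) {
  if (ptr == nullptr) return 0;
#if defined(__APPLE__)
  return malloc_size(ptr);
#elif defined(__linux__)
  return malloc_usable_size(ptr);
#else
  return 0;
#endif
}
```

With a size in hand, a realloc fallback could allocate a new block from tcmalloc, copy that many bytes, and then release the old block through its original zone.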

} The problem is that OS X is trying to free memory by tcmalloc

Wait -- OS X is trying to free memory that tcmalloc allocated?  You had
described the problem like this:

} The Ruby readline extension *is* compiled with -flat_namespace. It calls
} readline(), which returns a string allocated by the system allocator. It then
} calls tcmalloc's free() on that data, and things go boom.

which means tcmalloc is calling free on a libc-allocated string.  But it sounds
like it happens the other way around too, where libc calls free() on a
tcmalloc-allocated string?

You're right that if that is what is happening, I don't see any way for
tcmalloc to work around that.  Can you confirm that that is really the problem
situation you're seeing?

Original comment by csilv...@gmail.com on 20 Apr 2009 at 4:33

GoogleCodeExporter commented 9 years ago
Yes, I did describe it like that. By the time I finished coding the solution, I
found that it was actually doing both: in some code, tcmalloc tries to free
memory allocated by OS X, and in other code, OS X tries to free memory
allocated by tcmalloc.

Sorry for any confusion that this might have caused.

Original comment by honglilai@gmail.com on 20 Apr 2009 at 5:06

GoogleCodeExporter commented 9 years ago
In this case, I'm afraid I'm going to have to close this WillNotFix.  I'm not
sure there's a fix we *can* do that handles both cases.  I'm glad to see you
were able to come up with a successful workaround.

Original comment by csilv...@gmail.com on 20 Apr 2009 at 5:52

GoogleCodeExporter commented 9 years ago
Dunno whether this is the right place, but I've encountered the very same
problem while trying to enable tcmalloc for Chromium on Mac OS:

1. The program starts the initialization and allocates some memory using
   malloc_zone_malloc().
2. tcmalloc initializes and replaces the default zone with its own one.
3. The program tries to free the memory allocated in step 1 using tc_free and
   gets an "invalid free" error.

I came up with the same idea of fixing this by falling back to the default
zone via malloc_zone_from_ptr():

 558 void InvalidFree(void* ptr) {
 559 #ifdef __APPLE__
 560   // Before reporting an invalid free, try to find the zone that actually allocated |ptr|,
 561   // and deallocate it from that zone.
 562   malloc_zone_t *alloc_zone = malloc_zone_from_ptr(ptr);
 563   if (alloc_zone) {
 564     malloc_zone_free(alloc_zone, ptr);
 565     return;
 566   }
 567 #endif
 568   CRASH("Attempt to free invalid pointer: %p\n", ptr);
 569 }

But now I get a bus error at the place where malloc_zone_from_ptr is called:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000004
0x96586bed in szone_size ()
(gdb) bt
#0  0x96586bed in szone_size ()
#1  0x96587eb7 in malloc_zone_from_ptr ()
#2  0x00320128 in (anonymous namespace)::InvalidFree (ptr=0x1307a60) at /Users/glider/src/chrome-commit/src/base/allocator/../../third_party/tcmalloc/chromium/src/tcmalloc.cc:562
#3  0x00031698 in (anonymous namespace)::do_free_with_callback (ptr=0x1307a60, invalid_free_fn=0x320110 <(anonymous namespace)::InvalidFree(void*)>) at /Users/glider/src/chrome-commit/src/base/allocator/../../third_party/tcmalloc/chromium/src/tcmalloc.cc:1200
#4  0x00031902 in (anonymous namespace)::do_free (ptr=0x1307a60) at /Users/glider/src/chrome-commit/src/base/allocator/../../third_party/tcmalloc/chromium/src/tcmalloc.cc:1235
#5  0x00d17a4f in tc_free (ptr=0x1307a60) at /Users/glider/src/chrome-commit/src/base/allocator/../../third_party/tcmalloc/chromium/src/tcmalloc.cc:1542
#6  0x0031e5ea in (anonymous namespace)::mz_free (zone=0x12eb000, ptr=0x1307a60) at /Users/glider/src/chrome-commit/src/base/allocator/../../third_party/tcmalloc/chromium/src/tcmalloc.cc:327
#7  0x9244619a in __CFBasicHashRehash ()
#8  0x9246c568 in CFBasicHashRemoveValue ()
#9  0x9246c032 in __CFDoExternRefOperation ()
#10 0x9246c085 in -[NSObject(NSObject) release] ()

Looks like szone_size treats its |zone| parameter as a szone_t
(http://google.com/codesearch/p?hl=en#pFm0LxzAWvs/darwinsource/tarballs/apsl/Libc-391.tar.gz%7Cz8mNFiEo9vA/Libc-391/gen/scalable_malloc.c&q=szone_size&l=3120),
whereas we only copy the first malloc_zone_t belonging to that szone_t when we
register our own allocators. I therefore doubt that it's safe to replace the
system malloc this way.
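
To illustrate the hazard in isolation, here is a small sketch with invented types. `PublicZone` plays the role of malloc_zone_t and `PrivateZone` the role of szone_t; the real structures are far larger and none of these names or fields come from Libc. The copy lands in zero-filled storage so that the example stays well defined. In the real case the bytes behind the public header are whatever happens to be there, which would explain the crash in szone_size.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical public header, analogous to malloc_zone_t.
struct PublicZone {
  const char* name;
};

// Hypothetical private structure, analogous to szone_t. The public header is
// the first member, so a PrivateZone* can be handed out as a PublicZone*.
struct PrivateZone {
  PublicZone basic;
  std::uint32_t magic;  // private state that the implementation relies on
};

constexpr std::uint32_t kMagic = 0x5A4F4E45;

// Models "copying only the first malloc_zone_t": the public header is
// duplicated, but none of the private state behind it comes along.
PrivateZone CopyPublicPartOnly(const PrivateZone& original) {
  PrivateZone copy;
  std::memset(&copy, 0, sizeof(copy));
  std::memcpy(&copy.basic, &original.basic, sizeof(PublicZone));
  return copy;
}

// What implementation code effectively assumes when it casts the zone
// pointer back to the private type.
bool LooksLikeFullZone(const PrivateZone& zone) {
  return zone.magic == kMagic;
}
```

A zone built with CopyPublicPartOnly() keeps the public fields but fails LooksLikeFullZone(), which is the situation any code that casts the pointer back to the private type would run into.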

Declaring szone_t and replacing the whole (szone_t*)malloc_default_zone() in
ReplaceSystemAlloc seems to help a little (several "non-aligned pointer" errors
are indeed printed, as comment 14 states), but after a while the program
reaches a spinlock in szone_free_definite_size(), where it either segfaults or
locks up forever:

#0  0xffff0269 in __spin_lock ()
#1  0x96586fbb in szone_free_definite_size ()
#2  0x96586ba8 in free ()
#3  0x9658b17f in _reclaim_telldir ()
#4  0x9658b12d in closedir$UNIX2003 ()
#5  0x924520db in _CFBundleCopyDirectoryContentsAtPath ()
#6  0x9245e6f1 in _CFBundleCopyInfoDictionaryInDirectoryWithVersion ()
#7  0x9245e24f in CFBundleGetInfoDictionary ()
#8  0x9245031d in _CFBundleCreate ()
#9  0x92492c56 in _CFBundleEnsureBundleExistsForImagePath ()
#10 0x9249274b in CFBundleGetBundleWithIdentifier ()
#11 0x96a3ff3d in HLTBGetBundle ()
#12 0x96a3fe9c in HIGetScaleFactor ()
#13 0x9432437f in __NSHasDisplayScaleFactor ()
#14 0x94323a7a in -[NSApplication init] ()
#15 0x9432353d in +[NSApplication sharedApplication] ()
#16 0x002e1e6c in mock_cr_app::RegisterMockCrApp () at /Users/glider/src/chrome-commit/src/base/test/mock_chrome_application_mac.mm:16
#17 0x002e26f4 in base::TestSuite::Initialize (this=0xbffff998) at /Users/glider/src/chrome-commit/src/base/test/test_suite.cc:182
#18 0x002e2abc in base::TestSuite::Run (this=0xbffff998) at /Users/glider/src/chrome-commit/src/base/test/test_suite.cc:124
#19 0x00034000 in main (argc=2, argv=0xbffffa34) at /Users/glider/src/chrome-commit/src/base/test/run_all_unittests.cc:8

The spinlock is probably that of szone_t, but because we just copied the 
structure without actually registering a scalable zone, something could break 
down.

Original comment by gli...@google.com on 20 May 2011 at 10:54

GoogleCodeExporter commented 9 years ago
I'll reopen this, since I think we've figured out a fix, using the malloc-zone 
functionality to hook in tcmalloc as the default malloc zone.  I hope to get 
this working next week.

Original comment by csilv...@gmail.com on 20 May 2011 at 6:48

GoogleCodeExporter commented 9 years ago
I think this should be fixed with the new malloc-zone implementation in 
perftools 1.8.  I'm closing this bug fixed, but feel free to reopen if there 
are still problems.

Original comment by csilv...@gmail.com on 31 Aug 2011 at 10:03