alloc-tls: Handle thread-local storage on platforms without #[thread_local]

joshlf commented 7 years ago

On platforms without the #[thread_local] attribute, thread-local storage (TLS) for an allocator is particularly tricky. Most obvious implementations (including the one in the standard library) require allocation to work, and even detecting what state a TLS value is in requires allocation. As a result, recursion (an allocation call accesses TLS, which triggers a further allocation) cannot be detected. While it's possible that this will be fixed at some point in the future, it won't be fixed any time soon.
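For contrast, here is a minimal sketch of what recursion detection looks like when a slot's state *can* be read without allocating (as is the case with #[thread_local]). The `Slot` type, the `State` enum, and the `with` method are hypothetical names made up for illustration, not alloc-tls's actual API, and the slot here is an ordinary value rather than real TLS.

```rust
use std::cell::{Cell, RefCell};

// Hypothetical three-state flag; reading it never allocates.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum State {
    Uninit,
    Initializing,
    Ready,
}

// Hypothetical lazily initialized slot (a stand-in for one TLS value).
pub struct Slot<T> {
    state: Cell<State>,
    value: RefCell<Option<T>>,
}

impl<T> Slot<T> {
    pub const fn new() -> Self {
        Slot {
            state: Cell::new(State::Uninit),
            value: RefCell::new(None),
        }
    }

    // Runs `f` on the value, initializing it with `init` on first use.
    // Returns None if called again while `init` is still running, which
    // is the recursion case: the caller should take a global slow path.
    pub fn with<R>(&self, init: impl FnOnce() -> T, f: impl FnOnce(&T) -> R) -> Option<R> {
        match self.state.get() {
            State::Initializing => return None,
            State::Uninit => {
                self.state.set(State::Initializing);
                let v = init();
                *self.value.borrow_mut() = Some(v);
                self.state.set(State::Ready);
            }
            State::Ready => {}
        }
        let guard = self.value.borrow();
        match &*guard {
            Some(v) => Some(f(v)),
            None => None,
        }
    }
}
```

The point of the sketch is only that the `Initializing` check is a plain memory read. The problem described above is that, without #[thread_local], there is no obvious place to keep that flag that can itself be reached without allocating.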

Inspired by this comment on a jemalloc issue, our best bet may be to register a handler that is called by pthread whenever a thread spawns, and use this opportunity to preemptively initialize TLS for that thread.

joshlf commented 6 years ago

Replying to @davidtgoldblatt's comment:

> @joshlf, the tsd struct is defined at https://github.com/jemalloc/jemalloc/blob/a315688be0f38188f16fe89ee1657c7f596f8cbb/src/tsd.c#L15, with the TLS model set in the configure script https://github.com/jemalloc/jemalloc/blob/82d1a3fb318fb086cd4207ca03dbdd5b0e3bbb26/configure.ac#L725.
>
> Basically, you just need to set the TLS model to be "initial-exec" (or "gnu2" if you feel more confident about the implementation issues there than we do). (Not sure how easy it is to do this with Rust.)
>
> Happy to help more, but let's discuss on the elfmalloc issue to avoid polluting this thread.

Awesome, thanks so much! I'll take a look at those.

joshlf commented 6 years ago

So to clarify our problem, we already know what we want to be stored in TLS - we just aren't sure how to get that structure into TLS in all cases. Rust supports a #[thread_local] attribute, which corresponds to LLVM's TLS support (the GeneralDynamicTLSModel in particular, if I'm reading the PR correctly) which, as I understand it based on this LLVM documentation, maps to an ELF feature. On platforms on which that's supported, everything works fine.

However, on platforms on which it's not supported, the only mechanism that the standard library exposes uses allocation under the hood (pthread_setspecific/pthread_getspecific on POSIX, etc), which of course we can't use. So what I'm trying to figure out is whether there's a way - without support for the ELF TLS feature - to get TLS to work in a way that doesn't recursively depend on allocation or, if it does, allows this recursion to be easily detectable so that we can fall back to a global allocation if necessary.

If I'm reading your comment correctly, what you're describing is a solution to the first problem - that we solve with Rust's #[thread_local] attribute - but not to the second problem?

davidtgoldblatt commented 6 years ago

Oh, I see what you're saying. Yeah, the initial-exec TLS model addresses the issue that, if you're loading your TLS-containing object from a shared library, the memory from it will get lazily allocated, via malloc.

pthread_[g|s]etspecific should actually work for the most part. A pthread_key_t is (under glibc, and I assume most libcs) just an index into a per-thread data structure whose early bits are initialized with the thread. If your initialization code runs early enough (which you can probably ensure through a constructor attribute, or just the fact that malloc will probably be called very early in process startup), you should be able to reserve one of the early keys, which probably won't require dynamic allocation.

Some early multithreaded mallocs had a global hashmap (backed by a slow fallback allocator that doesn't use thread-local data, but that a thread needs to use only once, on its first allocator call) that mapped thread ids to the thread-local data for that thread. I suspect this approach would have unpleasantly large constant factors, though.
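A rough sketch of that global-map idea, under several assumptions: the table is fixed-size so it never allocates itself, the thread id is passed in as a nonzero integer (how to obtain it without touching TLS is left open), and the value is an opaque `usize` such as a pointer obtained from the slow fallback allocator. `ThreadTable`, `SLOTS`, and `get_or_insert` are hypothetical names, not taken from any of the allocators mentioned.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical capacity; a real table would need a growth or overflow policy.
const SLOTS: usize = 64;
// Key value meaning "slot unused"; thread ids must therefore be nonzero.
const EMPTY: usize = 0;

// Fixed-size open-addressing map from thread id to per-thread data.
pub struct ThreadTable {
    keys: [AtomicUsize; SLOTS],
    vals: [AtomicUsize; SLOTS],
}

impl ThreadTable {
    pub const fn new() -> Self {
        const ZERO: AtomicUsize = AtomicUsize::new(0);
        ThreadTable {
            keys: [ZERO; SLOTS],
            vals: [ZERO; SLOTS],
        }
    }

    // Returns the value registered for `tid`, calling `make` (for example,
    // the slow fallback allocator) only on that thread's first call.
    // Returns None if the table is full.
    pub fn get_or_insert(&self, tid: usize, make: impl FnOnce() -> usize) -> Option<usize> {
        assert!(tid != EMPTY, "thread ids must be nonzero in this sketch");
        let start = tid % SLOTS;
        for i in 0..SLOTS {
            let idx = (start + i) % SLOTS;
            let key = self.keys[idx].load(Ordering::Acquire);
            if key == tid {
                // Only the owning thread reads or writes its own value.
                return Some(self.vals[idx].load(Ordering::Relaxed));
            }
            if key == EMPTY
                && self.keys[idx]
                    .compare_exchange(EMPTY, tid, Ordering::AcqRel, Ordering::Acquire)
                    .is_ok()
            {
                let v = make();
                self.vals[idx].store(v, Ordering::Relaxed);
                return Some(v);
            }
            // Slot belongs to another thread (or we lost the race): keep probing.
        }
        None
    }
}
```

Every lookup pays for a hash and a probe sequence over shared cache lines, which is presumably where the "unpleasantly large constant factors" would come from. Note also that this sketch does not guard against `make` re-entering the table for the same thread.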

In the end, I don't think there's a good way of getting around all the annoying platform-specific stuff if you're trying to optimize for speed; I think jemalloc ends up with something like 4 substantially different TLS implementations. You might find some useful inspiration in https://github.com/jemalloc/jemalloc/blob/dev/include/jemalloc/internal/tsd.h and friends (e.g. tsd_generic.h, tsd_win.h, etc.) and https://github.com/jemalloc/jemalloc/blob/dev/src/tsd.c (you can ignore the TSD struct and the X-macro nonsense; the tsd_fetch implementation is the interesting part).

Note that TLS isn't the only reentrancy scenario you might need to worry about; on many platforms, "safe"-seeming standard library calls end up allocating (e.g. mutex locking, file reading, getting the CPU count, off the top of my head).

joshlf commented 6 years ago

That's hugely helpful, thanks! I'll mull over that for a while :)

joshlf commented 6 years ago

Update: On OS X, we're getting bitten by the same issue that rpmalloc is: https://github.com/rampantpixels/rpmalloc/issues/29. It seems to come from the fact that when using DYLD_INSERT_LIBRARIES, _tlv_bootstrap is implemented simply as abort().

joshlf commented 6 years ago

> pthread_[g|s]etspecific should actually work for the most part. A pthread_key_t is (under glibc, and I assume most libcs) just an index into a per-thread data structure whose early bits are initialized with the thread. If your initialization code runs early enough (which you can probably ensure through a constructor attribute, or just the fact that malloc will probably be called very early in process startup), you should be able to reserve one of the early keys, which probably won't require dynamic allocation.

@davidtgoldblatt Do you mean that if you get there late enough, pthread_key_create will allocate, or that every thread spawn will require allocation to make space for the key you created with pthread_key_create?

davidtgoldblatt commented 6 years ago

My memory on this is a little hazy; you'd have to check the glibc source (or that of whatever libc you're interested in) to be sure. Here's the way I remember glibc working:

- A pthread key is just an integer, an index into a per-thread, growable control structure.
- That per-thread structure has SOME_CONSTANT_1 structs to contain the metadata you initialize in pthread_setspecific, an initially null pointer to a followup structure, and some counters and whatnot to maintain consistency.
- All these structures share the same internal layout; if only one thread touches key 123, then all other threads have null metadata in slot 123 of the structure.
- Every thread gets a copy of that control structure upon creation (created by start.S for the initial thread, or by the parent thread for other threads), living at the top of the stack.
- When you've created SOME_CONSTANT_1 pthread keys, then, you can still allocate more. The first time you pthread_setspecific on that key index, the implementation will notice that the followup pointer in the control structure is null, allocate another one (using malloc), and use that to satisfy the allocation.

So pthread_key_create won't allocate (it just acquires a mutex and bumps a counter). It's the first time you touch the per-thread data in a thread that you get an allocation.
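To make that concrete, here is a toy model of the layout as recalled above. It is not glibc's code: `INLINE_KEYS` is a made-up stand-in for SOME_CONSTANT_1, `PerThreadKeys` is a hypothetical name, and the model has only a single followup block. It only illustrates why an early key never allocates while the first touch of a late key does.

```rust
// Hypothetical stand-in for SOME_CONSTANT_1; the real glibc value is not assumed.
const INLINE_KEYS: usize = 32;

// Toy model of one thread's control structure (0 plays the role of null).
pub struct PerThreadKeys {
    // Slots that exist as soon as the thread does.
    inline: [usize; INLINE_KEYS],
    // Followup block, null until a late key is first set on this thread.
    followup: Option<Box<[usize; INLINE_KEYS]>>,
}

impl PerThreadKeys {
    pub const fn new() -> Self {
        PerThreadKeys {
            inline: [0; INLINE_KEYS],
            followup: None,
        }
    }

    // Models pthread_setspecific. Returns true if this call had to allocate.
    pub fn set(&mut self, key: usize, value: usize) -> bool {
        assert!(key < 2 * INLINE_KEYS, "toy model has a single followup block");
        if key < INLINE_KEYS {
            self.inline[key] = value;
            return false;
        }
        let allocated = self.followup.is_none();
        let block = self
            .followup
            .get_or_insert_with(|| Box::new([0; INLINE_KEYS]));
        block[key - INLINE_KEYS] = value;
        allocated
    }

    // Models pthread_getspecific; unset keys read as 0.
    pub fn get(&self, key: usize) -> usize {
        if key < INLINE_KEYS {
            self.inline[key]
        } else {
            self.followup
                .as_ref()
                .map_or(0, |b| b.get(key - INLINE_KEYS).copied().unwrap_or(0))
        }
    }
}
```

In this model, an allocator that reserves its key early (index below `INLINE_KEYS`) never sees `set` return true, which is the property being relied on.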

We get our pthread key in a library constructor at load time[1]. As far as I know, this is early enough that it hasn't caused problems for anyone.

[1] We actually do this even if TLS is available, to get us a place to put our per-thread shutdown hooks.

joshlf commented 6 years ago

OK that makes perfect sense, thanks! We'll probably end up doing something similar because even if we fix this dylib issue (more on that in a second...), we'll still need to support platforms that don't support the #[thread_local] attribute.

I just discovered something interesting about that issue I linked to above: https://github.com/rampantpixels/rpmalloc/issues/29. It looks like the issue subsides if you only access TLS after the dynamic loading is complete. My hunch, given the stack trace pasted below, is that our malloc is getting called while the dynamic loader is still doing its thing, in turn using TLS, and TLS is not supported during dynamic loading. There's a hint at reduced TLS support in dynamic libraries here ("No fast thread local storage. pthread_getspecific(3) is currently the only option."), although it seems to imply that fast TLS is never supported, while in reality, I was able to get a test working that used TLS in a dynamic library, but after that library was fully loaded.

Stack trace

```text
* thread #1: tid = 0x0000, 0x00007fff9364bf06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff9364bf06 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff9868f4ec libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff86d856df libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff925a53e0 libdyld.dylib`_tlv_bootstrap + 9
    frame #4: 0x000000010d622e99 libelfc.dylib`elfmalloc::general::global::alloc(size=8) + 41 at general.rs:371
    frame #5: 0x000000010d5d4436 libelfc.dylib`elfmalloc::alloc_impl::{{impl}}::c_malloc(self=0x000000010d6658e5, size=8) + 70 at alloc_impl.rs:60
    frame #6: 0x000000010d5b9c47 libelfc.dylib`elfc::malloc(size=8) + 39 at :4
    frame #7: 0x00007fff622861be dyld`operator new(unsigned long) + 30
    frame #8: 0x00007fff622736c5 dyld`std::__1::vector >::insert(std::__1::__wrap_iter, char const* (* const&)(dyld_image_states, unsigned int, dyld_image_info const*)) + 343
    frame #9: 0x00007fff6226e507 dyld`dyld::registerImageStateBatchChangeHandler(dyld_image_states, char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*)) + 147
    frame #10: 0x00007fff925a489e libdyld.dylib`dyld_register_image_state_change_handler + 76
    frame #11: 0x00007fff925a465f libdyld.dylib`_dyld_initializer + 47
    frame #12: 0x00007fff9757c9fd libSystem.B.dylib`libSystem_initializer + 116
    frame #13: 0x00007fff6227d10b dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 265
    frame #14: 0x00007fff6227d284 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
    frame #15: 0x00007fff622798bd dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 305
    frame #16: 0x00007fff62279852 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 198
    frame #17: 0x00007fff62279743 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 127
    frame #18: 0x00007fff622799b3 dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 75
    frame #19: 0x00007fff6226c0ab dyld`dyld::initializeMainExecutable() + 138
    frame #20: 0x00007fff6226fd98 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 3596
    frame #21: 0x00007fff6226b276 dyld`dyldbootstrap::start(macho_header const*, int, char const**, long, macho_header const*, unsigned long*) + 512
    frame #22: 0x00007fff6226b036 dyld`_dyld_start + 54
```

Have you guys run into anything similar with jemalloc?

davidtgoldblatt commented 6 years ago

Doesn't ring a bell, but most of the tricky TLS stuff was figured out before my time. I don't know the details of TLS on OS X (actually, my memory was that their compiler people had disabled it there because they wanted to come up with a new implementation without having to worry about backwards compatibility). I think we exclusively use pthreads for our thread-local data on OS X.

joshlf commented 6 years ago

Gotcha. I actually managed to solve this with a surprisingly simple approach - I have a global bool that's initialized to false and then set to true in the library constructor. The TLS code checks that boolean and uses a global slow path if it's false (avoiding accessing TLS).
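Roughly, the shape of that fix is sketched below. The names `TLS_READY`, `mark_tls_ready`, `choose_path`, and `AllocPath` are placeholders invented for this sketch rather than the identifiers used in the crate, and the wiring of `mark_tls_ready` into an actual library constructor is platform-specific and not shown.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// False until the library constructor has run (hypothetical name).
static TLS_READY: AtomicBool = AtomicBool::new(false);

// Which allocation path a call should take.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AllocPath {
    // Safe to touch TLS: use the per-thread fast path.
    ThreadLocal,
    // Dynamic loading may still be in progress: avoid TLS entirely.
    GlobalSlow,
}

// Assumed to be called once from the library constructor.
pub fn mark_tls_ready() {
    TLS_READY.store(true, Ordering::Release);
}

// Checked at the top of every TLS access in the allocator.
pub fn choose_path() -> AllocPath {
    if TLS_READY.load(Ordering::Acquire) {
        AllocPath::ThreadLocal
    } else {
        AllocPath::GlobalSlow
    }
}
```

Because the flag is a plain static, reading it never goes through _tlv_bootstrap, so a malloc call made by dyld during loading takes the global slow path instead of aborting.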

ezrosent / allocators-rs

alloc-tls: Handle thread-local storage on platforms without #[thread_local] #54