joshlf opened this issue 7 years ago
Replying to @davidtgoldblatt's comment:
@joshlf, the tsd struct is defined at https://github.com/jemalloc/jemalloc/blob/a315688be0f38188f16fe89ee1657c7f596f8cbb/src/tsd.c#L15, with the tls model set in the configure script https://github.com/jemalloc/jemalloc/blob/82d1a3fb318fb086cd4207ca03dbdd5b0e3bbb26/configure.ac#L725.
Basically, you just need to set the TLS model to be "initial-exec" (or "gnu2" if you feel more confident about the implementation issues there than we do). (Not sure how easy it is to do this with rust).
Happy to help more, but let's discuss on the elfmalloc issue to avoid polluting this thread.
Awesome, thanks so much! I'll take a look at those.
So to clarify our problem, we already know what we want to be stored in TLS - we just aren't sure how to get that structure into TLS in all cases. Rust supports a `#[thread_local]` attribute, which corresponds to LLVM's TLS support (the GeneralDynamicTLSModel in particular, if I'm reading the PR correctly), which, as I understand it based on this LLVM documentation, maps to an ELF feature. On platforms on which that's supported, everything works fine.
However, on platforms on which it's not supported, the only mechanism that the standard library exposes uses allocation under the hood (`pthread_setspecific`/`pthread_getspecific` on POSIX, etc.), which of course we can't use. So what I'm trying to figure out is whether there's a way - without support for the ELF TLS feature - to get TLS to work in a way that doesn't recursively depend on allocation or, if it does, allows this recursion to be easily detectable so that we can fall back to a global allocation if necessary.
If I'm reading your comment correctly, what you're describing is a solution to the first problem - that we solve with Rust's `#[thread_local]` attribute - but not to the second problem?
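(For reference, the `#[thread_local]` route looks roughly like the sketch below on platforms where it works; `Cache` and `with_cache` are just stand-ins for our real per-thread state and accessor, and the attribute needs a nightly feature gate.)

```rust
// Sketch of the #[thread_local] path (nightly Rust), assuming ELF-style TLS
// support. `Cache` and `with_cache` are hypothetical stand-ins for the real
// per-thread allocator state and accessor.
#![feature(thread_local)]

use core::ptr;

struct Cache {
    // e.g. free lists, per-thread size-class caches, statistics, ...
    free_list: *mut u8,
}

// One instance per thread; the compiler, linker, and dynamic loader cooperate
// to materialize it, which is exactly the part that's unavailable on platforms
// without this flavor of TLS.
#[thread_local]
static mut CACHE: Cache = Cache { free_list: ptr::null_mut() };

unsafe fn with_cache<R>(f: impl FnOnce(&mut Cache) -> R) -> R {
    f(&mut CACHE)
}
```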
Oh, I see what you're saying. Yeah, the initial-exec TLS model addresses the issue that, if you're loading your TLS-containing object from a shared library, the memory for it gets lazily allocated via malloc.
`pthread_[g|s]etspecific` should actually work for the most part. A `pthread_key_t` is (under glibc, and I assume most libcs) just an index into a per-thread data structure whose early bits are initialized with the thread. If your initialization code runs early enough (which you can probably ensure through a constructor attribute, or just the fact that malloc will probably be called very early in process startup), you should be able to reserve one of the early keys, which probably won't require dynamic allocation.
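In Rust terms (going through the `libc` crate), the kind of slot I'm describing might look something like this sketch; `TsdValue` and the helper names are made up:

```rust
// Sketch of a pthread_key_t-based TSD slot via the `libc` crate. `TsdValue`
// and the helper names are hypothetical.
use std::os::raw::c_void;
use std::sync::OnceLock;

struct TsdValue {
    // per-thread allocator state would live here
    allocations: usize,
}

static TSD_KEY: OnceLock<libc::pthread_key_t> = OnceLock::new();

fn tsd_key() -> libc::pthread_key_t {
    // Reserve the key once, as early as possible (ideally from a library
    // constructor); pthread_key_create itself shouldn't allocate under glibc.
    *TSD_KEY.get_or_init(|| {
        let mut key = 0;
        let rc = unsafe { libc::pthread_key_create(&mut key, None) };
        assert_eq!(rc, 0);
        key
    })
}

fn tsd_get() -> *mut TsdValue {
    unsafe { libc::pthread_getspecific(tsd_key()) as *mut TsdValue }
}

fn tsd_set(value: *mut TsdValue) {
    let rc = unsafe { libc::pthread_setspecific(tsd_key(), value as *const c_void) };
    assert_eq!(rc, 0);
}
```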
Some early multithreaded mallocs had a global hashmap (backed by a slow fallback allocator that doesn't use thread-local data, but that a thread needs to use only once, on its first allocator call) that mapped thread ids to the thread-local data for that thread. I suspect this approach would have unpleasantly large constant factors, though.
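A rough rendering of that historical scheme, as a sketch only (not anything jemalloc actually ships): a fixed-size table with linear scan stands in for the hash map so lookups never allocate, and a hypothetical mmap-backed `slow_alloc` plays the part of the slow fallback allocator. Linux-flavored, with error handling omitted; a real implementation would also want a lock that is guaranteed not to allocate.

```rust
use std::mem;
use std::sync::Mutex;

const MAX_THREADS: usize = 1024;

struct ThreadLocalData {
    free_list: *mut u8, // stand-in for the real per-thread state
}

// Each slot is (thread id, data pointer as usize); (0, 0) means empty.
static REGISTRY: Mutex<[(usize, usize); MAX_THREADS]> =
    Mutex::new([(0, 0); MAX_THREADS]);

// Hypothetical fallback allocation that bypasses malloc entirely
// (error handling omitted).
fn slow_alloc(size: usize) -> *mut u8 {
    unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        ) as *mut u8
    }
}

fn thread_local_data() -> *mut ThreadLocalData {
    let tid = unsafe { libc::pthread_self() } as usize;
    let mut table = REGISTRY.lock().unwrap();
    if let Some(&(_, ptr)) = table.iter().find(|&&(t, _)| t == tid) {
        return ptr as *mut ThreadLocalData;
    }
    // First allocator call on this thread: build its state on the slow path,
    // so we never re-enter our own malloc here.
    let data = slow_alloc(mem::size_of::<ThreadLocalData>()) as *mut ThreadLocalData;
    let slot = table
        .iter_mut()
        .find(|&&mut (t, _)| t == 0)
        .expect("thread registry full");
    *slot = (tid, data as usize);
    data
}
```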
In the end, I don't think there's a good way of getting around all the annoying platform-specific stuff if you're trying to optimize for speed; I think jemalloc ends up with something like 4 substantially different TLS implementations. You might find some useful inspiration in https://github.com/jemalloc/jemalloc/blob/dev/include/jemalloc/internal/tsd.h and friends (e.g. tsd_generic.h, tsd_win.h, etc.) and https://github.com/jemalloc/jemalloc/blob/dev/src/tsd.c (you can ignore the TSD struct and the X-macro nonsense; the tsd_fetch implementation is the interesting part).
Note that TLS isn't the only reentrancy scenario you might need to worry about; on many platforms, "safe"-seeming standard library calls end up allocating (e.g. mutex locking, file reading, getting the CPU count, off the top of my head).
That's hugely helpful, thanks! I'll mull over that for a while :)
Update: On OS X, we're getting bitten by the same issue that rpmalloc is: https://github.com/rampantpixels/rpmalloc/issues/29. It seems to come from the fact that when using `DYLD_INSERT_LIBRARIES`, `_tlv_bootstrap` is implemented simply as `abort()`.
> `pthread_[g|s]etspecific` should actually work for the most part. A `pthread_key_t` is (under glibc, and I assume most libcs) just an index into a per-thread data structure whose early bits are initialized with the thread. If your initialization code runs early enough (which you can probably ensure through a constructor attribute, or just the fact that malloc will probably be called very early in process startup), you should be able to reserve one of the early keys, which probably won't require dynamic allocation.
@davidtgoldblatt Do you mean that if you get there late enough, `pthread_key_create` will allocate, or that every thread spawn will require allocation to make space for the key you allocated with `pthread_key_create`?
My memory on this is a little hazy; you'd have to check the glibc source (or whatever libc you're interested in) to be sure. Here's the way I remember glibc working:
So `pthread_key_create` won't allocate (it just acquires a mutex and bumps a counter). The allocation happens the first time you touch the per-thread data in a given thread.
We get our pthread key in a library constructor at load time[1]. As far as I know, this is early enough that it hasn't caused problems for anyone.
[1] We actually do this even if TLS is available, to get us a place to put our per-thread shutdown hooks.
OK, that makes perfect sense, thanks! We'll probably end up doing something similar, because even if we fix this dylib issue (more on that in a second...), we'll still need to support platforms that don't support the `#[thread_local]` attribute.
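Concretely, the load-time key reservation can be sketched from Rust without a C shim, something like the following (ELF/Linux only, and roughly what crates like `ctor` do under the hood; `on_thread_exit` is just a placeholder for per-thread shutdown hooks):

```rust
// Sketch: reserve the TSD key from a library constructor so it happens at load
// time, before the first malloc call. ELF/Linux shown; Mach-O has an analogous
// constructor section. `on_thread_exit` is a placeholder destructor.
use std::os::raw::c_void;

extern "C" fn init_tsd() {
    let mut key: libc::pthread_key_t = 0;
    let rc = unsafe { libc::pthread_key_create(&mut key, Some(on_thread_exit)) };
    assert_eq!(rc, 0);
    // stash `key` somewhere global for later pthread_{get,set}specific calls
}

// pthread runs this at thread exit for any thread whose slot is non-null,
// which is also a convenient home for per-thread shutdown hooks.
unsafe extern "C" fn on_thread_exit(_value: *mut c_void) {}

// A function pointer in .init_array gets called by the loader when the shared
// object is loaded - the moral equivalent of __attribute__((constructor)).
#[used]
#[link_section = ".init_array"]
static INIT_TSD: extern "C" fn() = init_tsd;
```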
I just discovered something interesting about that issue I linked to above: https://github.com/rampantpixels/rpmalloc/issues/29. It looks like the issue subsides if you only access TLS after dynamic loading is complete. My hunch, given the stack trace pasted below, is that our `malloc` is getting called while the dynamic loader is still doing its thing, in turn using TLS, and TLS is not supported during dynamic loading. There's a hint at reduced TLS support in dynamic libraries here ("No fast thread local storage. pthread_getspecific(3) is currently the only option."), although it seems to imply that fast TLS is never supported, whereas in reality I was able to get a test working that used TLS in a dynamic library, but only after that library was fully loaded.
Have you guys run into anything similar with jemalloc?
Doesn't ring a bell, but most of the tricky TLS stuff was figured out before my time. I don't know the details of TLS on OS X (actually, my memory was that their compiler people had disabled it there because they wanted to come up with a new implementation without having to worry about backwards compatibility). I think we exclusively use pthreads for our thread-local data on OS X.
Gotcha. I actually managed to solve this with a surprisingly simple approach - I have a global bool that's initialized to false and then set to true in the library constructor. The TLS code checks that boolean and uses a global slow path if it's false (avoiding accessing TLS).
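In code, the guard is something like the following sketch; `fast_path` and `global_slow_path` are placeholders for the real allocation paths:

```rust
// Sketch of the guard: a process-global flag that stays false until our
// library constructor runs, so any malloc call made while the dynamic loader
// is still loading us never touches TLS. `fast_path` and `global_slow_path`
// are placeholders for the real allocation paths.
#![feature(thread_local)]

use std::sync::atomic::{AtomicBool, Ordering};

static DYLIB_LOADED: AtomicBool = AtomicBool::new(false);

// Registered as a library constructor (e.g. an .init_array entry on ELF, or
// the Mach-O equivalent on OS X), as in the earlier constructor sketch.
extern "C" fn mark_loaded() {
    DYLIB_LOADED.store(true, Ordering::SeqCst);
}

#[thread_local]
static mut ALLOC_COUNT: usize = 0;

pub fn alloc_bytes(size: usize) -> *mut u8 {
    if !DYLIB_LOADED.load(Ordering::SeqCst) {
        // Loader still running: touching #[thread_local] data here would hit
        // _tlv_bootstrap (i.e. abort()) under DYLD_INSERT_LIBRARIES.
        return global_slow_path(size);
    }
    unsafe { fast_path(size) }
}

unsafe fn fast_path(size: usize) -> *mut u8 {
    ALLOC_COUNT += 1; // any #[thread_local] access must sit behind the flag
    global_slow_path(size) // stand-in; the real fast path uses a per-thread cache
}

fn global_slow_path(_size: usize) -> *mut u8 {
    std::ptr::null_mut() // placeholder
}
```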
On platforms without the `#[thread_local]` attribute, thread-local storage (TLS) for an allocator is particularly tricky, because most obvious implementations (including the one in the standard library) require allocation to work, and do so in such a way that detecting what state a TLS value is in itself requires allocation, meaning that the recursion (caused by an allocation call accessing TLS and triggering further allocation) cannot be detected. While it's possible that this will be fixed at some point in the future, it won't be fixed any time soon. Inspired by this comment on a jemalloc issue, our best bet may be to register a handler that is called by pthread whenever a thread spawns, and use this opportunity to preemptively initialize TLS for that thread.
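pthread doesn't expose a portable "a thread just spawned" hook directly, so one way to approximate this - purely a speculative sketch, and not necessarily what the linked comment proposes - is to interpose `pthread_create` itself (we're already an interposing dylib) and initialize the new thread's TLS in a trampoline before the user's start routine runs. `init_thread_tls` is hypothetical and error handling is glossed over.

```rust
use std::os::raw::c_void;

use libc::{pthread_attr_t, pthread_t};

type StartRoutine = extern "C" fn(*mut c_void) -> *mut c_void;

struct Trampoline {
    start: StartRoutine,
    arg: *mut c_void,
}

extern "C" fn trampoline(raw: *mut c_void) -> *mut c_void {
    // We're now on the new thread: set up its TLS before any user code runs.
    init_thread_tls();
    let t = unsafe { Box::from_raw(raw as *mut Trampoline) };
    (t.start)(t.arg)
}

#[no_mangle]
pub unsafe extern "C" fn pthread_create(
    thread: *mut pthread_t,
    attr: *const pthread_attr_t,
    start: StartRoutine,
    arg: *mut c_void,
) -> i32 {
    // The Box is allocated on the parent thread, whose TLS already exists.
    let payload = Box::into_raw(Box::new(Trampoline { start, arg }));
    // Find the real pthread_create and forward to it with our trampoline.
    let real: unsafe extern "C" fn(
        *mut pthread_t,
        *const pthread_attr_t,
        StartRoutine,
        *mut c_void,
    ) -> i32 = std::mem::transmute(libc::dlsym(
        libc::RTLD_NEXT,
        b"pthread_create\0".as_ptr() as *const _,
    ));
    real(thread, attr, trampoline, payload as *mut c_void)
}

fn init_thread_tls() {
    // hypothetical: reserve/initialize this thread's allocator state
}
```

Since the trampoline's payload is allocated by the parent thread, whose TLS is already initialized, this wouldn't reintroduce the recursion problem; the child thread's first allocator call then finds its TLS already set up.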