Closed matklad closed 1 year ago
Ahhhhhh yesss! My secret plan to have someone review this code worked. :D
Could I convince you to review `Pool` too? It's not quite the same as `Lazy`, but it is similar-ish. And I just caught a soundness hole in it a few weeks ago. The last three tests in `src/util/pool.rs` tell that story.

> We don't have a spinning no-alloc version though, as I somewhat firmly believe that, for all code that could spin lock without yielding to the OS/blocking interrupts/etc, the caller should explicitly type out `.spin` somewhere.
Yeah, I've read your blog post a few times and I find it pretty compelling. Buuuuuuuuuuut, I don't really know what else to do. For `Lazy`, the core-only functionality is there so that folks in no-std no-alloc environments have an easier path. Namely, one of the main use cases of `regex-automata` is the ability to serialize a DFA, and then deserialize it in a much more constrained environment (no-std and no-alloc). I actually don't have a ton of experience in said environments, so I don't quite know just how useful having something like a `Lazy` type is. But it certainly makes things like `regex-cli generate serialize` easier, because it just writes out Rust code that uses library types from `regex-automata` to make a DFA globally available with zero fuss. But I don't feel too bad about this, because nobody has to use `Lazy`. It isn't used inside the regex internals anywhere. (Currently, anyway. There are use cases for it, but for a variety of reasons unrelated to `Lazy`, I haven't been able to make it work as well as I've wanted.)
With that said, I'm also using a spinlock in `Pool`, and that gets used when only `std` is disabled. So if you have `alloc` but not `std`, then you get a `Pool` with a spinlock. (`Pool` is unavailable in no-std and no-alloc environments because it's pretty heavily tied to the idea of allocating something.) In practice, where this is going to show up is when folks disable the `std` feature in the `regex` crate (which folks have been asking for for a while). So when they do that, a call to, say, `Regex::find` will ask the pool for some memory to run a search. And that's going to hit the spinlock.
Now, in `regex-automata`, there is an escape hatch for all of this. You can choose to use a lower level API that lets you pass your own mutable memory, which means the regex doesn't need to talk to a pool at all. But `regex` proper doesn't have those APIs, and I don't really want it to have them, because it fattens up an already big API for a very niche use case. Maybe eventually I'll cave on that point, but at least at present, I don't feel strongly enough about spinlocks to choose a different point in the design space as I currently conceive it.
I had also been hoping to get to the section on spinlocks in @m-ou-se's book (pinging you on the off chance you want to chime in here) to see if there was any extra insight to be gathered there. I am traveling tomorrow, so I might actually get that chance!
Mara: the big picture here is that, at some point in the next few months, the `regex` crate will provide a way to build without `std`. When it does this, it still needs to use a synchronization primitive to get some mutable scratch space in order to execute a search. Basically, think of it as a `Mutex<Vec<SomeRegexMemory>>`. It plucks something out at the start of a search, does some reading & writing to it, and then stuffs it back in when it's done so that another search can reuse it. Now, with `std` enabled, it does actually use a `Mutex` (with some other tricks). But when you disable `std`, currently, my plan is to have it use a spinlock, because I don't really know what else to do if I want to keep providing the nice high level `Regex::find` APIs (and I do).
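The `Mutex<Vec<SomeRegexMemory>>` shape described above can be sketched minimally like this (a sketch only; `ScratchMemory` is a made-up stand-in for the real per-search scratch type, not regex internals):

```rust
use std::sync::Mutex;

// Hypothetical stand-in for the mutable scratch space a search needs.
#[derive(Default)]
struct ScratchMemory {
    visited: Vec<u32>,
}

struct Pool {
    stack: Mutex<Vec<ScratchMemory>>,
}

impl Pool {
    fn new() -> Pool {
        Pool { stack: Mutex::new(Vec::new()) }
    }

    // Pluck a scratch buffer out, or allocate a fresh one if the pool
    // is empty.
    fn get(&self) -> ScratchMemory {
        self.stack.lock().unwrap().pop().unwrap_or_default()
    }

    // Stuff it back in when the search is done, so another search can
    // reuse the allocation.
    fn put(&self, scratch: ScratchMemory) {
        self.stack.lock().unwrap().push(scratch);
    }
}
```

The whole no-std question is about what replaces that `Mutex` when `std` (and therefore `std::sync::Mutex`) isn't available.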
@matklad Also, I've pushed up fixes to your feedback. Thank you. I've also added big warnings to `Pool` and `Lazy` (with a link to your blog post).
This is the current code that implements the spinlock mentioned above for a `Pool`:
> Mara: the big picture here is that, at some point in the next few months, the `regex` crate will provide a way to build without `std`. When it does this, it still needs to use a synchronization primitive to get some mutable scratch space in order to execute a search. Basically, think of it as a `Mutex<Vec<SomeRegexMemory>>`. It plucks something out at the start of a search, does some reading & writing to it, and then stuffs it back in when it's done so that another search can reuse it. Now, with `std` enabled, it does actually use a `Mutex` (with some other tricks). But when you disable `std`, currently, my plan is to have it use a spinlock, because I don't really know what else to do if I want to keep providing the nice high level `Regex::find` APIs (and I do).
If that Vec is just used as a stack, to which you only push and pop one element at a time, each of which will only be used by one thread, it might be simple to do that lock free. :)
(Lock free lists are usually quite tricky, but I think your use case might be one that doesn't run into most of the common tricky issues.)
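The lock-free half of such a stack — push via a compare-exchange loop — can be sketched as below. This is not Mara's actual implementation; concurrent popping is exactly where the tricky issues (like the ABA problem discussed later in the thread) live, so this sketch only drains with exclusive `&mut` access:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

// A minimal intrusive stack: `push` is lock-free (a CAS loop), while
// `drain` requires exclusive access, sidestepping the ABA problem that
// a concurrent `pop` would have to deal with.
pub struct Stack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> Stack<T> {
    pub const fn new() -> Stack<T> {
        Stack { head: AtomicPtr::new(ptr::null_mut()) }
    }

    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        let mut head = self.head.load(Ordering::Relaxed);
        loop {
            // Link our node in front of the current head, then try to
            // publish it. On contention, retry with the new head.
            unsafe { (*node).next = head }
            match self.head.compare_exchange_weak(
                head, node, Ordering::Release, Ordering::Relaxed,
            ) {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }

    pub fn drain(&mut self) -> Vec<T> {
        let mut out = Vec::new();
        let mut cur = self.head.swap(ptr::null_mut(), Ordering::Acquire);
        while !cur.is_null() {
            // Safety: exclusive access, and every node was created by
            // Box::into_raw in `push`.
            let node = unsafe { Box::from_raw(cur) };
            cur = node.next;
            out.push(node.value);
        }
        out
    }
}

impl<T> Drop for Stack<T> {
    fn drop(&mut self) {
        self.drain();
    }
}
```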
My general stance on spinlocks is driven primarily by aesthetic reasons: it's not that I debugged some curious latency spikes in a server; it's just that I understand the mechanics of how spin locks could lead to issues, and that shifts me to "it's not up to the library to make this choice". Maybe in practice it is actually fine! But maybe not: here's someone hitting an actual deadlock in crossbeam. I guess that issue can be used to create a litmus test here:
a `no_std` regex in user-land code (e.g., they are using some library which doesn't need and doesn't have a `std` version, and that library depends on `Regex` with `default-features = false`). That does seem quite unlikely, but it is also not impossible, and the result (a deadlock) is almost as bad as it can get. Is it ok for a library to potentially open the user to this scenario? On aesthetic grounds, definitely not! On practical grounds, it might actually be yes?
One thing I've realized here is that no_std+alloc is in some sense an incoherent world. If you have a global allocator, you must have a proper implementation of Mutex to protect it (that impl can be a no-op for threadless environments). This is somewhat ironic in this situation --- you are building a global memory pool, a specialized version of a global allocator; you need the same kinds of tools to build it, and the tools are actually in your address space (as they are used by the global allocator), but you can't reach them!
I think for the pool, what Mara suggests makes the most sense. Lock-free pool + racy lazy should be totally fine. I am not sure what to do with no_std no_alloc lazy... I see that it is used to initialize an `&[u32]`. As the initialization is deterministic, I am wondering if it would be correct to do something like:
```rust
// pseudocode
if !initialized.load(Acquire) {
    for byte in bytes.iter() {
        byte.store(deterministic_value, Relaxed);
    }
    initialized.store(true, Release);
}
for byte in bytes.iter() {
    byte.load(NonAtomic);
}
```
That is, threads race to overwrite the byte array, but, because they are guaranteed to write the same value, it is actually ok? My reading of the C++ memory model tells me that no, this won't be ok. On the high level, I believe the memory model doesn't have anything specific about "same value", so reasoning relying on that would be wrong. More specifically, as we ultimately want to do non-atomic reads, our atomic relaxed writes do nothing. And, if I remove that relaxed store, it seems clear that this is UB --- you have a non-atomic read and a write racing with each other, and that's definitely UB, even if the write overwrites the same value.
Curiously though, if you also make sure that the reads are atomic relaxed, that should be fine! (But then you'd have to return a `&[AtomicU32]`.)
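That variant — all writes relaxed atomic, a release/acquire flag, and all reads atomic too — might look like the sketch below. This is an illustration only, not the regex-automata code; `deterministic_value` stands in for the deterministic deserialization:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

static INIT: AtomicBool = AtomicBool::new(false);
static TABLE: [AtomicU32; 4] = [
    AtomicU32::new(0),
    AtomicU32::new(0),
    AtomicU32::new(0),
    AtomicU32::new(0),
];

// Stand-in for deterministic deserialization: every thread that races
// into the branch below computes exactly the same values.
fn deterministic_value(i: usize) -> u32 {
    i as u32 * 3 + 1
}

fn get() -> &'static [AtomicU32; 4] {
    if !INIT.load(Ordering::Acquire) {
        // Several threads may run this concurrently, but every access
        // to TABLE is atomic, so there is no data race; and since the
        // values are deterministic, it doesn't matter who "wins".
        for (i, slot) in TABLE.iter().enumerate() {
            slot.store(deterministic_value(i), Ordering::Relaxed);
        }
        // This release store pairs with the acquire load above, so a
        // thread that observes INIT == true also sees the table data.
        INIT.store(true, Ordering::Release);
    }
    &TABLE
}
```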
```rust
pub static {name}: Lazy<DFA<&'static [u32]>> = Lazy::new(|| {{
    {deserialize}
}});
```
I'd maybe write this as

```rust
pub fn {name}() -> &'static DFA<&'static [u32]> {{
    static LAZY: Lazy<DFA<&'static [u32]>> = Lazy::new(|| {{
        {deserialize}
    }});
    &*LAZY
}}
```
to not expose specific synchronization in the API.
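For what it's worth, current stable std has `std::sync::OnceLock`, which can express the same function-wrapping pattern without exposing any synchronization type in the API. A sketch, with `Dfa` as a hypothetical stand-in for the generated `DFA<&'static [u32]>`:

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for DFA<&'static [u32]>.
struct Dfa {
    table: &'static [u32],
}

fn dfa() -> &'static Dfa {
    static LAZY: OnceLock<Dfa> = OnceLock::new();
    // The closure plays the role of the generated {deserialize} code.
    LAZY.get_or_init(|| Dfa { table: &[0, 1, 2] })
}
```

Callers only ever see `fn dfa() -> &'static Dfa`; whether the initialization uses a `Once`, a spin loop, or something else stays an implementation detail.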
As to what to do here in general, I think there are these choices:
Figuring out what the users actually want when they ask for "no_std regex". It might be either of the two:
My personal stance here is that no_std is a different, resource-constrained environment, and so in the general case it just needs to have a different set of APIs. This makes composing libraries hard, but, if you care about resource-constrained environments, you probably care about zero-cost abstractions more than you care about seamless composition, and would be fine with writing some amount of glue code yourself.
An `extern "C" { fn hook(); }` declaration in the upstream crate which is satisfied by the downstream crate and ultimately resolved by the linker. `lazy_static`, I believe, falls back to spinlocks for no_std, so, by (optionally) depending on it, you absolve your own crate from questionable code. "If you just want spinlocks, use lazy_static with all of the caveats" I think is a fine strategy.

Here's a lock-free-ish pool that I think suits your needs: https://gist.github.com/m-ou-se/5fdcbdf7dcf4585199ce2de697f367a4
It does still (very briefly) lock the stack when popping an element, but it does not lock to push an element, nor does it keep anything locked while allocating and creating a new element.
@m-ou-se thank you! The reason why I hadn't used the approach of a lock free structure was indeed because of the ABA problem. It looks like you get around that by still using a spin lock. Does that still fall afoul of the problems with spin locks in general though? (Yours has much smaller scope than mine, of course.)
A spinlock isn't generally bad. A regular `Mutex` also spins for quite a while before it blocks. Using a spinlock when it can sometimes spin for a long time is where it starts becoming a bad idea. In this case, it is only locked between the compare-exchange and the store that immediately follows, not while performing any allocations or syscalls etc. (An `Arc` does exactly the same in `Arc::downgrade()`.)
(But as usual, only a benchmark will provide real answers. ^^)
> It looks like you get around that by still using a spin lock.

(I would say "by using spinning", not "by using a spin lock". A `SpinLock<Vec<T>>` would need to be locked on every access. My implementation only briefly 'locks' for popping, not for pushing or allocating (or resizing a Vec).)
Hmmm interesting. I guess what I had in mind is whether @matklad's deadlock scenario can happen in your pool.
@matklad Thinking about it more, why doesn't the `core::hint::spin_loop()` hint prevent the deadlock from happening? I assume the deadlock happens because the OS doesn't know user code is in the spin lock, but I guess `core::hint::spin_loop()` should resolve that?
> I assume the deadlock happens because the OS doesn't know user code is in the spin lock, but I guess `core::hint::spin_loop()` should resolve that?
A spin loop hint doesn't communicate with the operating system. It's basically just a powerful "200 × `nop`" that also, depending on the architecture, instructs the CPU to switch to the other hyperthread of the same core.
(The operating system equivalent is `std::thread::yield_now()`, but if you are going to do such a syscall, you should consider just doing a futex/locking syscall instead (i.e. use a Mutex).)
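The distinction can be sketched in a small wait loop: `spin_loop()` is a pure CPU-level hint, while `yield_now()` actually asks the OS scheduler to run someone else. This is an illustrative sketch, not any particular library's wait loop:

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Wait for `flag` to become true: spin briefly in user space first,
// then fall back to telling the OS scheduler to run another thread.
fn wait_for(flag: &AtomicBool) {
    let mut spins = 0u32;
    while !flag.load(Ordering::Acquire) {
        if spins < 100 {
            // CPU-level hint only; the OS never hears about this.
            hint::spin_loop();
            spins += 1;
        } else {
            // An actual syscall; a real Mutex would futex-wait here
            // instead of yielding in a loop.
            thread::yield_now();
        }
    }
}
```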
> I guess what I had in mind is whether @matklad's deadlock scenario can happen in your pool.
Yeah, I believe it could happen if:
https://github.com/matklad/spin-of-death/blob/master/src/main.rs (from my post) might be a good starting point to create the above situation.
> A spinlock isn't generally bad.
I think it's important to make a distinction between "spinning a bit before syscall" (awesome perf optimization!) and an actual spinlock, where we burn the CPU, blocked on the progress of some other thread (can create a deadlock if thread priorities are unfair).
That deadlock situation is one that could happen when a thread gets preempted after it has locked but before it has unlocked the list, the newly running thread starts waiting for the list to be unlocked, and the scheduler will never preempt the second thread to let the first one make progress.
This won't happen with purely cooperative scheduling (only switching threads on yield/syscall), since there is no syscall or any user code between the lock and unlock. (Unlike with a `SpinLock<Vec<T>>`, where a lot can happen while it is locked.) Nor will it happen with 'normal' preemptive scheduling. It could happen when using the same regex in a main thread and a signal handler, interrupt handler, or other thread, when that second thread fully blocks the main thread until it finishes (e.g. when using priorities and real time scheduling without limits). This is a rare scenario, but not impossible/unthinkable in some specific (embedded/low level) applications.
This is avoided by putting a maximum on the number of spin cycles and falling back to something else. If you fall back to blocking using the operating system, you've implemented a mutex. But in this case you could also just fall back to not using the pool and allocating/deallocating directly.
Note that using malloc/`Box::new` in a signal handler on Unix is actually not allowed, because it can just result in the exact same deadlock with the lock(s) in the allocator implementation, so that doesn't really solve anything for this case. In other words, for Unix signal handlers, not using the `Pool` and always doing `Box::new` would result in the exact same issue. Doing anything that allocates in a signal handler is simply not supported.
So what's left is (non-preempted/"real time") threads with different priorities, and custom/embedded systems that do not use libc but have their own allocator that is safe to be used in such situations. In those cases, the `Pool::get` I sketched indeed has an issue that `Box::new` wouldn't have.
To solve that issue we need to interact with the operating system in some way, which is basically what we're trying to avoid in `no_std`. The only real operating system operations we have are the ones from the `alloc` crate: allocation and deallocation. So if the options for the fallback after spinning for too long are a mutex, a `yield()` call, thread parking, a futex call, or `Box::new`/`drop`, then falling back to `Box::new`/`drop` after e.g. 100 spin cycles might be the most obvious option. Not the most performant, but maybe that's okay for a rare situation. (Such trade-offs are always super tricky to make.)
Note that `alloc::sync::Arc`'s `downgrade()` and `get_mut()` have the exact same problem in those situations. I'll leave it up to you to decide whether that means that "`Arc` has a bug" or that "those situations are apparently unsupported if you use `alloc`", and what that means for `regex`. :)
Got this to deadlock without signals, just using realtime thread priorities:
https://github.com/matklad/spin-of-death/commit/7ed9770913ddeda6188c752bab47fe6c142422e9
But yeah, it indeed seems like `Arc::downgrade` can deadlock in a similar way if a different lower priority thread does `get_mut` at a bad moment.
Aye, thanks for this everyone! I think I'm just going to swap out my alloc-only implementation of `Pool` with @m-ou-se's from above. I think I'm comfortable being in a similar spot as `Arc::downgrade` in exchange for keeping the nicer higher level API. And I'll add warnings to the docs. I also feel okay about this because, if someone is concerned about it, they will have an escape route by using lower level APIs that don't make use of a `Pool` at all.
Just couldn't resist taking a look at a couple of safety issues that have burnt me in the past in this area :-)
This type
https://github.com/BurntSushi/regex-automata/blob/1d8f4d67e57bba5146627d5f8625db2abb925175/src/util/lazy.rs#L110
has two problems:
First, it misses

```rust
unsafe impl<T: Sync + Send, F: Sync> Sync for Lazy<T, F> {}
```

(the inferred impl misses `+ Send`). This can be weaponized to turn `Lazy` into a channel for sending `Sync + !Send` types across threads.

Second, the type owns `T` (it can drop it), but this isn't visible to the compiler. That is, a `PhantomData<Box<T>>` is missing. In principle, this should be weaponizable by subverting dropck. If the compiler believes that dropping `Lazy` can't drop `T`, then it might allow code where the `T` in the `Lazy` has some dangling pointers when the `Lazy` is dropped, something along the lines of

No, this exact code compiles, because, although the compiler doesn't see that we can drop `T`, it sees that we can drop `F`, and those share the problematic lifetime. I am maybe 70% confident that, because of this, there's no way to weaponize it, but still, adding the phantom data would seem like a good idea.

(Newer versions of the once_cell crate also have this version of the synchronization primitive: https://docs.rs/once_cell/latest/once_cell/race/struct.OnceBox.html. We don't have a spinning no-alloc version though, as I somewhat firmly believe that, for all code that could spin lock without yielding to OS/blocking interrupts/etc, the caller should explicitly type out `.spin` somewhere.)
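Both fixes together look roughly like this on a stripped-down `Lazy` skeleton. This is an illustrative sketch built on std's `Once`, not the actual regex-automata type:

```rust
use std::cell::UnsafeCell;
use std::marker::PhantomData;
use std::mem::MaybeUninit;
use std::sync::Once;

pub struct Lazy<T, F = fn() -> T> {
    once: Once,
    data: UnsafeCell<MaybeUninit<T>>,
    create: F,
    // Fix 2: tell the compiler (and drop check) that Lazy owns a T.
    owned: PhantomData<Box<T>>,
}

// Fix 1: require `T: Send` here. With only `T: Sync`, a Lazy could be
// used to move a `Sync + !Send` value across threads: initialized on
// thread A, then dropped on thread B.
unsafe impl<T: Send + Sync, F: Sync> Sync for Lazy<T, F> {}

impl<T, F: Fn() -> T> Lazy<T, F> {
    pub const fn new(create: F) -> Lazy<T, F> {
        Lazy {
            once: Once::new(),
            data: UnsafeCell::new(MaybeUninit::uninit()),
            create,
            owned: PhantomData,
        }
    }

    pub fn get(&self) -> &T {
        self.once.call_once(|| {
            let value = (self.create)();
            // Safety: call_once runs this at most once, and no shared
            // reference to the data has been handed out yet.
            unsafe { (*self.data.get()).write(value) };
        });
        // Safety: call_once has completed, so the value is initialized
        // and is never written again.
        unsafe { (*self.data.get()).assume_init_ref() }
    }
}

impl<T, F> Drop for Lazy<T, F> {
    fn drop(&mut self) {
        if self.once.is_completed() {
            // Safety: the value is initialized iff the Once completed.
            unsafe { self.data.get_mut().assume_init_drop() };
        }
    }
}
```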