BurntSushi / regex-automata

A low level regular expression library that uses deterministic finite automata.
The Unlicense

On a wip branch, there are some subtle soundness issues with Lazy #30

Closed · matklad closed this 1 year ago

matklad commented 1 year ago

Just couldn't resist taking a look at a couple of safety issues that have burnt me in the past in this area :-)

This type

https://github.com/BurntSushi/regex-automata/blob/1d8f4d67e57bba5146627d5f8625db2abb925175/src/util/lazy.rs#L110

has two problems:

First, it misses unsafe impl<T: Sync + Send, F: Sync> Sync for Lazy<T, F> (the inferred impl misses + Send). This can be weaponized to turn Lazy into a channel for sending Sync+!Send types across threads:

fn sync_not_send<T: Sync + Default>() {
    let lazy = Lazy::new(move || T::default());
    std::thread::scope(|scope| {
        scope.spawn(|| {
            lazy.get(); // We create T in this thread
        });
    });
    // And drop in this thread.
    drop(lazy);
    // So we have sent a !Send type across threads. (With some more legwork,
    // it's possible to even sneak the value out of drop through a thread local.)
}

Second, the type owns T (it can drop it), but this isn't visible to the compiler. That is, a PhantomData<Box<T>> field is missing. In principle, this should be weaponizable by subverting dropck: if the compiler believes that dropping Lazy can't drop T, then it might accept code where the T inside the Lazy has dangling pointers by the time the Lazy is dropped, something along the lines of:

fn no_dropchk() {
    struct Sneaky<'a>(&'a str);
    impl<'a> Drop for Sneaky<'a> {
        fn drop(&mut self) {
            println!(".0 = {}", self.0);
        }
    }

    let lazy;
    let s = " hello".to_string();
    lazy = Lazy::new(|| Sneaky(s.as_str()));
}

Now, this exact code doesn't compile, because, although the compiler doesn't see that we can drop T, it does see that we can drop F, and those share the problematic lifetime. I am maybe 70% confident that, because of this, there's no way to weaponize it, but adding the PhantomData still seems like a good idea.
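Putting both fixes together, a minimal std-based sketch of what the corrected declarations could look like (the real Lazy in regex-automata is core-only and structured differently; the field names here are illustrative):

```rust
use std::cell::UnsafeCell;
use std::marker::PhantomData;
use std::sync::Once;

struct Lazy<T, F = fn() -> T> {
    once: Once,
    data: UnsafeCell<Option<T>>,
    create: F,
    // Fix 2: tell dropck that dropping a Lazy<T, F> may drop a T.
    owns: PhantomData<Box<T>>,
}

// Fix 1: T: Send is required because the thread that drops the Lazy (and
// thus the T inside it) may not be the thread that created the T.
unsafe impl<T: Sync + Send, F: Sync> Sync for Lazy<T, F> {}

impl<T, F: Fn() -> T> Lazy<T, F> {
    fn new(create: F) -> Lazy<T, F> {
        Lazy {
            once: Once::new(),
            data: UnsafeCell::new(None),
            create,
            owns: PhantomData,
        }
    }

    fn get(&self) -> &T {
        self.once.call_once(|| {
            let value = (self.create)();
            // Safety: call_once guarantees exclusive access here.
            unsafe { *self.data.get() = Some(value) };
        });
        // Safety: call_once has completed, so the value is initialized
        // and is never mutated again.
        unsafe { (*self.data.get()).as_ref().unwrap() }
    }
}
```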

(Newer versions of the once_cell crate also have this kind of synchronization primitive: https://docs.rs/once_cell/latest/once_cell/race/struct.OnceBox.html. We don't have a spinning no-alloc version though, as I somewhat firmly believe that, for all code that could spin lock without yielding to OS/blocking interrupts/etc, the caller should explicitly type out .spin somewhere.)

BurntSushi commented 1 year ago

Ahhhhhh yesss! My secret plan to have someone review this code worked. :D

Could I convince you to review Pool too? It's not quite the same as Lazy, but it is similarish. And I just caught a soundness hole in it a few weeks ago. The last three tests in src/util/pool.rs tell that story.

We don't have a spinning no-alloc version though, as I somewhat firmly believe that, for all code that could spin lock without yielding to OS/blocking interrupts/etc, the caller should explicitly type out .spin somewhere)

Yeah, I've read your blog post a few times and I find it pretty compelling. Buuuuuuuuuuut, I don't really know what else to do. For Lazy, the core-only functionality is there so that folks in no-std no-alloc environments have an easier path. Namely, one of the main use cases of regex-automata is the ability to serialize a DFA, and then deserialize it in a much more constrained environment (no-std and no-alloc). I actually don't have a ton of experience in said environments, so I don't quite know just how useful having something like a Lazy type is. But, it certainly makes things like regex-cli generate serialize easier because it just writes out Rust code that uses library types from regex-automata to make a DFA globally available with zero fuss. But, I don't feel too bad about this, because nobody has to use Lazy. It isn't used inside the regex internals anywhere. (Currently, anyway. There are use cases for it, but for a variety of reasons unrelated to Lazy, I haven't been able to make it work as well as I've wanted.)

With that said, I'm also using a spinlock in Pool, and that gets used when only std is disabled. So if you have alloc but not std, then you get a Pool with a spinlock. (Pool is unavailable in no-std and no-alloc environments because it's pretty heavily tied to the idea of allocating something.) In practice where this is going to show up is when folks disable the std feature in the regex crate (which folks have been asking for for a while). So when they do that, a call to say, Regex::find, will ask the pool for some memory to run a search. And that's going to hit the spinlock.

Now, in regex-automata, there is an escape hatch for all of this. You can choose to use a lower level API that lets you pass your own mutable memory, which means the regex doesn't need to talk to a pool at all. But, regex proper doesn't have those APIs, and I don't really want it to have those APIs because it fattens up an already big API for a very niche use case. Maybe eventually I'll cave on that point, but at least at present, I don't feel strongly enough about spinlocks to choose a different point in the design space as I currently conceive it.

I had also been hoping to get to the section on spinlocks in @m-ou-se's book (pinging you on the off chance you want to chime in here) to see if there was any extra insight to be gathered there. I am traveling tomorrow, so I might actually get that chance!

Mara: big picture thing here is that, at some point in the next few months, the regex crate will provide a way to build without std. When it does this, it still needs to use a synchronization primitive to get some mutable scratch space in order to execute a search. Basically, think of it as a Mutex<Vec<SomeRegexMemory>>. It plucks something out at the start of a search, does some reading & writing to it, and then stuffs it back in when it's done so that another search can reuse it. Now with std enabled, it does actually use a Mutex (with some other tricks). But when you disable std, currently, my plan is to have it use a spinlock because I don't really know what else to do if I want to keep providing the nice high level Regex::find APIs (and I do).
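As a mental model for the Mutex<Vec<SomeRegexMemory>> shape described above, a simplified sketch (hypothetical names; the "other tricks" in the real Pool are omitted):

```rust
use std::sync::Mutex;

// Simplified mental model of the scratch-space pool: a search plucks
// scratch memory out at the start and stuffs it back in when done.
struct Pool<T> {
    stack: Mutex<Vec<T>>,
}

impl<T> Pool<T> {
    fn get(&self, create: impl FnOnce() -> T) -> T {
        // Reuse scratch space from a previous search if any is available,
        // otherwise create fresh scratch space.
        self.stack.lock().unwrap().pop().unwrap_or_else(create)
    }

    fn put(&self, value: T) {
        // Return the scratch space so another search can reuse it.
        self.stack.lock().unwrap().push(value);
    }
}
```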

BurntSushi commented 1 year ago

@matklad Also, I've pushed up fixes to your feedback. Thank you. I've also added big warnings to Pool and Lazy (with a link to your blog post).

BurntSushi commented 1 year ago

This is the current code that implements the spinlock mentioned above for a Pool:

https://github.com/BurntSushi/regex-automata/blob/392b1569c1459a11a151b2e31756afdf01723c3e/src/util/pool.rs#L729-L815

m-ou-se commented 1 year ago

Mara: big picture thing here is that, at some point in the next few months, the regex crate will provide a way to build without std. When it does this, it still needs to use a synchronization primitive to get some mutable scratch space in order to execute a search. Basically, think of it as a Mutex<Vec<SomeRegexMemory>>. It plucks something out at the start of a search, does some reading & writing to it, and then stuffs it back in when it's done so that another search can reuse it. Now with std enabled, it does actually use a Mutex (with some other tricks). But when you disable std, currently, my plan is to have it use a spinlock because I don't really know what else to do if I want to keep providing the nice high level Regex::find APIs (and I do).

If that Vec is just used as a stack, to which you push and pop only one element at a time that will only be used by one thread, it might be simple to do that lock free. :)

(Lock free lists are usually quite tricky, but I think your use case might be one that doesn't run into most of the common tricky issues.)

matklad commented 1 year ago

My general stance on spinlocks is driven primarily by aesthetic reasons: it's not that I've debugged some curious latency spikes in a server, it's just that I understand the mechanics of how spin locks can lead to issues, and that shifts me to "it's not up to the library to make this choice". Maybe in practice it is actually fine! But maybe not: here's someone hitting an actual deadlock in crossbeam. I guess that issue can be used to create a litmus test here:

That does seem quite unlikely, but it is also not impossible, and the result (a deadlock) is about as bad as it gets. Is it ok for a library to potentially open the user to this scenario? On aesthetic grounds, definitely not! On practical grounds, it might actually be yes?


One thing I've realized here is that no_std+alloc is in some sense an incoherent world. If you have a global allocator, you must have a proper implementation of Mutex to protect it (that impl can be a no-op for threadless environments). This is somewhat ironic in this situation --- you are building a global memory pool, a specialized version of a global allocator; you need the same kinds of tools to build it, and the tools are actually in your address space (they are used by the global allocator), but you can't reach them!


I think for Pool, what Mara suggests makes the most sense. A lock-free pool + racy lazy should be totally fine. I am not sure what to do with the no_std no_alloc Lazy... I see that it is used to initialize an &[u32]. As the initialization is deterministic, I am wondering if it would be correct to do something like:

if !initialized.load(Acquire) {
  for byte in bytes.iter() { byte.store(deterministic_value, Relaxed) }
  initialized.store(true, Release)
}
for byte in bytes.iter() {
  byte.load(NonAtomic);
}

That is, threads race to overwrite the byte array, but, because they are guaranteed to write the same value, it is actually ok? My reading of the C++ memory model tells me that no, this won't be ok. On a high level, I believe the memory model doesn't have anything specific about "same value", so reasoning relying on that would be wrong. More specifically, as we ultimately want to do non-atomic reads, the atomic relaxed writes don't help. And, if I remove that relaxed store, it seems clear that this is UB --- you have a non-atomic read and a write racing with each other, and that's definitely UB, even if the write overwrites the same value.

Curiously though, if you also make the reads atomic relaxed, that should be fine! (But then you'd have to return a &[AtomicU32].)
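That last (sound) variant can be sketched like this; `deterministic_value` is a stand-in for the real deterministic initialization:

```rust
use std::sync::atomic::{
    AtomicBool, AtomicU32,
    Ordering::{Acquire, Relaxed, Release},
};

// Stand-in for the deterministic table contents: every initializer
// computes and writes exactly the same values.
fn deterministic_value(i: usize) -> u32 {
    i as u32
}

// Threads may race through the `if`, but every writer stores the same
// values, and both the writes and the later reads are atomic, so there is
// no data race. The cost: callers see &[AtomicU32], not &[u32].
fn get_or_init(initialized: &AtomicBool, words: &[AtomicU32]) {
    if !initialized.load(Acquire) {
        for (i, w) in words.iter().enumerate() {
            w.store(deterministic_value(i), Relaxed);
        }
        initialized.store(true, Release);
    }
}
```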


pub static {name}: Lazy<DFA<&'static [u32]>> = Lazy::new(|| {{
    {deserialize}
}});

I'd maybe write this as

pub fn {name}() -> &'static DFA<&'static [u32]> {{
    static LAZY: Lazy<DFA<&'static [u32]>> = Lazy::new(|| {{
        {deserialize}
    }});
    &*LAZY
}}

to not expose specific synchronization in the API.


As to what to do here in general, I think there are these choices:

m-ou-se commented 1 year ago

Here's a lock-free-ish pool that I think suits your needs: https://gist.github.com/m-ou-se/5fdcbdf7dcf4585199ce2de697f367a4

It does still (very briefly) lock the stack when popping an element, but it does not lock to push an element, nor does it keep anything locked while allocating and creating a new element.
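The shape of that approach looks roughly like the following (a rough sketch of the idea, not Mara's actual gist): push is a plain CAS loop, while pop briefly parks a sentinel in the head pointer so no other thread can pop the same node out from under us, sidestepping the ABA problem.

```rust
use std::ptr;
use std::sync::atomic::{
    AtomicPtr,
    Ordering::{Acquire, Relaxed, Release},
};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

struct Stack<T> {
    head: AtomicPtr<Node<T>>,
}

// Sentinel value parked in `head` while a pop is in progress. It is never
// dereferenced, only compared against.
fn locked<T>() -> *mut Node<T> {
    usize::MAX as *mut Node<T>
}

impl<T> Stack<T> {
    fn new() -> Stack<T> {
        Stack { head: AtomicPtr::new(ptr::null_mut()) }
    }

    fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Relaxed);
            if head == locked() {
                // A pop is in flight; wait for it to store the new head.
                std::hint::spin_loop();
                continue;
            }
            unsafe { (*node).next = head };
            // No lock needed: even if `head` was popped and re-pushed in
            // between (ABA), `node.next` still points at the current top.
            if self.head.compare_exchange(head, node, Release, Relaxed).is_ok() {
                return;
            }
        }
    }

    fn pop(&self) -> Option<T> {
        loop {
            let head = self.head.load(Acquire);
            if head.is_null() {
                return None;
            }
            if head == locked() {
                std::hint::spin_loop();
                continue;
            }
            // Briefly "lock" the stack: between this compare_exchange and
            // the store below, no other thread can pop or free `head`.
            if self.head.compare_exchange(head, locked(), Acquire, Relaxed).is_ok() {
                let next = unsafe { (*head).next };
                self.head.store(next, Release);
                return Some(unsafe { Box::from_raw(head) }.value);
            }
        }
    }
}
```

The only spinning happens while another thread is between its compare_exchange and its store, which is a couple of instructions with no allocation or syscall in between.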

BurntSushi commented 1 year ago

@m-ou-se thank you! The reason why I hadn't used the approach of a lock-free structure was indeed because of the ABA problem. It looks like you get around that by still using a spin lock. Does that still fall afoul of the problems with spin locks in general, though? (Yours has a much smaller scope than mine, of course.)

m-ou-se commented 1 year ago

A spinlock isn't generally bad. A regular Mutex also spins for quite a while before it blocks. Using a spinlock when it can sometimes spin for a long time is where it starts becoming a bad idea. In this case, it is only locked between the compare-exchange and the store that immediately follows, not while performing any allocations or syscalls etc. (An Arc does exactly the same in Arc::downgrade().)

m-ou-se commented 1 year ago

(But as usual, only a benchmark will provide real answers. ^^)

m-ou-se commented 1 year ago

It looks like you get around that by still using a spin lock.

(I would say "by using spinning", not "by using a spin lock". A SpinLock<Vec<T>> would need to be locked on every access. My implementation only briefly 'locks' for popping, not for pushing or allocating (or resizing a Vec).)

BurntSushi commented 1 year ago

Hmmm interesting. I guess what I had in mind is whether @matklad's deadlock scenario can happen in your pool.

@matklad Thinking about it more, why doesn't the core::hint::spin_loop() hint prevent the deadlock from happening? I assume the deadlock happens because the OS doesn't know user code is in the spin lock, but I guess core::hint::spin_loop() should resolve that?

m-ou-se commented 1 year ago

I assume the deadlock happens because the OS doesn't know user code is in the spin lock, but I guess core::hint::spin_loop() should resolve that?

A spin loop hint doesn't communicate with the operating system. It's basically just a powerful "200 × nop" that also instructs the cpu to switch to the other hyperthread of the same core, depending on the architecture.

(The operating system equivalent is std::thread::yield_now(), but if you are going to do such a syscall you should consider just doing a futex/locking syscall instead (i.e. use a Mutex).)

matklad commented 1 year ago

I guess what I had in mind is whether @matklad's deadlock scenario can happen in your pool.

Yeah, I believe it could happen if:

https://github.com/matklad/spin-of-death/blob/master/src/main.rs (from my post) might be a good starting point to create the above situation.

A spinlock isn't generally bad.

I think it's important to make a distinction between "spinning a bit before syscall" (awesome perf optimization!) and an actual spinlock, where we burn the CPU, blocked on the progress of some other thread (can create a deadlock if thread priorities are unfair).

m-ou-se commented 1 year ago

That deadlock situation is one that could happen when a thread gets preempted after it has locked but before it has unlocked the list, the newly running thread starts waiting for the list to be unlocked, and the scheduler will never preempt the second thread to let the first one make progress.

This won't happen with purely cooperative scheduling (only switching threads on yield/syscall), since there is no syscall or any user code between the lock and unlock. (Unlike with a SpinLock<Vec<T>>, where a lot can happen while it is locked.) Nor will it happen with 'normal' preemptive scheduling. It could happen when using the same regex in a main thread and a signal handler, interrupt handler or other thread when that second thread fully blocks the main thread until it finishes (e.g. when using priorities and real time scheduling without limits). This is a rare scenario, but not impossible/unthinkable in some specific (embedded/low level) applications.

This is avoided by putting a maximum on the number of spin cycles and falling back to something else. If you fall back to blocking using the operating system, you've implemented a mutex. But in this case you could also just fall back to not using the pool and allocating/deallocating directly.

Note that using malloc/Box::new in a signal handler on Unix is actually not allowed, because it can just result in the exact same deadlock with the lock(s) in the allocator implementation, so that doesn't really solve anything for this case. In other words, for Unix signal handlers, not using the Pool and always doing Box::new would result in the exact same issue. Doing anything that allocates in a signal handler is simply not supported.

So what's left is (non-preempted/"real time") threads with different priorities, and custom/embedded systems that do not use libc but have their own allocator that is safe to be used in such situations. In those cases, the Pool::get I sketched indeed has an issue that Box::new wouldn't have.

To solve that issue we need to interact with the operating system in some way, which is basically what we're trying to avoid in no_std. The only real operating system operations we have are the ones from the alloc crate: allocation and deallocation. So if the options for the fallback for spinning for too long are: a mutex, a yield() call, thread parking, a futex call, or Box::new/drop, then falling back to Box::new/drop after e.g. a 100 spin cycles might be the most obvious option. Not the most performant, but maybe that's okay for a rare situation. (Such trade offs are always super tricky to make.)
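That bounded-spin-then-allocate fallback can be sketched like this (hypothetical names; std's Mutex::try_lock stands in for the briefly-locked stack):

```rust
use std::sync::Mutex;

// Assumed bound: how many failed attempts before giving up on the pool.
const SPIN_LIMIT: usize = 100;

struct Pool<T> {
    stack: Mutex<Vec<Box<T>>>,
    create: fn() -> T,
}

impl<T> Pool<T> {
    fn get(&self) -> Box<T> {
        for _ in 0..SPIN_LIMIT {
            match self.stack.try_lock() {
                Ok(mut stack) => {
                    // Got the stack: reuse pooled scratch space if there
                    // is any, otherwise allocate a fresh one.
                    return stack.pop().unwrap_or_else(|| Box::new((self.create)()));
                }
                // Contended: spin, but only a bounded number of times.
                Err(_) => std::hint::spin_loop(),
            }
        }
        // Spun for too long (e.g. the lock holder is a preempted
        // lower-priority thread): fall back to allocating directly
        // instead of risking a deadlock.
        Box::new((self.create)())
    }

    fn put(&self, value: Box<T>) {
        if let Ok(mut stack) = self.stack.try_lock() {
            stack.push(value);
        }
        // If contended, just drop `value` rather than block.
    }
}
```

Not the most performant path when the limit is hit, but, as noted above, that situation is rare.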

m-ou-se commented 1 year ago

Note that alloc::sync::Arc's downgrade() and get_mut() have the exact same problem in those situations. I'll leave it up to you to decide whether that means that "Arc has a bug" or that "those situations are apparently unsupported if you use alloc" and what that means for regex. :)

matklad commented 1 year ago

Got this to deadlock without signals, just using realtime thread priorities:

https://github.com/matklad/spin-of-death/commit/7ed9770913ddeda6188c752bab47fe6c142422e9

But yeah, it indeed seems like Arc::downgrade can deadlock in a similar way if a different lower priority thread does get_mut in a bad moment.

BurntSushi commented 1 year ago

Aye, thanks for this everyone! I think I'm just going to swap out my alloc-only implementation of Pool with @m-ou-se's from above. I think I'm comfortable being in a similar spot as Arc::downgrade in exchange for keeping the nicer higher level API. And I'll add warnings to the docs. I also feel okay about this because if someone is concerned about it, they will have an escape route by using lower level APIs that don't make use of a Pool at all.