no_std compatible crossbeam-deque

jrobsonchase commented 5 years ago

Hi! I'm working on a Futures executor for no_std environments and was looking for a lock-free queue when I found crossbeam-deque, which is unfortunately not no_std.

It seems to me that it's possible to add variants of all of the methods that call epoch::pin and epoch::is_pinned functions which instead take an explicit &LocalHandle argument to do the pinning. The original methods could then use the no_std variants under the hood via the crossbeam_epoch::with_handle function. The only other thing I see that has an std requirement is Backoff::snooze which calls thread::yield_now. Maybe its body could just call spin if compiled without std?

I've got this pretty much all implemented locally and would be happy to clean it up and submit a PR, but I have some API/documentation questions.

Would it be better to have separate types for std/no_std environments or to have them unified and instead use *_with_handle methods next to the original ones?
Duplicating all of these methods for std/no_std could also mean duplicating documentation. Would it instead be better to simply link to the no_std methods from their std counterparts so that we wouldn't have two places to update descriptions/examples?

ghost commented 5 years ago

I'm already preparing two PRs that:

Make Backoff work in #[no_std].
Remove crossbeam-epoch from crossbeam-deque and replace it with simple hazard pointers.

Your approach would work as well, but since I've been planning to remove crossbeam-epoch anyway, perhaps we should just do it now :)

Question: Does your #[no_std] environment have a heap allocator? I suppose so, but just double checking.

jrobsonchase commented 5 years ago

Ah, cool! Is the code pushed somewhere? I'd love to give it a go!

Yeah, I've got an allocator, no problems there :)

ghost commented 5 years ago

Not yet, but hopefully I'll push it this week!

ghost commented 5 years ago

What do you think is the best way to differentiate threads in no_std without TLS?

One of the solutions is as you suggested, passing a &LocalHandle (or any other kind of thread identifier) to every function that would otherwise have to access TLS.

Unfortunately, this duplicates the whole API just for no_std support. I wonder if there is a more elegant way. Any ideas?

How about using the thread_id crate - would that be a viable approach in your case?

jrobsonchase commented 5 years ago

thread_id only seems to have implementations that use a syscall, either directly in the case of redox or via the win/posix APIs.

On cortex-m, there's the SCB which has a register containing the active operating mode (thread, exception, or interrupt): VectActive. In my limited experience, the IRQn seems like it would be sufficient to identify the current "thread." You could avoid the trait abstraction if support for cortex-m was added to thread_id, but there are other platforms that would also need to be added if they were to be supported.

Maybe an extra layer of abstraction?

trait ThreadId {
    /// Get a unique identifier for the current thread.
    fn get() -> usize;
}

Then you could include it as PhantomData for a the generic case and add a default specialization for the common cases.

struct ThingThatNeedsAThread<T: ThreadId> {
    ...,
    _ph: PhantomData<fn() -> T>,
}

#[cfg(any(target-os = "redox", windows, unix))]
{
    extern crate thread_id;

    struct SyscallId;

    impl ThreadId for SyscallId {
        fn get() -> usize {
            thread_id::get()
        }
    }

    pub type ThingThatMostPeopleUse = ThingThatNeedsAThread<SyscallId>;
}

Or something along those lines.

Amanieu commented 5 years ago

In crossbeam-skiplist the no_std API requires you to pass in a &Guard to all functions. How you obtain a &Guard is the user's responsibility (normally you would get it from your thread-local LocalHandle).

ghost commented 5 years ago

@jrobsonchase I wonder: is there anything wrong with using syscalls to determine the current thread ID? It seems to me like it's the most convenient solution for users of this library.

@Amanieu Do you perhaps recall why we went with that approach? If we made it possible to use the default epoch GC in #[no_std] by simulating our own TLS (since thread_local! is unavailable) using syscalls, would that work? Any drawbacks?

Amanieu commented 5 years ago

The reason I went with that approach is that I (ab)use crossbeam-epoch as a QSBR system in which I have very long-lived Guards and I wanted to avoid going through LocalHandle to get a new one for each skiplist operation.

jrobsonchase commented 5 years ago

@stjepang well, syscalls assume the presence of some OS to service them, which you're not going to have on embedded devices, which afaik is the main usecase for no_std.

I came up with my no_std strategy for crossbeam-deque after looking at crossbeam-skiplist actually. I was originally going to use a Guard before realizing that you could allow it to be more granular by passing a LocalHandle instead.

ghost commented 5 years ago

Interesting, thanks for clarification!

Can you tell more about the specs of the embedded device? More specifically:

How many cores/threads there are?
How much memory is available?

Ok, so in case of crossbeam-deque, we need a function get_id() -> usize that returns some kind of identifier for the current thread. However:

It's okay if get_id() returns the same ID on different threads with very low probability.
It's okay if get_id() returns different IDs on the same thread with very low probability.
Operations invoked on the same thread in quick succession should use the same ID with very high probability.

Using the CPU identifier (like IRQn, as @jrobsonchase mentioned) is a good guess for identifying the current thread. Not correct every single time (context switching!), but works in most cases.

If we use this ID to disperse threads among a bunch of hazard pointer slots, there should be little contention in practice.

How crazy would it be to use the current stack pointer address and divide it by page size to choose the hazard pointer slot? For example:

fn get_id() -> usize {
    let var = &mut 0u8;
    let addr = var as *mut u8 as usize;
    addr / 4096
}

fn hazard_index() -> usize {
    get_id() % NUM_HAZARD_PTRS
}

This idea is similar to how we disperse AtomicCells among a bunch of hidden global spinlocks. What do you think?

jrobsonchase commented 5 years ago

How many cores/threads there are?

I think the specific device I'm working with only has one core/thread, but interrupts are kind of conceptually "threads" in many regards. Correct Send/Sync bounds are required to ensure correct concurrency when interrupts are involved. Race conditions between the main context and interrupts are very much a thing if you're not careful, and not a lot of fun to debug :stuck_out_tongue:

How much memory is available?

Mine has 64k (or 128k sometimes) flash and 20k RAM.

The device in question. Basically a BluePill but "better."

From my understanding, cortex-m is a pretty large family of devices, so mine isn't representative of them as a whole. Not sure if multicore variants exist.

I think that your proposed stack pointer approach wouldn't work when interrupts are involved since I think they share the same stack space as the main code, so you'd frequently get the same id in code running just before an interrupt fires and in the interrupt itself. It's also possible (and likely) that the interrupt stack is dependent on the platform and is handled differently for different architectures.

jrobsonchase commented 5 years ago

I should also note that with the way I'm using crossbeam-deque currently, I don't even need any of the GC-requiring functionality. I'm only using the Injector which works perfectly well when I've got a single-threaded event loop receiving wakeup requests from interrupts.

ghost commented 5 years ago

I think that your proposed stack pointer approach wouldn't work when interrupts are involved since I think they share the same stack space as the main code, so you'd frequently get the same id in code running just before an interrupt fires and in the interrupt itself.

Just to elaborate a little bit, if a thread notices that the chosen hazard pointer slot is already occupied by another thread, it would try grabbing another slot until it finds a vacant one.

I don't even need any of the GC-requiring functionality. I'm only using the Injector which works perfectly well when I've got a single-threaded event loop receiving wakeup requests from interrupts.

Really? Are you not even using Worker? :) If Injector is all you need, then just using SegQueue would be the way to go. It has almost the exact same implementation, we'd just need to make it compatible with #[no_std].

jrobsonchase commented 5 years ago

If Injector is all you need, then just using SegQueue would be the way to go.

Ok, I'll take a look at that!

Are you not even using Worker?

Not currently, but I don't want to rule out the possibility ;)

Just to elaborate a little bit, if a thread notices that the chosen hazard pointer slot is already occupied by another thread, it would try grabbing another slot until it finds a vacant one.

Ah, ok, so it would still be technically correct. But if that's a common occurrence, it sounds like it could result in a not-insignificant amount of overhead, which is obviously undesirable on resource-constrained systems.

jrobsonchase commented 5 years ago

Hmm, so I'm finally getting around to trying to understand the actual lock-free queueing algorithm being used rather than just blindly replacing the std-requiring bits, and I'm not actually sure that it'll work for my use case after all :confused:

It works well enough right now when I have just a simple task where queue pushes only happen from interrupts, but if there's ever a situation where a future calls LocalWaker::wake from its poll method, i.e. from the main thread, is it possible for push operations in an interrupt to get stuck in the "wait for next block to be installed" backoff loop? Any spin-wait loops in non-preemtable contexts are going to just cause a deadlock.

ghost commented 5 years ago

Ah, I see. So, data structures in Crossbeam were primarily designed to leverage parallelism in multicore machines. Your environment is a bit different becuase it has the following constraints:

Parallelism is not so much important as lock-freedom, also known as signal-safety. This is because interrupts may happen at any time and must be able to make progress without continuing the interrupted "thread".
Resources are scarce, i.e. there is typically only a single core and a rather modest amount of available memory.

Perhaps a simple lock-free SPSC queue would do the job? Do you need a bounded queue with fixed-size buffer or unbounded queue with dynamic allocation?

jrobsonchase commented 5 years ago

Part of my problem is that while I have a specific environment in mind, I want to make it usable in a more general case as well. In my current just-get-it-working implementation, I'm using a regular VecDeque wrapped in lock_api's Mutex type and requiring library consumers to provide their own RawMutex implementation.

On my platform, all the RawMutex implementation actually entails is disabling/re-enabling interrupts on lock/unlock, which is just a couple of instructions. On systems with more full-fledged mutex/threading support though, I'd prefer a strategy that avoids the potential for excess lock contention.

I looked into the std::sync::mpsc queue implementation a bit, and it looks pretty promising, aside from allocating/deallocating a Box on every push/pop, which requires the exact same "Mutex" lock on cortex-m in addition to whatever overhead the allocator incurs. But maybe it's a fine middle ground? The "inconsistent" state should never be observable from the event loop where pop is called since it'll never preempt one of the interrupts calling push.

It definitely needs to be unbounded - it supports unbounded task spawning, which means that there are at least that many possible wakers that could be enqueueing their corresponding task.

ghost commented 5 years ago

On my platform, all the RawMutex implementation actually entails is disabling/re-enabling interrupts on lock/unlock, which is just a couple of instructions

I suppose the global allocator does the same before/after allocations/deallocations because it internally uses locks as well? Could you use the same approach to prevent interrupts during queue operations?

The strategy you mention for mpsc sounds good to me! But SegQueue should work in that case just as well, too.

crossbeam-rs / crossbeam

no_std compatible crossbeam-deque #298