jrobsonchase opened 5 years ago
I'm already preparing two PRs that:

1. Make `Backoff` work in `#[no_std]`.
2. Remove `crossbeam-epoch` from `crossbeam-deque` and replace it with simple hazard pointers.

Your approach would work as well, but since I've been planning to remove `crossbeam-epoch` anyway, perhaps we should just do it now :)
Question: Does your #[no_std]
environment have a heap allocator? I suppose so, but just double checking.
Ah, cool! Is the code pushed somewhere? I'd love to give it a go!
Yeah, I've got an allocator, no problems there :)
Not yet, but hopefully I'll push it this week!
What do you think is the best way to differentiate threads in no_std
without TLS?
One of the solutions is as you suggested, passing a &LocalHandle
(or any other kind of thread identifier) to every function that would otherwise have to access TLS.
Unfortunately, this duplicates the whole API just for no_std
support. I wonder if there is a more elegant way. Any ideas?
How about using the thread_id
crate - would that be a viable approach in your case?
thread_id
only seems to have implementations that use a syscall, either directly in the case of redox or via the win/posix APIs.
On cortex-m, there's the SCB
which has a register containing the active operating mode (thread, exception, or interrupt): VectActive. In my limited experience, the IRQn seems like it would be sufficient to identify the current "thread." You could avoid the trait abstraction if support for cortex-m was added to thread_id
, but there are other platforms that would also need to be added if they were to be supported.
Maybe an extra layer of abstraction?
```rust
trait ThreadId {
    /// Get a unique identifier for the current thread.
    fn get() -> usize;
}
```
Then you could include it as `PhantomData` for the generic case and add a default specialization for the common cases.
```rust
struct ThingThatNeedsAThread<T: ThreadId> {
    // ...
    _ph: PhantomData<fn() -> T>,
}

#[cfg(any(target_os = "redox", windows, unix))]
mod syscall {
    extern crate thread_id;

    pub struct SyscallId;

    impl super::ThreadId for SyscallId {
        fn get() -> usize {
            thread_id::get()
        }
    }

    pub type ThingThatMostPeopleUse = super::ThingThatNeedsAThread<SyscallId>;
}
```
Or something along those lines.
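For what it's worth, on a hosted target the trait could be implemented without the `thread_id` crate at all, by hashing `std`'s opaque `ThreadId`. A minimal sketch (`StdId` is a made-up name; the resulting `usize` is only unique with overwhelming probability, not guaranteed):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

trait ThreadId {
    /// Get a unique identifier for the current thread.
    fn get() -> usize;
}

/// Hosted example: derive a usize from std's opaque ThreadId by hashing it.
struct StdId;

impl ThreadId for StdId {
    fn get() -> usize {
        let mut h = DefaultHasher::new();
        std::thread::current().id().hash(&mut h);
        h.finish() as usize
    }
}

fn main() {
    // Stable within a thread...
    let a = StdId::get();
    assert_eq!(a, StdId::get());
    // ...and different across threads (with overwhelming probability).
    let other = std::thread::spawn(StdId::get).join().unwrap();
    assert_ne!(a, other);
}
```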
In crossbeam-skiplist
the no_std
API requires you to pass in a &Guard
to all functions. How you obtain a &Guard
is the user's responsibility (normally you would get it from your thread-local LocalHandle
).
@jrobsonchase I wonder: is there anything wrong with using syscalls to determine the current thread ID? It seems to me like it's the most convenient solution for users of this library.
@Amanieu Do you perhaps recall why we went with that approach? If we made it possible to use the default epoch GC in #[no_std]
by simulating our own TLS (since thread_local!
is unavailable) using syscalls, would that work? Any drawbacks?
The reason I went with that approach is that I (ab)use crossbeam-epoch as a QSBR system in which I have very long-lived `Guard`s and I wanted to avoid going through `LocalHandle` to get a new one for each skiplist operation.
@stjepang well, syscalls assume the presence of some OS to service them, which you're not going to have on embedded devices, which afaik is the main usecase for no_std
.
I came up with my no_std
strategy for crossbeam-deque
after looking at crossbeam-skiplist
actually. I was originally going to use a Guard
before realizing that you could allow it to be more granular by passing a LocalHandle
instead.
Interesting, thanks for clarification!
Can you tell us more about the specs of the embedded device? More specifically:

* How many cores/threads does it have?
* How much memory is available?
Ok, so in case of `crossbeam-deque`, we need a function `get_id() -> usize` that returns some kind of identifier for the current thread. However:

* `get_id()` returns the same ID on different threads with very low probability.
* `get_id()` returns different IDs on the same thread with very low probability.

Using the CPU identifier (like IRQn, as @jrobsonchase mentioned) is a good guess for identifying the current thread. Not correct every single time (context switching!), but works in most cases.
If we use this ID to disperse threads among a bunch of hazard pointer slots, there should be little contention in practice.
How crazy would it be to use the current stack pointer address and divide it by page size to choose the hazard pointer slot? For example:
```rust
fn get_id() -> usize {
    let var = &mut 0u8;
    let addr = var as *mut u8 as usize;
    addr / 4096
}

fn hazard_index() -> usize {
    get_id() % NUM_HAZARD_PTRS
}
```
This idea is similar to how we disperse `AtomicCell`s among a bunch of hidden global spinlocks. What do you think?
I think the specific device I'm working with only has one core/thread, but interrupts are kind of conceptually "threads" in many regards. Correct Send
/Sync
bounds are required to ensure correct concurrency when interrupts are involved. Race conditions between the main context and interrupts are very much a thing if you're not careful, and not a lot of fun to debug :stuck_out_tongue:
Mine has 64k (or 128k sometimes) flash and 20k RAM.
The device in question. Basically a BluePill but "better."
From my understanding, cortex-m is a pretty large family of devices, so mine isn't representative of them as a whole. Not sure if multicore variants exist.
I think that your proposed stack pointer approach wouldn't work when interrupts are involved since I think they share the same stack space as the main code, so you'd frequently get the same id in code running just before an interrupt fires and in the interrupt itself. It's also possible (and likely) that the interrupt stack is dependent on the platform and is handled differently for different architectures.
I should also note that with the way I'm using crossbeam-deque
currently, I don't even need any of the GC-requiring functionality. I'm only using the Injector
which works perfectly well when I've got a single-threaded event loop receiving wakeup requests from interrupts.
> I think that your proposed stack pointer approach wouldn't work when interrupts are involved since I think they share the same stack space as the main code, so you'd frequently get the same id in code running just before an interrupt fires and in the interrupt itself.
Just to elaborate a little bit, if a thread notices that the chosen hazard pointer slot is already occupied by another thread, it would try grabbing another slot until it finds a vacant one.
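That probing scheme could look roughly like this. A hosted sketch using `std` atomics (the slot count, the `id | 1` occupancy marker, and the linear probing are all illustrative choices, not the actual crossbeam design; note the loop spins forever if every slot is taken):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const NUM_HAZARD_PTRS: usize = 8;
const VACANT: usize = 0;

// Global hazard-pointer slots; a nonzero value marks a slot as owned.
#[allow(clippy::declare_interior_mutable_const)]
const INIT: AtomicUsize = AtomicUsize::new(VACANT);
static SLOTS: [AtomicUsize; NUM_HAZARD_PTRS] = [INIT; NUM_HAZARD_PTRS];

/// Claim a slot: start at the slot suggested by `id` and probe linearly
/// with CAS until a vacant one is found. Returns the claimed index.
fn acquire_slot(id: usize) -> usize {
    let mut i = id % NUM_HAZARD_PTRS;
    loop {
        // `id | 1` guarantees the stored marker is nonzero.
        if SLOTS[i]
            .compare_exchange(VACANT, id | 1, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return i;
        }
        i = (i + 1) % NUM_HAZARD_PTRS; // occupied, try the next slot
    }
}

fn release_slot(i: usize) {
    SLOTS[i].store(VACANT, Ordering::Release);
}

fn main() {
    let a = acquire_slot(42);
    let b = acquire_slot(42); // same id: preferred slot taken, probes onward
    assert_ne!(a, b);
    release_slot(a);
    release_slot(b);
}
```

So a colliding `get_id()` costs only extra probing, not incorrectness.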
> I don't even need any of the GC-requiring functionality. I'm only using the `Injector` which works perfectly well when I've got a single-threaded event loop receiving wakeup requests from interrupts.
Really? Are you not even using `Worker`? :) If `Injector` is all you need, then just using `SegQueue` would be the way to go. It has almost the exact same implementation, we'd just need to make it compatible with `#[no_std]`.
> If `Injector` is all you need, then just using `SegQueue` would be the way to go.
Ok, I'll take a look at that!
> Are you not even using `Worker`?
Not currently, but I don't want to rule out the possibility ;)
> Just to elaborate a little bit, if a thread notices that the chosen hazard pointer slot is already occupied by another thread, it would try grabbing another slot until it finds a vacant one.
Ah, ok, so it would still be technically correct. But if that's a common occurrence, it sounds like it could result in a not-insignificant amount of overhead, which is obviously undesirable on resource-constrained systems.
Hmm, so I'm finally getting around to trying to understand the actual lock-free queueing algorithm being used rather than just blindly replacing the std-requiring bits, and I'm not actually sure that it'll work for my use case after all :confused:
It works well enough right now when I have just a simple task where queue pushes only happen from interrupts, but if there's ever a situation where a future calls `LocalWaker::wake` from its `poll` method, i.e. from the main thread, is it possible for push operations in an interrupt to get stuck in the "wait for next block to be installed" backoff loop? Any spin-wait loops in non-preemptible contexts are going to just cause a deadlock.
Ah, I see. So, data structures in Crossbeam were primarily designed to leverage parallelism in multicore machines. Your environment is a bit different because it has the following constraints:

* Parallelism is not as important as lock-freedom, also known as signal-safety. This is because interrupts may happen at any time and must be able to make progress without resuming the interrupted "thread".
* Resources are scarce, i.e. there is typically only a single core and a rather modest amount of available memory.
Perhaps a simple lock-free SPSC queue would do the job? Do you need a bounded queue with fixed-size buffer or unbounded queue with dynamic allocation?
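To make the bounded flavor concrete, here's a minimal sketch of a lock-free SPSC ring buffer over `std` atomics (the name `Spsc` and all sizes are made up; one slot is kept empty to distinguish full from empty, so a buffer of `N` holds `N - 1` items):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal bounded SPSC ring buffer: one producer, one consumer, no locks.
struct Spsc<T, const N: usize> {
    buf: [UnsafeCell<Option<T>>; N],
    head: AtomicUsize, // next index to pop (advanced only by the consumer)
    tail: AtomicUsize, // next index to push (advanced only by the producer)
}

// Safe to share: each index is written by exactly one side.
unsafe impl<T: Send, const N: usize> Sync for Spsc<T, N> {}

impl<T, const N: usize> Spsc<T, N> {
    fn new() -> Self {
        Spsc {
            buf: std::array::from_fn(|_| UnsafeCell::new(None)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer side: returns Err(value) when the queue is full.
    fn push(&self, value: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        let next = (tail + 1) % N;
        if next == self.head.load(Ordering::Acquire) {
            return Err(value); // full
        }
        unsafe { *self.buf[tail].get() = Some(value) };
        self.tail.store(next, Ordering::Release);
        Ok(())
    }

    /// Consumer side: returns None when the queue is empty.
    fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        if head == self.tail.load(Ordering::Acquire) {
            return None; // empty
        }
        let value = unsafe { (*self.buf[head].get()).take() };
        self.head.store((head + 1) % N, Ordering::Release);
        value
    }
}

fn main() {
    let q: Spsc<u32, 4> = Spsc::new();
    assert!(q.push(1).is_ok());
    assert!(q.push(2).is_ok());
    assert!(q.push(3).is_ok());
    assert!(q.push(4).is_err()); // capacity is N - 1 = 3
    assert_eq!(q.pop(), Some(1));
}
```

Both `push` and `pop` are wait-free, which is exactly the signal-safety property interrupts need; the trade-off is the fixed capacity.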
Part of my problem is that while I have a specific environment in mind, I want to make it usable in a more general case as well. In my current just-get-it-working implementation, I'm using a regular VecDeque
wrapped in lock_api's Mutex
type and requiring library consumers to provide their own RawMutex
implementation.
On my platform, all the RawMutex
implementation actually entails is disabling/re-enabling interrupts on lock/unlock, which is just a couple of instructions. On systems with more full-fledged mutex/threading support though, I'd prefer a strategy that avoids the potential for excess lock contention.
I looked into the std::sync::mpsc
queue implementation a bit, and it looks pretty promising, aside from allocating/deallocating a Box
on every push/pop, which requires the exact same "Mutex" lock on cortex-m in addition to whatever overhead the allocator incurs. But maybe it's a fine middle ground? The "inconsistent" state should never be observable from the event loop where pop
is called since it'll never preempt one of the interrupts calling push
.
It definitely needs to be unbounded - it supports unbounded task spawning, which means that there are at least that many possible wakers that could be enqueueing their corresponding task.
> On my platform, all the `RawMutex` implementation actually entails is disabling/re-enabling interrupts on lock/unlock, which is just a couple of instructions.
I suppose the global allocator does the same before/after allocations/deallocations because it internally uses locks as well? Could you use the same approach to prevent interrupts during queue operations?
The strategy you mention for mpsc
sounds good to me! But SegQueue
should work in that case just as well, too.
Hi! I'm working on a `Future`s executor for `no_std` environments and was looking for a lock-free queue when I found `crossbeam-deque`, which is unfortunately not `no_std`.

It seems to me that it's possible to add variants of all of the methods that call the `epoch::pin` and `epoch::is_pinned` functions, which instead take an explicit `&LocalHandle` argument to do the pinning. The original methods could then use the `no_std` variants under the hood via the `crossbeam_epoch::with_handle` function. The only other thing I see that has an `std` requirement is `Backoff::snooze`, which calls `thread::yield_now`. Maybe its body could just call `spin` if compiled without `std`?

I've got this pretty much all implemented locally and would be happy to clean it up and submit a PR, but I have some API/documentation questions.
* Would it be better to have separate APIs for `std`/`no_std` environments, or to have them unified and instead use `*_with_handle` methods next to the original ones?
* Separate methods for `std`/`no_std` could also mean duplicating documentation. Would it instead be better to simply link to the `no_std` methods from their `std` counterparts so that we wouldn't have two places to update descriptions/examples?