baiwfg2 opened this issue 3 years ago
http://www2.rdrop.com/users/paulmck/RCU/ (papers and articles on RCU)
The link above redirects here: https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit
http://www2.rdrop.com/~paulmck/RCU/rclock_OLS.2001.05.01c.pdf
Routing-table updates as an example of a read-mostly data structure?
Gives the kernel-module-unloading example.
A quiescent state is a point in the code at which it is guaranteed that all previous operations have completed; the end of the grace period is detected via quiescent states.
Without special action in the deletion code, the search code would be prone to races with those deletions, described in Section 2.1. To handle such race conditions, the update side performs the update in two phases as follows: (1) remove permanent-variable pointers to the item being deleted, and (2) after a grace period has elapsed, free up the item's memory.
The grace period is not a fixed time duration, but is instead inferred by checking for per-CPU quiescent states, such as context switches in non-preemptive environments. Since kernel threads are prohibited from holding locks across a context switch, they are also prohibited from holding pointers to data structures protected by those locks across context switches: after all, the entire data structure could well be deleted by some other CPU at any time this CPU does not hold the lock.
A trivial implementation of RCU in a non-preemptive kernel could simply declare the grace period over once it observed each CPU undergoing a context switch. Once the grace period completes, no CPU references the deleted item, and no CPU can gain such a reference. It is therefore safe to free up the deleted item.
With this approach, searches already in progress when the first phase executes might (or might not) see the item being deleted. However, searches that start after the first phase completes are guaranteed to never reference this item. Efficient mechanisms for determining the duration of the grace period are key to RCU, and are described fully elsewhere [McK02a]. These mechanisms track "naturally occurring" quiescent states, which removes the need for adding expensive context switches. In addition, these mechanisms take advantage of the fact that a single grace period can satisfy multiple concurrent updates, amortizing the cost of detecting a grace period over the updates.
Conditions for scenarios where RCU is a good fit:
Traditional locking is limited by worst-case memory latency: acquiring a lock means writing to the lock structure, so the lock's cache line keeps being invalidated, and the next acquisition requires a remote access.
One problem with RCU is that it may need to hold on to a certain amount of memory that is waiting to be freed.
RCU is also covered in perfbook Chapter 9 (Deferred Processing) and in Section 6.3 of Michael Scott's Shared-Memory Synchronization.
https://lwn.net/Articles/619355/
academics, unlike large open-source projects, are unconstrained by the need to develop production-quality code. This allows them to try crazier and more experimental ideas.
He says one of his unusual hobbies is occasionally reading academic papers, paying particular attention to RCU work. Part of the reason academia generates so many ideas, he suggests, is that academics do not have to worry much about production code; they carry no baggage, and a prerequisite for discovering something new is a bit of insanity.
He lists a large body of academic work, much of it covered on LWN. Masstree in particular is a must-read.
Shanghai Jiao Tong University built an RCU variant to optimize reader-writer locks; Paul paid close attention to this paper and raised five questions about it. Worth a look!
This is important because traditional reader-writer lock implementations, such as the Linux kernel's rwlock_t, do not scale well. The reason for this is that all readers must atomically update the single memory location that represents the rwlock_t, both when entering and when leaving their read-side critical sections. The cache line containing this memory location will shuttle among all of these CPUs. Each access to this memory location will therefore result in a cache miss, which will in turn result in the accessing CPU stalling for hundreds or even thousands of clock cycles.
He offers very valuable advice on how to discover new problems and ideas!
https://lwn.net/Articles/667593/
Q: How does RCU guarantee a consistent snapshot of the data structure?
Answer: It turns out that RCU does in fact guarantee a consistent snapshot—at zero cost—in some important special cases. Perhaps the most common case is a data structure where updaters are restricted to adding new elements or removing old ones, in other words, where in-place updates are prohibited. If each reader only ever looks up a single element, then readers will automatically see a consistent snapshot of that single element—even if a given reader looks up a different element each time it executes. There are many other use cases where RCU provides consistent snapshots for free, and quite a few of them may be found in the Linux kernel. However, it is also the case that consistency guarantees are overrated. After all, the finite speed of light sharply limits the degree of consistency that a given data structure can usefully maintain with its corresponding external real-world state. Given that the whole point of many data structures is to track external state, internal consistency all too often becomes nothing more than an expensive fiction.
http://www2.rdrop.com/~paulmck/RCU/lsfmm-rcu.2022.05.02a-full.pdf
Uses charts to show how heavily RCU primitives are used in the kernel, also comparing them against locking primitives. Usage has kept growing over the past years, though not to the point of overtaking locking; after all, the right approach is to use the right tool for the job.
https://www.youtube.com/watch?v=rxQ5K9lo034 CppCon 2017: Fedor Pikus “Read, Copy, Update, then what? RCU for non-kernel programmers”
If you want to really stress-test your code, the more hardware you can throw at it, the easier it is to see what you're doing poorly.
Lock-free techniques suit general workloads with a mix of reads and writes and no particular read/write ratio, but in read-mostly scenarios lock-free is not a good tool.
He ran a performance comparison: with a pthread rwlock, even when only reads occur and no writes, reads carry a large overhead, because the lock is constantly working to keep a writer from entering and causing a data race.
memory reclamation in RCU is a kind of user-driven GC
Each update is assigned an integer generation/epoch; every read runs under the then-current generation, and once all reads of a generation are done, its old data can be freed or placed on that generation's garbage queue.
When updating a global ref_count[generation] is too costly, a per-reader ref_count[] can be created, making sure that refcount[reader_id][generation] is padded to a cache line.
What if a reader holds a handle to some generation indefinitely? If the writer force-reclaims, the reader may end up in an undefined state, so the reader should be able to detect the inconsistency, e.g. check whether its handle has expired and, if so, drop everything in that critical section and redo it. If writes come too fast for the reclaimer thread to keep up, updates may need to be throttled; if that is unacceptable, sequential consistency can be given up in places, e.g. by using a probabilistic queue.
The most recent talk this year: http://www2.rdrop.com/users/paulmck/RCU/Validation.2021.03.24a.pdf
No video found.
perfbook describes several applications: red-black trees; resizable hash tables; the Linux kernel's VM subsystem, to reduce read contention; and the RCU-protected Masstree, which combines ideas from B+ trees and tries. The paper claims performance advantages over MongoDB, VoltDB, and Redis; Samuel Madden's group applied Masstree in Silo, reaching 700K txn/s.
RCU's ease of use is not free: it has an unbounded memory footprint. The Linux kernel combines RCU with reference counting to distinguish short-lived references from longer-lived ones.
http://www.rdrop.com/users/paulmck/RCU/urcu-main-accepted.2011.08.30a.pdf
He is the translator of Paul's book.
https://twitter.com/baiwfg2/status/1517106663802961925
https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries/
global agreement is expensive. Light speed is finite
Because the speed of light is finite, communication between CPUs always has latency. He then lists a series of further factors beyond signal propagation: cache coherence, electrical issues (cores are much faster than the other components), heat dissipation, power distribution.
Read-mostly scenario where stale or inconsistent data is OK: network routing algorithms. Read-mostly scenario that needs consistent data: the Linux dentry cache.
Frequently used with linked data structures.
In the set() method he demonstrates, an atomic exchange can be used to perform the update:
void set(int cur_a, int cur_b)
{
	struct myconfig *mcp = kmalloc(sizeof(*mcp), GFP_KERNEL);

	/* Fully initialize the new version before publishing it. */
	mcp->a = cur_a;
	mcp->b = cur_b;
	/* Atomically publish the new config and take back the old one. */
	mcp = xchg(&curconfig, mcp);
	/* Wait for pre-existing readers, then free the old version. */
	synchronize_rcu();
	kfree(mcp);
}
RCU combines two kinds of synchronization: temporal synchronization and spatial synchronization.
His final performance comparison found that even on a single CPU, RCU beats reader-writer locking; the shorter the critical section, the more it favors RCU.
The later material on phased state change was hard to follow; it involves a fair amount of CPU and compiler reordering.
Lists the kernel modules using RCU as of v5.15, and covers how to debug RCU code inside the kernel.
I might have a warm spot in my heart for RCU, but I have an even warmer spot in my heart for using the right tool for the job
However much he champions RCU, you still have to check whether the scenario is a good fit!
in order to make good use of RCU you have to change the way you think about your problem. That's the hard part
People are building user-space RCU, e.g. QEMU (https://lists.gnu.org/archive/html/qemu-devel/2013-08/msg02055.html); folly has one, and there are C++ standardization proposals as well.
https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases/
The slide at 6:30 describes the four possible orderings between read lock / read unlock and synchronize_rcu (good slides). At 10:55 the cost of global agreement is shown, compared against locking.
https://www.kernel.org/doc/Documentation/RCU/rcu.txt
How does the updater know the readers are done? RCU readers are not allowed to block, switch to user mode, or enter the idle loop; if they do any of these, the CPU has left its read-side critical section.
http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html
http://liburcu.org/
https://github.com/urcu/userspace-rcu
https://lwn.net/Articles/262464/
Some compiler optimizations are quite extreme.
The kernel's list_for_each_entry_rcu and hlist_for_each_entry_rcu amount to the subscribe side.
RCU read-side critical sections can be nested, but must not block (SRCU allows sleeping).
Once a given CPU has performed a context switch, its read-side critical section is over and synchronize_rcu can return. This only holds in a non-preemptible kernel; a realtime kernel uses reference counters to make the determination instead.
Gives linked-list delete and replace examples, with good figures.
Quick Quiz 2: What prevents the list_for_each_entry_rcu() from getting a segfault if it happens to execute at exactly the same time as the list_add_rcu()?
Answer: On all systems running Linux, loads from and stores to pointers are atomic, that is, if a store to a pointer occurs at the same time as a load from that same pointer, the load will return either the initial value or the value stored, never some bitwise mashup of the two. In addition, the list_for_each_entry_rcu() always proceeds forward through the list, never looking back. Therefore, the list_for_each_entry_rcu() will either see the element being added by list_add_rcu(), or it will not, but either way, it will see a valid well-formed list.
https://lwn.net/Articles/263130/ , 2007
RCU's strengths are performance, deadlock immunity, and realtime latency.
Lengthening the critical-section duration works against RCU. An RCU reader can conveniently be upgraded to an RCU updater without the deadlock risk that arises in reader-writer locking. It also solves many priority-inversion problems. Compared with reader-writer locking, readers see an updater's new values sooner.
In contrast, with RCU, any reader that begins after the updater completes is guaranteed to see new values, and readers that end after the updater begins might or might not see new values, depending on timing.
A piece of kernel code compares reader-writer locking with RCU.
RCU is a restricted form of reference counting: the critical section must not block or sleep.
However, maintaining a single global reference counter for a large variety of data structures typically results in bouncing the cache line containing the reference count. Such cache-line bouncing can severely degrade performance.
At its core, RCU is nothing more nor less than an API that provides:
- a publish-subscribe mechanism for adding new data,
- a way of waiting for pre-existing RCU readers to finish, and
- a discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.

That said, it is possible to build higher-level constructs on top of RCU, including the reader-writer-locking, reference-counting, and existence-guarantee constructs listed in the earlier sections. Furthermore, I have no doubt that the Linux community will continue to find interesting new uses for RCU, as well as for any of a number of other synchronization primitives.
https://lwn.net/Articles/264090/ (RCU was merged into the Linux kernel in 2002)
if you are primarily interested in understanding how RCU is used in the Linux kernel, "RCU Classic" would be the place to start, as it is used most frequently.
https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf
Introduced the concept of a relativistic hash table, which the kernel did not yet have. Later, in v3.17 (2014), Thomas Graf implemented a relativistic hash table in the networking subsystem so that resizing does not disturb readers (slides: https://lwn.net/Articles/612021/). The right size for a hash table is hard to pin down, so resizes happen often, and tuning them pays off.
http://people.csail.mit.edu/nickolai/papers/clements-bonsai.pdf
They applied this to the kernel VM's mmap_sem; tests showed a very large improvement.
Initially, filesystem data structures were not safe for RCU readers; that restriction was later lifted (https://lwn.net/Articles/419811/). In [3] Paul says the community welcomes the use of the Linux kernel as a testbed for cutting-edge research.
Shanghai Jiao Tong University used an RCU variant to optimize the kernel's reader-writer locks. One of the authors is here: https://www.zhihu.com/question/31325454/answer/51920170
https://www.usenix.org/conference/atc14/technical-sessions/presentation/liu
It drew considerable attention from Paul, who raised five questions about it in [3].
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0561r4.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0279r1.pdf
P0461R1: Proposed RCU C++ API, 2017 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0461r1.pdf
For detailed comments, see include/linux/rcupdate.h
https://github.com/qemu/qemu/blob/master/docs/devel/rcu.txt
https://github.com/lttng/lttng-ust/tree/master/include/lttng/urcu
https://airekans.github.io/c/2016/05/10/dive-into-liburcu @airekans
Seen on Mathieu's Twitter: https://lwn.net/Articles/573424/
https://lwn.net/Articles/573431/
There are two papers on lock-free RCU-protected resizable hash tables:
https://readpaper.com/pdf-annotate/note?noteId=684997739566301184&pdfId=4546223761094500353
Makes programs that make heavy concurrent use of the VM scale well.
Their tests show a performance improvement over the stock kernel.
RCU requires updating one pointer at a time, which keeps each update atomic, but complex data structures do not work that way; the paper's Bonsai tree is an example of applying RCU to a complex data structure.
The complete Bonsai implementation is only 180 lines?? Source code: https://pdos.csail.mit.edu/archive/mosbench/
https://readpaper.com/pdf-annotate/note?noteId=685011516958060544&pdfId=4546035191792689153
Makes mmap, munmap, and page faults scale well over non-overlapping memory regions. This was not done on Linux, whose VM system is too complex, but on the xv6 kernel.
RadixVM lets mmap and munmap run in parallel.
There is the concept of a TLB shootdown: when a thread removes a region, the OS must send a shootdown interrupt to the other processors. To be conservative, the OS interrupts every core, but only the cores sharing that region actually need it. RadixVM achieves this.
https://lwn.net/Articles/253651/
In contrast to SRCU, preemptible RCU only permits blocking within primitives that are both subject to priority inheritance and non-blocking in a non-CONFIG_PREEMPT kernel
In the -rt patchset, realtime applications need read-side critical sections to be preemptible.
A simple counter-based RCU can leave synchronize_rcu never returning, starving the updater.
The starvation-free counter-based RCU flips rcu_idx, which prevents updater starvation.
this implementation serializes grace periods, preventing grace-period sharing.
What does this sentence mean?
The scalable counter-based RCU has a highly scalable read side, but updates are expensive; it has to iterate over the threads; and grace periods cannot be shared.
https://arxiv.org/abs/1610.03052
Verifies the Linux kernel's RCU implementation.
🍅 [1] What is RCU, by Paul McKenney, author of perfbook
http://www.rdrop.com/~paulmck/RCU/whatisRCU.html
In the Linux kernel, a separate thread performs the updater's reclamation phase (when handling the directory-entry cache).
synchronize_rcu() only waits for read-side critical sections that began before it was called, not for later readers; and it does not necessarily return right after the last such critical section completes: there is scheduling latency, and batching adds further delay. RCU-protected data may only be accessed between read lock and read unlock.
The kernel uses the RCU (read-copy-update) mechanism to synchronize updaters and readers without locks.
Updaters still need to synchronize among themselves (perhaps with locks or semaphores); the RCU primitives only synchronize updaters with readers. If you do not want the updater to block in synchronize_rcu(), you can switch to call_rcu(): an asynchronous variant whose registered callback runs once all readers have left their critical sections (this is the update's second phase, the reclaimer step). Two toy implementations are listed: one based on locking, the other classic RCU, which relies on run_on(cpu); once the readers are done, run_on naturally returns? The most common use is concurrent reading and writing, analogous to reader-writer locking.