baiwfg2 / awesome-readings

Notes from reading various articles and papers

RCU #1

Open baiwfg2 opened 3 years ago

baiwfg2 commented 3 years ago

🍅 [1] What is RCU, by Paul McKenney (author of perfbook)

http://www.rdrop.com/~paulmck/RCU/whatisRCU.html

RCU-based updaters typically take advantage of the fact that writes to single aligned pointers are atomic on modern CPUs, allowing atomic insertion, removal, and replacement of data items in a linked structure without disrupting readers

In the Linux kernel a separate thread performs the updater's reclamation phase (for example when handling the directory-entry cache).

synchronize_rcu() only waits for read-side critical sections that began before it was called; it does not have to wait for readers that start later. Nor does it necessarily return right after the last such read-side CS finishes: there is scheduling latency, and batching of grace periods adds further delay.

Note that the value returned by rcu_dereference() is valid only within the enclosing RCU read-side critical section

That is, the value may only be used between the read lock and the read unlock.
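A minimal reader-side sketch of that rule (kernel-style C; struct foo and the global pointer gp are hypothetical, and gp is assumed never NULL):

struct foo {
    int a;
};
struct foo __rcu *gp;              /* RCU-protected global pointer */

int read_foo_a(void)
{
    int ret;

    rcu_read_lock();               /* enter read-side critical section */
    ret = rcu_dereference(gp)->a;  /* the pointer is usable only until unlock */
    rcu_read_unlock();             /* leave read-side critical section */
    return ret;                    /* only the copied value escapes the CS */
}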

Efficient implementations of the RCU infrastructure make heavy use of batching in order to amortize their overhead over many uses of the corresponding APIs.

The kernel uses the RCU (read-copy-update) mechanism to synchronize updaters and readers without using locks.

More typical uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt

Updaters still have to synchronize among themselves (with locks, semaphores, and so on); the primitives themselves only synchronize updaters with readers. If the updater should not be blocked by calling synchronize_rcu(), it can switch to the asynchronous call_rcu(): once all readers have left their critical sections, the registered callback is invoked (this is the second phase of the update, the reclaimer). A sketch follows.
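A sketch of the call_rcu() pattern, loosely following the article's update example (struct foo, gp, foo_mutex and foo_update_a() are illustrative):

struct foo {
    int a;
    struct rcu_head rcu;                     /* needed by call_rcu() */
};
struct foo __rcu *gp;
static DEFINE_SPINLOCK(foo_mutex);           /* updater-to-updater synchronization */

static void foo_reclaim(struct rcu_head *rp) /* second phase: the reclaimer */
{
    kfree(container_of(rp, struct foo, rcu));
}

void foo_update_a(int new_a)
{
    struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
    struct foo *old_fp;

    spin_lock(&foo_mutex);
    old_fp = rcu_dereference_protected(gp, lockdep_is_held(&foo_mutex));
    *new_fp = *old_fp;
    new_fp->a = new_a;
    rcu_assign_pointer(gp, new_fp);          /* publish the new version */
    spin_unlock(&foo_mutex);
    call_rcu(&old_fp->rcu, foo_reclaim);     /* old_fp freed after a grace period */
}

Unlike synchronize_rcu(), the call_rcu() path never blocks the updater; the callback runs once all readers that might still hold old_fp have left their critical sections.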

Two toy implementations are listed: one based on locking, the other a classic RCU built on run_on(cpu): once the readers on a CPU are done (the CPU has context-switched), run_on() naturally returns. See the sketch below.
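Roughly the article's toy: run_on(cpu) migrates the caller onto each CPU in turn, and in a non-preemptive kernel that migration cannot complete until the target CPU has context-switched, i.e. has left any read-side critical section it was running:

void synchronize_rcu(void)
{
    int cpu;

    for_each_possible_cpu(cpu)
        run_on(cpu);               /* returns only after cpu context-switches */
}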

The most common use is concurrent reading and writing, analogous to reader-writer locking.

baiwfg2 commented 3 years ago

RCU papers and articles are collected here: http://www2.rdrop.com/users/paulmck/RCU/

The link above now redirects to: https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit

2001, paper

http://www2.rdrop.com/~paulmck/RCU/rclock_OLS.2001.05.01c.pdf

Routing tables are given as an example of a read-mostly structure.

Gives kernel module unloading as an example.

A quiescent state is a point in the code at which it is guaranteed that all previous operations have completed; the end of the grace period is detected via quiescent states.

https://www.usenix.org/legacy/publications/library/proceedings/usenix03/tech/freenix03/full_papers/arcangeli/arcangeli_html/node5.html

Without special action in the deletion code, the search code would be prone to races with those deletions, described in Section 2.1. To handle such race conditions, the update side performs the update in two phases as follows: (1) remove permanent-variable pointers to the item being deleted, and (2) after a grace period has elapsed, free up the item's memory.
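A sketch of that two-phase update for a hypothetical list element p whose updaters are serialized by mylist_lock:

/* Phase 1: unlink, so no new reader can find the element. */
spin_lock(&mylist_lock);
list_del_rcu(&p->list);            /* readers already traversing may still see p */
spin_unlock(&mylist_lock);

/* Phase 2: wait out the grace period, then reclaim. */
synchronize_rcu();                 /* all pre-existing read-side CS have ended */
kfree(p);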

The grace period is not a fixed time duration, but is instead inferred by checking for per-CPU quiescent states, such as context switches in non-preemptive environments. Since kernel threads are prohibited from holding locks across a context switch, they are also prohibited from holding pointers to data structures protected by those locks across context switches; after all, the entire data structure could well be deleted by some other CPU at any time this CPU does not hold the lock.

A trivial implementation of RCU in a non-preemptive kernel could simply declare the grace period over once it observed each CPU undergoing a context switch. Once the grace period completes, no CPU references the deleted item, and no CPU can gain such a reference. It is therefore safe to free up the deleted item.

With this approach, searches already in progress when the first phase executes might (or might not) see the item being deleted. However, searches that start after the first phase completes are guaranteed to never reference this item. Efficient mechanisms for determining the duration of the grace period are key to RCU, and are described fully elsewhere [McK02a]. These mechanisms track "naturally occurring" quiescent states, which removes the need for adding expensive context switches. In addition, these mechanisms take advantage of the fact that a single grace period can satisfy multiple concurrent updates, amortizing the cost of detecting a grace period over the updates.

Conditions under which RCU is a good fit:

  1. The update can be split into two phases.
  2. Operating on stale data is acceptable.
  3. Destructive updates are rare.

These conditions are common inside an OS.

Traditional locking is limited by worst-case memory latency: acquiring a lock means writing to the lock structure, so its cache line is repeatedly invalidated and the next acquisition needs a remote access.

One issue with RCU is that a certain amount of memory waiting to be freed may have to be kept around.

In addition, perfbook Chapter 9 (Deferred Processing) and Section 6.3 of Michael Scott's Shared-Memory Synchronization both discuss RCU.

baiwfg2 commented 3 years ago

[3] The latest RCU developments, Paul, 2014-11

https://lwn.net/Articles/619355/

academics, unlike large open-source projects, are unconstrained by the need to develop production-quality code. This allows them to try crazier and more experimental ideas.

He says one of his unusual hobbies is occasionally reading academic papers, with particular attention to anything RCU-related. Part of the reason academia produces so many ideas, he says, is that academics are unconstrained by production code; they carry no such baggage, and a prerequisite for discovering new things is a bit of insanity.

He lists a large amount of academic work, much of it published on LWN. Masstree is one that is definitely worth a look.

Shanghai Jiao Tong University built an RCU variant to optimize reader-writer locks; Paul pays close attention to this paper and raised five questions about it. Read it!

This is important because traditional reader-writer lock implementations, such as the Linux kernel's rwlock_t, do not scale well. The reason for this is that all readers must atomically update the single memory location that represents the rwlock_t, both when entering and when leaving their read-side critical sections. The cache line containing this memory location will shuttle among all of these CPUs. Each access to this memory location will therefore result in a cache miss, which will in turn result in the accessing CPU stalling for hundreds or even thousands of clock cycles.

He also offers very valuable advice on how to find new problems and ideas.

💯 [4] RCU-related developments over the past year, Paul, 2015-12

https://lwn.net/Articles/667593/

Question: how does RCU guarantee a consistent snapshot of the data structure?

Answer: It turns out that RCU does in fact guarantee a consistent snapshot—at zero cost—in some important special cases. Perhaps the most common case is a data structure where updaters are restricted to adding new elements or removing old ones, in other words, where in-place updates are prohibited. If each reader only ever looks up a single element, then readers will automatically see a consistent snapshot of that single element—even if a given reader looks up a different element each time it executes. There are many other use cases where RCU provides consistent snapshots for free, and quite a few of them may be found in the Linux kernel. However, it is also the case that consistency guarantees are overrated. After all, the finite speed of light sharply limits the degree of consistency that a given data structure can usefully maintain with its corresponding external real-world state. Given that the whole point of many data structures is to track external state, internal consistency all too often becomes nothing more than an expensive fiction.

recent linux-kernel RCU changes, 2022

http://www2.rdrop.com/~paulmck/RCU/lsfmm-rcu.2022.05.02a-full.pdf

baiwfg2 commented 3 years ago

[5] http://www2.rdrop.com/users/paulmck/RCU/linuxusage.html

Charts showing how heavily the RCU primitives are used in the kernel, also compared against the locking primitives. Usage has kept growing over the past years, though not to the point of overtaking locking; after all, the right thing is to use the right tool for the job.

baiwfg2 commented 3 years ago

🐒 [6] Fedor Pikus's talk on RCU

https://www.youtube.com/watch?v=rxQ5K9lo034 CppCon 2017: Fedor Pikus “Read, Copy, Update, then what? RCU for non-kernel programmers”

if you want to really stress-test your code, the more hardware you can throw at it, the easier it is to see what you're doing poorly

Lock-free is a good fit for general workloads with a mix of reads and writes and no particular skew between them, but for read-mostly workloads lock-free is not the right tool.

He shows a performance comparison: with a pthread rwlock, even a read-only workload with no writes at all pays a large overhead on the read side, because the lock keeps working to prevent a data race with a writer that might arrive.

memory reclamation in RCU is a kind of user-driven GC

He then presents a user-space RCU algorithm.

Each update is assigned an integer generation/epoch; every read runs against its current generation, and once a generation is no longer read it can be freed or placed on that generation's garbage queue.

When updating a global ref_count[generation] is too costly, a ref_count[] array can be created per reader_id, with ref_count[reader_id][generation] padded to a cache line (sketch below).
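A user-space sketch of that layout (C11 atomics; MAX_READERS, MAX_GENERATIONS and the function names are made up):

#include <stdatomic.h>
#include <stdalign.h>

#define MAX_READERS      128
#define MAX_GENERATIONS  4           /* generation slots, recycled modulo this */

struct padded_count {
    alignas(64) _Atomic long count;  /* one cache line per slot: no false sharing */
};

static struct padded_count ref_count[MAX_READERS][MAX_GENERATIONS];

static void reader_enter(int reader_id, unsigned long gen)
{
    atomic_fetch_add(&ref_count[reader_id][gen % MAX_GENERATIONS].count, 1);
}

static void reader_exit(int reader_id, unsigned long gen)
{
    atomic_fetch_sub(&ref_count[reader_id][gen % MAX_GENERATIONS].count, 1);
}

Each reader writes only its own slots; the reclaimer sums the column for a generation and frees that generation's garbage queue once the sum reaches zero.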

Performance comparison of the various synchronization approaches (figure in the talk).

Risks

What if a reader holds a handle to some generation forever? If the writer force-reclaims, the reader can end up in an undefined state; ideally the reader can detect the inconsistency, e.g. by checking whether its handle has expired, and if so drop everything done in that critical section and redo it. If writes arrive faster than the reclaimer thread can keep up with, updates may need to be throttled. If that is unacceptable, sequential consistency can be relaxed in places, e.g. by switching to a probabilistic queue.

Applicable scenarios (figure in the talk).

baiwfg2 commented 3 years ago

The most recent talk (2021 slides): http://www2.rdrop.com/users/paulmck/RCU/Validation.2021.03.24a.pdf

No video found.

🍅 [7] RCU use cases in perfbook

perfbook covers several use cases: red-black trees, resizable hash tables, and the Linux kernel's VM subsystem (to reduce read-side contention). There is also the RCU-protected Masstree, which combines ideas from B+ trees and tries; the paper claims performance advantages over MongoDB, VoltDB, and Redis. Samuel Madden's group applied Masstree in Silo, reaching 700K transactions/s.

RCU's ease of use is not free: it has an unbounded memory footprint. The Linux kernel combines RCU with reference counting to distinguish short-lived references from longer-lived ones.

baiwfg2 commented 2 years ago

📙 [8] User-Level Implementations of Read-Copy Update (RCU in user space), 2012 paper

http://www.rdrop.com/users/paulmck/RCU/urcu-main-accepted.2011.08.30a.pdf

baiwfg2 commented 2 years ago

[9] Paul's PhD dissertation

http://www2.rdrop.com/users/paulmck/RCU/RCUdissertation.2004.07.14e1.pdf

baiwfg2 commented 2 years ago

[10] Xie Baoyou of ZTE wrote a series of RCU articles

He is the translator of Paul's book.

https://mp.weixin.qq.com/s?__biz=Mzg2OTc0ODAzMw==&mid=2247501880&idx=1&sn=8202d240f75fa4015e85aeeca7e15738&source=41#wechat_redirect

baiwfg2 commented 2 years ago

🚋 [11] I spotted an RCU-style usage in MySQL and asked the experts on Twitter

https://twitter.com/baiwfg2/status/1517106663802961925

👧 [12] Recommended introductory webinar

https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries/

global agreement is expensive. Light speed is finite

Because the speed of light is finite, communication between CPUs always has latency. He then lists a series of other limiting factors: cache coherence, electrical issues (cores are much faster than the other components), heat dissipation, and power distribution.

Read-mostly, where stale or inconsistent data are OK: network routing.
Read-mostly, where consistent data is needed: the Linux dentry cache.

RCU is used most often with linked data structures.

In the set() function he demonstrates, the update can be done with an atomic exchange:

void set(int cur_a, int cur_b)
{
    struct myconfig *mcp = kmalloc(sizeof(*mcp), GFP_KERNEL);

    mcp->a = cur_a;
    mcp->b = cur_b;
    mcp = xchg(&curconfig, mcp);   /* atomically publish, get the old version back */
    synchronize_rcu();             /* wait for pre-existing readers to finish */
    kfree(mcp);                    /* free the old version */
}
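A reader-side counterpart for the same hypothetical struct myconfig / curconfig (not shown in the talk; just a sketch):

void get(int *cur_a, int *cur_b)
{
    struct myconfig *mcp;

    rcu_read_lock();
    mcp = rcu_dereference(curconfig);
    *cur_a = mcp->a;               /* copy out while the CS keeps mcp alive */
    *cur_b = mcp->b;
    rcu_read_unlock();
}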

RCU combines two kinds of synchronization: temporal synchronization and spatial synchronization.

His final performance comparison shows that even on a single CPU, RCU beats a reader-writer lock; the shorter the critical section, the bigger RCU's advantage.

I did not fully follow the later part on phased state change; it involves a fair amount of CPU and compiler reordering.

He lists the modules using RCU in the v5.15 kernel, and covers how to debug RCU code in the kernel.

I might have a warm spot in my heart for RCU, but I have an even warmer spot in my heart for using the right tool for the job

However much he likes RCU, you still have to check whether the scenario actually fits.

in order to make good use of RCU you have to change the way you think about your problem. That's the hard part

People are working on user-space RCU: QEMU for instance (https://lists.gnu.org/archive/html/qemu-devel/2013-08/msg02055.html), folly has an implementation, and there are C++ standardization proposals.

use cases

https://linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases/

At 6:30 a slide describes the four possible orderings between read lock / read unlock and synchronize_rcu() (good slides). At 10:55 a slide shows the cost of global agreement, compared with locking.

baiwfg2 commented 2 years ago

🌹 [13] kernel doc

https://www.kernel.org/doc/Documentation/RCU/rcu.txt

How does the updater know the readers are done? RCU readers are not allowed to block, switch to user mode, or enter the idle loop; if a CPU does any of these, it has left its read-side critical section.

[13.1] Where RCU is used in Linux

http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html

[13.2] userspace RCU

http://liburcu.org/
https://github.com/urcu/userspace-rcu

[14] https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html

💯 [15] What is RCU, Fundamentally? (slide material)

https://lwn.net/Articles/262464/
Some compiler optimizations are quite aggressive.

In the kernel, list_for_each_entry_rcu() and hlist_for_each_entry_rcu() play the subscribe role.
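A typical subscribe-side loop, sketched for a hypothetical RCU-protected list of struct foo headed by foo_head and keyed by ->key:

struct foo *p;
int val = -1;

rcu_read_lock();
list_for_each_entry_rcu(p, &foo_head, list) {
    if (p->key == key) {
        val = p->val;              /* copy out; p must not be used after unlock */
        break;
    }
}
rcu_read_unlock();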

RCU read-side critical sections may be nested, but must not block (SRCU allows sleeping). Once a given CPU has performed a context switch, its read-side CS has ended and synchronize_rcu() may return. This only applies to a non-preemptible kernel; a realtime kernel uses reference counters to make the determination instead.

Gives linked-list delete and replace examples with excellent diagrams.
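The replacement example goes roughly like this (p is the element being replaced; the fields and values are illustrative):

q = kmalloc(sizeof(*p), GFP_KERNEL);
*q = *p;                               /* copy the old element */
q->b = 2;                              /* update the private copy */
q->c = 3;
list_replace_rcu(&p->list, &q->list);  /* publish the new version in one pointer store */
synchronize_rcu();                     /* wait for readers that may still see p */
kfree(p);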

Quick Quiz 2: What prevents the list_for_each_entry_rcu() from getting a segfault if it happens to execute at exactly the same time as the list_add_rcu()?

Answer: On all systems running Linux, loads from and stores to pointers are atomic, that is, if a store to a pointer occurs at the same time as a load from that same pointer, the load will return either the initial value or the value stored, never some bitwise mashup of the two. In addition, the list_for_each_entry_rcu() always proceeds forward through the list, never looking back. Therefore, the list_for_each_entry_rcu() will either see the element being added by list_add_rcu(), or it will not, but either way, it will see a valid well-formed list.

🔥 [16] What is RCU? Part 2: Usage. Looks at RCU from several angles (performance graphs, slides)

https://lwn.net/Articles/263130/ , 2007

RCU's advantages lie in performance, deadlock immunity, and realtime latency.

Longer critical-section durations work against RCU too. An RCU reader can conveniently be upgraded to an RCU updater, without the deadlock that the same upgrade risks under reader-writer locking (a sketch follows the quote below). RCU also mitigates many priority-inversion problems. Compared with reader-writer locking, readers see an updater's new values earlier:

In contrast, with RCU, any reader that begins after the updater completes is guaranteed to see new values, and readers that end after the updater begins might or might not see new values, depending on timing.
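A sketch of the reader-to-updater upgrade mentioned above (gp, update_lock, and needs_update() are hypothetical). With rwlock_t, read_lock() followed by write_lock() on the same lock self-deadlocks; with RCU, holding the read side never blocks the update side:

rcu_read_lock();
p = rcu_dereference(gp);
if (needs_update(p)) {
    spin_lock(&update_lock);       /* update-side lock; readers never block it */
    /* ... re-check p and perform the update ... */
    spin_unlock(&update_lock);
}
rcu_read_unlock();

Any deferred free done in such an update must use call_rcu() rather than synchronize_rcu(), since synchronize_rcu() cannot be called from within a read-side critical section.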

A piece of kernel code compares reader-writer locking with RCU side by side.

RCU can serve as a restricted form of reference counting, the restriction being that the critical section may not block or sleep.

However, maintaining a single global reference counter for a large variety of data structures typically results in bouncing the cache line containing the reference count. Such cache-line bouncing can severely degrade performance.

Conclusion

At its core, RCU is nothing more nor less than an API that provides:

  • a publish-subscribe mechanism for adding new data,
  • a way of waiting for pre-existing RCU readers to finish, and
  • a discipline of maintaining multiple versions to permit change without harming or unduly delaying concurrent RCU readers.

That said, it is possible to build higher-level constructs on top of RCU, including the reader-writer-locking, reference-counting, and existence-guarantee constructs listed in the earlier sections. Furthermore, I have no doubt that the Linux community will continue to find interesting new uses for RCU, as well as for any of a number of other synchronization primitives.

🐱 [17] RCU part 3: the RCU API. Introduces the RCU flavors and their APIs in the Linux kernel.

https://lwn.net/Articles/264090/ (RCU entered the Linux kernel in 2002)

if you are primarily interested in understanding how RCU is used in the Linux kernel, "RCU Classic" would be the place to start, as it is used most frequently.

baiwfg2 commented 2 years ago

[18] Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming , 2010, Paul

https://www.usenix.org/legacy/event/atc11/tech/final_files/Triplett.pdf

Proposes the relativistic hash table, which the kernel did not have at the time. It was later implemented by Thomas Graf in the networking subsystem in version 3.17.

[18.1] Kernel 3.17 gained relativistic hash tables, so resizing no longer disturbs readers (slide material)

https://lwn.net/Articles/612021/, 2014. Hash-table sizes are hard to get right up front, so resizing happens often and tuning it pays off.

[18.2] The kernel API for relativistic hash tables

https://lwn.net/Articles/612100/ , 2014

baiwfg2 commented 2 years ago

[19] Scalable Address Spaces Using RCU Balanced Trees

http://people.csail.mit.edu/nickolai/papers/clements-bonsai.pdf

They apply this to the kernel VM's mmap_sem; their tests show a very large improvement.

[20] RCU applied to the fs dentry cache, Peter Zijlstra

At first, filesystem data structures were not safe for RCU readers; that restriction was later removed (https://lwn.net/Articles/419811/). In [3] Paul says that treating the Linux kernel as a testbed for leading-edge research is very welcome.

baiwfg2 commented 2 years ago

[21] Scalable Read-mostly Synchronization Using Passive Reader-Writer Locks , 2014

Shanghai Jiao Tong University used an RCU variant to optimize the kernel's reader-writer locks. One of the authors answers here: https://www.zhihu.com/question/31325454/answer/51920170

https://www.usenix.org/conference/atc14/technical-sessions/presentation/liu

It drew a great deal of attention from Paul, who raised five questions about it in [3].

baiwfg2 commented 2 years ago

[22] LWN's index of RCU articles

https://lwn.net/Kernel/Index/#Read-copy-update

baiwfg2 commented 2 years ago

[23] RCU API, 2019

https://lwn.net/Articles/777036/

baiwfg2 commented 2 years ago

[24] RCU high-level API proposals, google 2018

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0561r4.html

baiwfg2 commented 2 years ago

[25] Read-Copy Update (RCU) for C++, Paul, 2016

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0279r1.pdf

P0461R1: Proposed RCU C++ API, 2017 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0461r1.pdf

baiwfg2 commented 2 years ago

[26] linux kernel

include/linux/rcupdate.h has detailed comments worth reading.

QEMU

https://github.com/qemu/qemu/blob/master/docs/devel/rcu.txt

lttng

https://github.com/lttng/lttng-ust/tree/master/include/lttng/urcu

baiwfg2 commented 2 years ago

[27] A well-written analysis and benchmark of urcu by a blogger

https://airekans.github.io/c/2016/05/10/dive-into-liburcu @airekans

baiwfg2 commented 2 years ago

[28] user-space RCU

Seen on Mathieu's Twitter: https://lwn.net/Articles/573424/

[29] URCU-protected hash tables

https://lwn.net/Articles/573431/

There are two papers on the lock-free, RCU-protected resizable hash table.

Scalable address spaces using RCU balanced trees, 2012

https://readpaper.com/pdf-annotate/note?noteId=684997739566301184&pdfId=4546223761094500353

Gives very good scalability to programs that make heavy concurrent use of the VM.

  1. Let soft page faults run concurrently with mmap and munmap (mmap and munmap themselves still have to be serialized).
  2. On a soft page fault, avoid modifying shared cache lines, so that processors do not contend after invalidations.

Their tests show a performance improvement over the stock kernel.

related work

RCU wants an update to be a single pointer store, which is what makes the update atomic, but complex data structures do not naturally fit that; the paper's Bonsai tree is an example of applying RCU to a complex data structure.

The full Bonsai implementation is only 180 lines?? Source code: https://pdos.csail.mit.edu/archive/mosbench/

RadixVM: scalable address spaces for multithreaded applications , 2013

https://readpaper.com/pdf-annotate/note?noteId=685011516958060544&pdfId=4546035191792689153

Makes mmap, munmap, and page faults scale well even for non-overlapping memory regions. The work was done not on Linux, whose VM system is too complex, but on the xv6 kernel.

related work

RadixVM lets mmap and munmap run in parallel. On TLB shootdown: when one thread removes a region, the OS must send a shootdown interrupt to the other processors. To be conservative, OSes send the interrupt to every core, but only the cores sharing that region actually need it; RadixVM manages to send it only to those.

baiwfg2 commented 2 years ago

[30] Hierarchical RCU

https://lwn.net/Articles/305782/

baiwfg2 commented 2 years ago

[31] preemptible RCU

https://lwn.net/Articles/253651/

In contrast to SRCU, preemptible RCU only permits blocking within primitives that are both subject to priority inheritance and non-blocking in a non-CONFIG_PREEMPT kernel

In the -rt patchset, realtime use requires that read-side critical sections be preemptible.

baiwfg2 commented 2 years ago

perfbook Appendix B: toy RCU implementations

The simple counter-based RCU can leave synchronize_rcu() never returning, starving the updater (sketch below).
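A rough sketch of that simple counter-based toy (perfbook-style primitives such as atomic_t and smp_mb(); not the exact appendix code). Every reader touches one global counter, and because new readers keep arriving, the counter may never drop to zero, so synchronize_rcu() can spin forever:

atomic_t rcu_refcnt = ATOMIC_INIT(0);

void rcu_read_lock(void)
{
    atomic_inc(&rcu_refcnt);       /* every reader writes the same counter */
    smp_mb();
}

void rcu_read_unlock(void)
{
    smp_mb();
    atomic_dec(&rcu_refcnt);
}

void synchronize_rcu(void)
{
    smp_mb();
    while (atomic_read(&rcu_refcnt) != 0)
        continue;                  /* never terminates if readers keep arriving */
    smp_mb();
}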

The starvation-free counter-based RCU flips rcu_idx, which prevents updater starvation.

this implementation serializes grace periods, preventing grace-period sharing.

What does this sentence mean?

The scalable counter-based RCU scales well on the read side, but updates are expensive; it must iterate over threads, and grace periods cannot be shared.

baiwfg2 commented 2 years ago

[32] Verification of the Tree-Based Hierarchical Read-Copy Update in the Linux Kernel

https://arxiv.org/abs/1610.03052

Verifies the correctness of the RCU implementation in the Linux kernel.

baiwfg2 commented 1 year ago

[33] Articles by a Chinese blogger

https://www.cnblogs.com/LoyenWang/p/12681494.html https://www.cnblogs.com/LoyenWang/p/12770878.html