1987 | Sender-Based Message Logging

PeterSH6 commented 3 years ago

PDF: https://www.cs.rice.edu/~dbj/pubs/ftcs87-sbml.pdf

PeterSH6 commented 3 years ago

这篇文章的主要思想是：将sender发送的message，log到volatile memory中，然后再asynchronously将这些message log存在stable storage中。这样critical-path中不会涉及磁盘访问，overhead比较小。但是该方法在同一时间段内，只能恢复一个Failed进程，如果出现多个Failed进程将无法恢复。

Assumption:

fail-stop processors
类似于TCP协议，网络不是guaranteed，但是可以允许丢包重传
网络能够支持broadcast（用于恢复的时候replaced processor通过broadcast来通知上游结点进行重传log）
有一个能被系统中所有进程访问的network-wide stable storage（不知道拿来干嘛的）
进程之间只通过message来通信（actor模型符合，不过其他很多模型也是这样）
Processes are deterministic: 给定输入和初始状态，输出一样（确定性的）

Motivation

传输message的过程使得sender和receiver都有一个message的备份，然而直接将这个message存在内存中不需要额外的开销，同时logging是为了恢复receiver。同时存在sender中免除了centralized log的问题。同时，receiver在接收了sender的message之后，receiver自己会维护一个RSN（receive sequence number），同时把这个RSN发送给sender作为一个ACK。（之后sender还需要发送一个ACK来确认收到这个RSN，这样每个sender中就知道发送的顺序，可以解决 #3 中的non-deterministic的问题）

PeterSH6 commented 3 years ago

Data Structures

实现sender-based message logging需要存储的一些数据结构

SSN (send sequence number). A sequence number of messages sent by the process. 用来减少恢复过程中重复的msg
RSN (receive sequence number). A sequence number of messages received by the process. Receiver中的RSN会自增，同时返回给sender用于标记刚刚发送的msg
A message log of messages sent by the process. 在sender中存储的信息包括：整个msg，receiver的id，该msg的SSN，receiver回复的RSN。当一个结点checkpoint之后，上游的sending process可以将这些信息删除
A table recording the highest SSN value received in a message sent by each process with which this process has communicated. This is used for duplicate message detection. 存一个当前process发送的最高SSN（不太懂）
A table maintaining the RSN value that was returned for each message received since the last checkpoint of this process. This table is indexed by the SSN of the message and may be purged when the process is checkpointed. 这个表中msg是以SSN为索引，包括RSN的信息。同时，除了5，1-4在checkpoint的时候均需要被保存。这样如果自己挂掉了的时候，可以restore这些存储的信息，恢复好之后，如果下游挂了，可以直接发送。但是如果上游挂了还没恢复好，下游就挂了，那么这个时候下游就无法恢复了。

PeterSH6 commented 3 years ago

Main Protocols

logging过程不是atomic，分成了三个过程。有两种log：Fully logged msg和Partial Logged msg。Fully表示msg和RSN都保存了。Partial表示只保存了msg，没有RSN（可能是receiver挂掉了但sender已经发送，因此没有生成RSN）。

Msg Logging Protocol

X发送msg给Y，同时X将msg存在内存中的msg log中
Y收到msg，自增RSN，将RSN附在ACK中发回给X
X收到ACK(RSN)中，发送ACK给Y表示收到RSN。（这个相比其他protocol的extra one)

Y在收到X的ACK之前不能向其他进程发送信息，这样做是为了防止orphan的出现，后面会讨论。但是不会阻塞它继续execute(compute)，如果我们不需要收到msg之后马上send（收和发可能没有逻辑关系，只是时间上相近），那么就不必担心这个地方带来的overhead。所以需要思考一下pipeline中是否能保证两者有一定的时间差。

实现上的Trick：如果使用RPC来进行通信，那么就不会有显示的ack packet。但是我们可以用RPC的返回值来返回RSN，之后再用一个RPC（它的返回值）来传递对于RSN的ACK。

截屏2021-09-23 下午11 35 19

Failure Recovery Protocol

Replace a processor and load checkpoint then broadcast request for its logged msg. 这里可以不用recovery process自己broadcast，可以用agent来指导恢复
Resent fully logged msg based on RSN。会从checkpoint中的RSN开始，执行顺序和SSN一致。recovery process在接收过程中，如果发现收到的SSN小于或者等于当前收到的最高SSN，那说明msg出现重复，即丢弃
Resent partial logged msg in any order

最后一部分没太明白

If the receiver has not checkpointed since originally receiving this message, it returns an acknowledgement including the RSN that it assigned when it rst received this message. This RSN value is retrieved from its table recording the correspondence between the SSN of each message received and the RSN value assigned to that message. However, if the receiving process has been checkpointed since this message was rst received, this table entry will have been purged, and an indication that this message need not be logged at the sender is returned instead.

Correctness

Fully Log的顺序和正常执行时是完全一样的，所以correct。对于partial log，可能顺序会出现不一致，但是也仅仅会影响receiver（recover process），但是不会影响到其他系统中的process。

同时一个进程在恢复的过程中，会resend在checkpoint和failure这段时间时间之内的msg给自己的下游，下游可以根据SSN来判断是否是重复的信息。如果是重复的，下游会自动丢弃。恢复好的进程的内存中，会恢复之前store的下游msg log。这些log其实就是刚刚在恢复的时候发送给下游的msg。如果此时下游的结点在这些msg之后已经做了checkpoint，那么下游结点会给sender发送一个indicator来表示该msg log可以删除。但是，从这个角度来看，下游做了checkpoint之后并没有马上发送一个indicator给上游以便让上游清理msg log，而是等到上游挂了需要恢复的时候才这样做（可能内存溢出之后，就算挂了，然后开始恢复，恢复过程中清理，这样也太pessmistic了）。可以考虑下游做了checkpoint就通知上游清理吧，而且既然是全局的checkpoint，那么就由全局的agent来负责garbage collection吧。

最后文章提到该方法不会产生domino effect

An Optimistic Alternative

receiver在收到一个msg之后可以马上send自己的msg给下游，不需要等待。 Assumption：log能在failure发生之前结束，并且能够获得足够多的信息来使得系统回滚到一个consistent state

但是在这种情况下，可能会出现orphan：X收到一个信息M，但是此时X还没有发送RSN或者RSN还没有发送成功给X的上游，即M的RSN还没有被上游Log。此时X马上发送一个消息N给Y。之后X出现failure。因为X的上游没有保存M的RSN，X可能无法恢复到发送N之前的状态，那么Y就变成了一个orphan

截屏2021-09-24 上午9 55 42

文中说了一个orphan detect的机制。不详细解释了：

If the current RSN of a process is included in all messages sent by the process, and if each process maintains a table of the highest RSN it has received from any process, process Y has become an orphan from the failure of process X , if the value for X in its RSN table is higher than the RSN to which X was able to recover from the sequence of fully logged messages.

orphan-detect被调用的时间一般是checkpoint之前，如果此时有orphan，那么应该先处理orphan再进行checkpoint

如何处理orphan呢？假如识别了多个orphan，那么一次fail一个orphan来恢复。恢复之后会自动到一个consistent state

PeterSH6 commented 3 years ago

Related Work

16

PeterSH6 / paper-notes