Overall architecture

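Architecturally, a persistent actor is implemented largely as a finite state machine (FSM). Because it has to interact with other actors (the journal actor and the snapshot store actor), it is wrapped in an outer BehaviorInterceptor that converts incoming messages into its internal protocol.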

def interceptor: BehaviorInterceptor[Any, InternalProtocol] = new BehaviorInterceptor[Any, InternalProtocol] {

  import BehaviorInterceptor._
  override def aroundReceive(
      ctx: typed.TypedActorContext[Any],
      msg: Any,
      target: ReceiveTarget[InternalProtocol]): Behavior[InternalProtocol] = {
    // Wrap the incoming message in the internal protocol.
    val innerMsg = msg match {
      case res: JournalProtocol.Response           => InternalProtocol.JournalResponse(res)
      case res: SnapshotProtocol.Response          => InternalProtocol.SnapshotterResponse(res)
      case RecoveryPermitter.RecoveryPermitGranted => InternalProtocol.RecoveryPermitGranted
      case internal: InternalProtocol              => internal // such as RecoveryTickEvent
      case cmd                                     => InternalProtocol.IncomingCommand(cmd.asInstanceOf[Command])
    }
    target(ctx, innerMsg)
  }

  override def toString: String = "EventSourcedBehaviorInterceptor"
}

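In addition, while the state machine is running, the actor may receive messages in one state that are meant to be handled in another state. Such messages must be either stashed or dropped.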

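Akka uses BehaviorSetup to hold everything the persistent actor needs, including a StashBuffer for messages that belong to other states. The BehaviorSetup is passed from state to state and is created outside the actor. The benefit of creating it outside is that when the persistent actor rolls back after a failed recovery, the state held in BehaviorSetup is preserved.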
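The interplay between states, a shared setup object, and the stash can be sketched in plain Scala without Akka. This is a minimal illustration under stated assumptions, not Akka's implementation: Setup, Replaying, Running, and the message types below are hypothetical names chosen for the example, and the "drop when full" behavior is only one possible policy.

```scala
import scala.collection.mutable

object FsmStashSketch {
  // Hypothetical message protocol, loosely mirroring the wrapped internal protocol.
  sealed trait Msg
  final case class ReplayedEvent(delta: Int) extends Msg
  case object RecoveryCompleted extends Msg
  final case class Command(delta: Int) extends Msg

  // Hypothetical stand-in for BehaviorSetup: created once, outside the states,
  // and handed to every state so the stash survives each transition.
  final class Setup(val stashCapacity: Int) {
    val stash: mutable.Queue[Msg] = mutable.Queue.empty
  }

  sealed trait State { def value: Int }
  final case class Replaying(value: Int) extends State
  final case class Running(value: Int) extends State

  def receive(setup: Setup, state: State, msg: Msg): State = (state, msg) match {
    case (Replaying(v), ReplayedEvent(d)) =>
      Replaying(v + d)
    case (Replaying(v), RecoveryCompleted) =>
      // Switch to Running, then replay everything that was stashed meanwhile.
      val buffered = setup.stash.dequeueAll(_ => true)
      buffered.foldLeft[State](Running(v))((s, m) => receive(setup, s, m))
    case (Replaying(_), _: Command) =>
      // A command belongs to the Running state: stash it, or drop it when full.
      if (setup.stash.size < setup.stashCapacity) setup.stash.enqueue(msg)
      state
    case (Running(v), Command(d)) =>
      Running(v + d)
    case (Running(_), _) =>
      // Replay messages have no meaning once running: drop them.
      state
  }
}
```

Because the Setup instance lives outside the State values, a state can be replaced or reset without losing the buffered messages, which is the property the notes attribute to BehaviorSetup.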

Akka Persistence

Roiocam commented 2 years ago

Performance-tuning parameters identified

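When an actor fails to load its snapshot, it can fall back to recovering its state from events
- Parameter: akka.persistence.snapshot-store-plugin-fallback.snapshot-is-optional, default false
When the database is overloaded, snapshot and event queries can take a long time. You can raise the query timeout and shrink the query window
- Query timeout: akka.persistence.journal-plugin-fallback.recovery-event-timeout, default 30s
- Enable windowed event queries: akka.persistence.journal-plugin-fallback.replay-filter.mode, default repair-by-discard-old, i.e. enabled
- Query window size: akka.persistence.journal-plugin-fallback.replay-filter.window-size, default 100
When the database is overloaded, persist requests take longer, and the actor stashes the pending requests in the meantime. One option is to batch writes; another is to tune the stash size inside each actor
- Stash overflow strategy of the persistent actor: akka.persistence.typed.stash-overflow-strategy, default drop
- Stash capacity of the persistent actor: akka.persistence.typed.stash-capacity, default 4096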

akka.persistence.typed {
    # Stash overflow strategy: "drop" (default) or "fail"
    stash-overflow-strategy = "drop"
    # Stash capacity
    stash-capacity = 4096
}
akka.persistence.journal-plugin-fallback {
    # Timeout for event-sourced recovery
    recovery-event-timeout = 30s
}
akka.persistence {
    # Snapshot plugin fault tolerance
    snapshot-store-plugin-fallback {
        snapshot-is-optional = false
    }
}
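To make the two stash settings concrete, here is a small plain-Scala model of a bounded stash with the two overflow strategies. It is a sketch under stated assumptions, not Akka's StashBuffer: BoundedStash, OverflowStrategy, and StashOverflow are hypothetical names, and the only behavior modeled is "drop silently" versus "throw when full".

```scala
import scala.collection.mutable

object StashOverflowSketch {
  // Hypothetical counterparts of the "drop" and "fail" settings.
  sealed trait OverflowStrategy
  case object Drop extends OverflowStrategy
  case object Fail extends OverflowStrategy

  // Hypothetical exception type used by the Fail strategy in this sketch.
  final class StashOverflow(capacity: Int)
      extends RuntimeException(s"stash capacity $capacity exceeded")

  final class BoundedStash[A](capacity: Int, strategy: OverflowStrategy) {
    private val buffer = mutable.Queue.empty[A]

    def size: Int = buffer.size

    // Returns true if the message was stored and false if it was dropped;
    // throws StashOverflow when the stash is full and the strategy is Fail.
    def stash(msg: A): Boolean =
      if (buffer.size < capacity) { buffer.enqueue(msg); true }
      else strategy match {
        case Drop => false
        case Fail => throw new StashOverflow(capacity)
      }

    // Hands back all stashed messages in arrival order and empties the stash.
    def unstashAll(): List[A] = {
      val all = buffer.toList
      buffer.clear()
      all
    }
  }
}
```

With Drop, overload shows up as silently lost messages; with Fail, it shows up as an exception, which is easier to notice but turns overload into actor failures. Raising the capacity only postpones either outcome while the database stays slow.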

Causes of some errors

Database bottleneck

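If the database is overloaded while an actor is recovering, Akka internally applies something similar to circuit breaking. When the database is overloaded:

Snapshot query: if the database does not respond within recovery-event-timeout, the actor's recovery request is cancelled and an exception is thrown
Event query: if windowed queries are enabled and the expected number of database replies does not arrive within the recovery-event-timeout window, the actor's recovery request is cancelled and an exception is thrown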
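One way to picture this timeout is a periodic tick that checks whether any progress was made since the previous tick. The following plain-Scala model is a simplified sketch under that assumption, not Akka's code: the Signal and Outcome types and the run function are hypothetical, and Tick stands for a timer that fires once per recovery-event-timeout.

```scala
import scala.annotation.tailrec

object RecoveryTimeoutSketch {
  // Hypothetical signals seen by a recovering actor.
  sealed trait Signal
  case object EventReplayed extends Signal // the database delivered an event
  case object Tick extends Signal          // one recovery-event-timeout interval elapsed
  case object ReplayDone extends Signal    // the database finished the replay

  sealed trait Outcome
  case object Recovered extends Outcome
  case object TimedOut extends Outcome
  case object StillRecovering extends Outcome

  // Recovery fails as soon as a whole interval passes without any replayed event.
  def run(signals: List[Signal]): Outcome = {
    @tailrec
    def loop(rest: List[Signal], seenInInterval: Boolean): Outcome = rest match {
      case Nil                   => StillRecovering
      case EventReplayed :: tail => loop(tail, seenInInterval = true)
      case ReplayDone :: _       => Recovered
      case Tick :: tail =>
        if (seenInInterval) loop(tail, seenInInterval = false) // progress: start a new interval
        else TimedOut                                          // silence: give up on recovery
    }
    loop(signals, seenInInterval = false)
  }
}
```

In this model a slow but steadily responding database never times out, while a database that goes silent for one full interval does. That is why raising recovery-event-timeout helps under overload.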

Stash bottleneck

As the runtime diagram of the persistent actor shows, it relies on two stashes at runtime: an internal stash and a user stash.

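User stash: currently used only when the user explicitly calls Effect.stash.
Internal stash: buffers messages while the actor is not in its command-handling state, for example commands received while events are being replayed.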

Two situations can cause a full stash:

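Event-sourced recovery takes too long: traffic keeps arriving in the meantime, so every message beyond the stash capacity is dropped. In this case, messages should be held back before the actor starts recovering, so that traffic only reaches it after recovery has completed
Event persistence takes too long: when the database is a significant bottleneck, or the actor itself receives too much traffic, the stash is overwhelmed almost instantly while a persist is in flight. In this case, the actor should stash messages while it is persisting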

Roiocam / akka-learnning-notes

Akka Persistence #5
