FAST '21 | CheckFreq: Frequent, Fine-Grained DNN Checkpointing

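A senior labmate's write-up already explains the main ideas clearly, so I will just add a few supplementary notes and try to summarize the theory part: https://github.com/vycezhong/read-papers/issues/162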

Recovery Guarantees:

There is at most one ongoing checkpoint operation in the system at any point in time.

On recovery, the system rolls back either to the most recent checkpoint or to the one before it, because the most recent checkpoint may not have finished yet.
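A minimal sketch of this recovery rule, assuming every checkpoint carries a flag saying whether it was fully written. The `Checkpoint` record and `pick_recovery_point` are hypothetical names for illustration, not CheckFreq's API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Checkpoint:
    iteration: int   # training iteration the checkpoint was taken at
    complete: bool   # False if a crash interrupted the write (assumed flag)


def pick_recovery_point(checkpoints: Sequence[Checkpoint]) -> Optional[Checkpoint]:
    """Return the newest fully written checkpoint, or None if there is none."""
    finished = [c for c in checkpoints if c.complete]
    return max(finished, key=lambda c: c.iteration, default=None)


# The newest checkpoint (iteration 200) was still in flight at crash time,
# so recovery falls back to the one before it.
history = [Checkpoint(100, True), Checkpoint(200, False)]
recovered = pick_recovery_point(history)
```

Because at most one checkpoint is in flight at any time, the fallback is never older than the checkpoint before the most recent one.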

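persist() is tightly coupled with compute, yet it runs as a background process. If persist() has not finished when compute reaches the update, compute has to wait until persist() completes.
In snapshot(), the system first checks whether the GPU has enough free memory to hold the snapshot. If it does, the snapshot is kept in GPU memory and is later written to reliable storage in the persist() phase. If GPU memory is insufficient, the snapshot is copied GPU->CPU, but this case may cause stalls on the critical path.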
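A rough sketch of the two behaviors described here, under my own assumptions: free GPU memory is estimated as `M_max - M` (total minus peak utilization), and the background persist is modeled with a plain thread. `choose_snapshot_location`, `BackgroundPersister`, and `write_fn` are hypothetical names, not CheckFreq's API.

```python
import threading
from typing import Any, Callable, Optional


def choose_snapshot_location(m: float, M: float, M_max: float) -> str:
    """Keep the snapshot on the GPU if a checkpoint of size m fits in the
    memory left over after peak usage M; otherwise fall back to the CPU."""
    return "gpu" if m <= M_max - M else "cpu"


class BackgroundPersister:
    """Writes snapshots to storage in the background, one at a time."""

    def __init__(self, write_fn: Callable[[Any], None]) -> None:
        self._write_fn = write_fn                      # assumed storage writer
        self._worker: Optional[threading.Thread] = None

    def wait(self) -> None:
        """Block until the in-flight persist (if any) has finished."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def persist(self, snapshot: Any) -> None:
        """Start persisting a snapshot; stalls if the previous one is unfinished."""
        self.wait()  # enforces "at most one ongoing checkpoint"
        self._worker = threading.Thread(target=self._write_fn, args=(snapshot,))
        self._worker.start()


written = []                           # stands in for reliable storage
persister = BackgroundPersister(written.append)
persister.persist({"step": 1})
persister.persist({"step": 2})         # waits for step 1 before starting
persister.wait()
```

The call to `wait()` inside `persist()` is where the stall on the critical path would show up if storage is slower than the checkpoint interval.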

Performing a synchronous in-memory copy of the model state from GPU to CPU is expensive due to the increasingly fast compute capabilities of the GPU.

Checkpoint Frequency Policy

Checkpointing every iteration is possible, but for some tasks the overhead is still too high. Checkpointing every k iterations is a good alternative, but how do we choose the best k? => Profile

What to profile: the iteration time (Ti), time to perform weight update (Tw), time to create an in-memory GPU copy (Tg), time to create an in-memory CPU copy (Tc), time to write to storage (Ts), size of checkpoint (m), peak GPU memory utilization (M), and total GPU memory (Mmax). Based on CheckFreq’s two-phase checkpointing mechanism, the frequency determination algorithm is as shown in Algorithm 1.
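Algorithm 1 is not reproduced in these notes, so the following is only my simplified sketch of how the profiled quantities could turn into an interval k, not the paper's algorithm. It assumes (1) the snapshot costs Tg if it fits on the GPU and Tc otherwise, (2) the snapshot can hide behind the part of an iteration before the weight update (Ti - Tw), (3) k must be large enough for the previous checkpoint to finish, and (4) the remaining stall per checkpoint, amortized over k iterations, must stay within an overhead budget p. `Profile` and `checkpoint_interval` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    t_i: float    # iteration time (Ti)
    t_w: float    # weight update time (Tw)
    t_g: float    # time to create an in-memory GPU copy (Tg)
    t_c: float    # time to create an in-memory CPU copy (Tc)
    t_s: float    # time to write to storage (Ts)
    m: float      # checkpoint size
    M: float      # peak GPU memory utilization
    M_max: float  # total GPU memory


def checkpoint_interval(prof: Profile, p: float) -> int:
    """Pick the smallest k that satisfies the assumed constraints (3) and (4)."""
    if p <= 0 or prof.t_i <= 0:
        raise ValueError("p and t_i must be positive")
    # Assumption (1): snapshot on the GPU if it fits, else on the CPU.
    t_snap = prof.t_g if prof.M + prof.m <= prof.M_max else prof.t_c
    # Assumption (2): only the part not hidden behind compute stalls training.
    stall = max(0.0, t_snap - (prof.t_i - prof.t_w))
    # Assumption (3): snapshot + persist must finish within k iterations.
    k_persist = math.ceil((t_snap + prof.t_s) / prof.t_i)
    # Assumption (4): stall / (k * Ti) <= p.
    k_budget = math.ceil(stall / (p * prof.t_i))
    return max(1, k_persist, k_budget)


fits_on_gpu = Profile(t_i=1.0, t_w=0.25, t_g=0.125, t_c=1.25, t_s=2.0,
                      m=2, M=10, M_max=16)
needs_cpu = Profile(t_i=1.0, t_w=0.25, t_g=0.125, t_c=1.25, t_s=2.0,
                    m=8, M=10, M_max=16)
k_gpu = checkpoint_interval(fits_on_gpu, p=0.0625)
k_cpu = checkpoint_interval(needs_cpu, p=0.0625)
```

In this toy setting the GPU snapshot is fully hidden, so k is bounded only by the storage write, while the CPU fallback stalls compute and pushes k higher to stay within the budget.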

The system keeps profiling the checkpoint overhead online; if the overhead exceeds p, it recomputes the checkpoint frequency.
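A minimal sketch of this online adjustment, assuming overhead is measured as the relative slowdown of the average iteration time against a checkpoint-free baseline, and that overhead scales inversely with k. Both are my simplifications, and `adapt_interval` is a hypothetical helper, not CheckFreq's API.

```python
import math


def adapt_interval(k: int, baseline_iter_time: float,
                   observed_iter_time: float, p: float) -> int:
    """Return a new interval if the measured overhead exceeds the budget p."""
    overhead = (observed_iter_time - baseline_iter_time) / baseline_iter_time
    if overhead <= p:
        return k  # still within budget, keep the current frequency
    # Assumed model: overhead is proportional to 1/k, so scale k up by overhead/p.
    return math.ceil(k * overhead / p)


unchanged = adapt_interval(10, baseline_iter_time=1.0, observed_iter_time=1.03, p=0.05)
relaxed = adapt_interval(10, baseline_iter_time=1.0, observed_iter_time=1.25, p=0.125)
```

In the second call the measured overhead (25%) is twice the budget, so the interval doubles under the assumed model.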
