Closed PeterSH6 closed 3 years ago
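A senior labmate's write-up already explains the main ideas clearly, so I will only add a few supplementary notes and see whether I can summarize the theory part: https://github.com/vycezhong/read-papers/issues/162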
There is at most one ongoing checkpoint operation in the system at any point in time.
On recovery, the system rolls back either to the most recent checkpoint or to the one before it, because the most recent checkpoint may not have finished yet.
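The recovery rule can be sketched in a few lines of Python. This is only an illustration: `CheckpointRecord`, its `complete` flag, and `pick_recovery_checkpoint` are hypothetical names, and how CheckFreq actually marks a checkpoint as finished is not covered in these notes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CheckpointRecord:
    # Hypothetical metadata; a real system might use a marker file or an atomic rename.
    iteration: int
    complete: bool


def pick_recovery_checkpoint(
    records: Sequence[CheckpointRecord],
) -> Optional[CheckpointRecord]:
    """Return the newest checkpoint that finished persisting, or None if there is none."""
    finished = [r for r in records if r.complete]
    if not finished:
        return None
    return max(finished, key=lambda r: r.iteration)
```

Because at most one checkpoint is in flight at any time, the record returned is always either the latest checkpoint or the one immediately before it.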
Performing a synchronous in-memory copy of the model state from GPU to CPU is expensive due to the increasingly fast compute capabilities of the GPU.
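Checkpointing at every iteration is feasible, but for some jobs the overhead is still too high. Checkpointing every k iterations is a good alternative, but how do we choose the best k? => Profile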
What to profile: the iteration time (Ti), the time to perform the weight update (Tw), the time to create an in-memory GPU copy (Tg), the time to create an in-memory CPU copy (Tc), the time to write to storage (Ts), the size of the checkpoint (m), the peak GPU memory utilization (M), and the total GPU memory (Mmax). Based on CheckFreq's two-phase checkpointing mechanism, the frequency determination algorithm is as shown in Algorithm 1.
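As a rough illustration of how the profiled quantities could feed into a choice of k, here is a minimal Python sketch. It is not a transcription of Algorithm 1 from the paper. The cost model (the snapshot overlaps with the non-update part of the next iteration, and the persist phase runs in the background) and all function and parameter names are assumptions made for this example. `p` denotes the allowed overhead as a fraction of training time.

```python
import math


def snapshot_time(t_g, t_c, m, peak_mem, max_mem):
    # Assumption: snapshot on the GPU only if the checkpoint (size m) fits
    # next to the peak training footprint; otherwise fall back to a CPU copy.
    return t_g if peak_mem + m <= max_mem else t_c


def checkpoint_interval(t_i, t_w, t_g, t_c, t_s, m, peak_mem, max_mem, p):
    """Pick k (iterations between checkpoints) under a simplified cost model."""
    if t_i <= 0 or p <= 0:
        raise ValueError("t_i and p must be positive")
    t_snap = snapshot_time(t_g, t_c, m, peak_mem, max_mem)

    # Assumption: the snapshot overlaps with the part of the next iteration
    # that precedes the weight update (t_i - t_w); only the excess stalls training.
    stall = max(0.0, t_snap - (t_i - t_w))

    # Keep the amortized stall per iteration within the budget: stall / (k * t_i) <= p.
    k_overhead = math.ceil(stall / (p * t_i))

    # Only one checkpoint may be in flight, so the previous snapshot and write
    # should finish before the next checkpoint starts.
    k_pipeline = math.ceil((t_snap + t_s) / t_i)

    return max(1, k_overhead, k_pipeline)
```

Under this model, k is bounded by whichever constraint is tighter: the overhead budget or the time needed to drain the previous checkpoint to storage.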
The system keeps profiling the checkpoint overhead online; if the overhead exceeds p, it recomputes the checkpoint frequency.
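A minimal sketch of such an online check, assuming the overhead is measured as the relative slowdown of recent iterations against a checkpoint-free baseline. The measurement method and the names `measured_overhead` and `needs_retune` are assumptions for illustration, not taken from CheckFreq.

```python
from statistics import fmean
from typing import Sequence


def measured_overhead(baseline_iter_time: float, recent_iter_times: Sequence[float]) -> float:
    """Relative slowdown of recent iterations versus the checkpoint-free baseline."""
    if baseline_iter_time <= 0:
        raise ValueError("baseline_iter_time must be positive")
    if not recent_iter_times:
        raise ValueError("need at least one recent iteration time")
    return (fmean(recent_iter_times) - baseline_iter_time) / baseline_iter_time


def needs_retune(baseline_iter_time: float, recent_iter_times: Sequence[float], p: float) -> bool:
    """True when the observed overhead exceeds the budget p, so k should be recomputed."""
    return measured_overhead(baseline_iter_time, recent_iter_times) > p
```

A training loop could call `needs_retune` over a sliding window of iteration times and rerun the frequency computation whenever it returns `True`.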
Websites: CheckFreq: Frequent, Fine-Grained DNN Checkpointing. Overall, it feels like a very clean and concise piece of work.