PeterSH6 / paper-notes

My paper reading notes

FAST '21 | CheckFreq: Frequent, Fine-Grained DNN Checkpointing #10

Closed PeterSH6 closed 3 years ago

PeterSH6 commented 3 years ago

Website: CheckFreq: Frequent, Fine-Grained DNN Checkpointing. This strikes me as a very clean, concise piece of work.

PeterSH6 commented 3 years ago

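A senior labmate's write-up already explains the main ideas clearly, so I will only add a few supplementary notes and see whether I can summarize the theory part: https://github.com/vycezhong/read-papers/issues/162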

Recovery Guarantees:

There is at most one ongoing checkpoint operation in the system at any point in time.

On recovery, training rolls back either to the most recent checkpoint or to the one before it, because the most recent checkpoint may not have finished yet.
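The two guarantees above can be sketched in plain Python. This is a minimal illustration, not CheckFreq's implementation: the class name `OneAtATimeCheckpointer`, the file naming scheme, and the use of `pickle` with an atomic rename are all assumptions made for the example. A checkpoint file becomes visible only once it is fully written, so recovery picks the newest complete file, which is either the latest checkpoint or the one before it.

```python
import copy
import os
import pickle
import threading


class OneAtATimeCheckpointer:
    """Hypothetical helper: at most one checkpoint in flight, atomic publish."""

    def __init__(self, directory):
        self.directory = directory
        self._inflight = None  # background thread persisting the last snapshot

    def checkpoint(self, state, iteration):
        # Wait for the previous checkpoint so only one is ever in progress.
        if self._inflight is not None:
            self._inflight.join()
        snapshot = copy.deepcopy(state)  # phase 1: in-memory snapshot
        self._inflight = threading.Thread(
            target=self._persist, args=(snapshot, iteration)
        )
        self._inflight.start()  # phase 2: persist in the background

    def _persist(self, snapshot, iteration):
        final_path = os.path.join(self.directory, f"ckpt_{iteration:08d}.pkl")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump({"iteration": iteration, "state": snapshot}, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename: the .pkl file is complete as soon as it is visible.
        os.replace(tmp_path, final_path)

    def wait(self):
        if self._inflight is not None:
            self._inflight.join()
            self._inflight = None

    def recover(self):
        # Ignore *.tmp files: they belong to a checkpoint that never finished.
        done = sorted(n for n in os.listdir(self.directory) if n.endswith(".pkl"))
        if not done:
            return None
        with open(os.path.join(self.directory, done[-1]), "rb") as f:
            return pickle.load(f)
```

If a failure interrupts `_persist`, only a `.tmp` file is left behind, and `recover()` falls back to the previous completed checkpoint.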


performing a synchronous in-memory copy of the model state from GPU to CPU is expensive due to the increasingly fast compute capabilities of the GPU.

Checkpoint Frequency Policy

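Checkpointing every iteration is possible, but for some workloads the overhead is still too high. Checkpointing every k iterations is a good alternative, but how do we choose the best k? => Profile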

What to profile: the iteration time (T_i), time to perform the weight update (T_w), time to create an in-memory GPU copy (T_g), time to create an in-memory CPU copy (T_c), time to write to storage (T_s), size of the checkpoint (m), peak GPU memory utilization (M), and total GPU memory (M_max). Based on CheckFreq's two-phase checkpointing mechanism, the frequency determination algorithm is given as Algorithm 1 in the paper.
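As a rough illustration of how the profiled quantities could feed into a choice of k, here is a simplified sketch. It is not Algorithm 1 from the paper; the cost model is an assumption made for this example. The snapshot is placed in GPU memory if it fits (M + m <= M_max) and in CPU memory otherwise, it is assumed to overlap with the next iteration's forward and backward pass (T_i - T_w), and only the part that does not overlap counts as a stall. The function name and the overhead budget `p` (a fraction of training time) are likewise illustrative.

```python
import math


def choose_checkpoint_interval(t_i, t_w, t_g, t_c, t_s, m, peak_mem, max_mem, p):
    """Return (k, snapshot_location) under a simplified, assumed cost model."""
    if t_i <= 0 or p <= 0:
        raise ValueError("t_i and p must be positive")

    # Snapshot in GPU memory when the checkpoint fits next to the peak working set.
    if peak_mem + m <= max_mem:
        location, t_snap = "gpu", t_g
    else:
        location, t_snap = "cpu", t_c

    # Assumption: the snapshot overlaps with the next iteration's forward and
    # backward pass (t_i - t_w); only the remainder stalls training.
    stall = max(0.0, t_snap - (t_i - t_w))

    # Keep the amortized stall within budget: stall / (k * t_i) <= p.
    k_overhead = math.ceil(stall / (p * t_i))

    # Let the previous checkpoint finish before the next one starts,
    # since at most one checkpoint may be in flight.
    k_inflight = math.ceil((t_snap + t_s) / t_i)

    return max(1, k_overhead, k_inflight), location


# Times in seconds, memory in GB; all values are made up for illustration.
k, where = choose_checkpoint_interval(
    t_i=0.2, t_w=0.02, t_g=0.05, t_c=0.5, t_s=2.0,
    m=1.0, peak_mem=10.0, max_mem=16.0, p=0.035,
)
```

Under this model, a slow storage write mainly raises the lower bound that keeps checkpoints from piling up, while a slow snapshot raises the bound that comes from the overhead budget.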

The system keeps profiling the checkpoint overhead online. If the overhead exceeds the threshold p, it recomputes the checkpoint frequency.
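This adaptive behaviour can be sketched as a small monitor. It is an illustration only: the class name `OverheadMonitor`, the sliding-window average, and the proportional rule for picking the new k are assumptions, not the paper's algorithm. The monitor compares measured iteration times against a checkpoint-free baseline and re-tunes k when the observed overhead exceeds p.

```python
import math
from collections import deque


class OverheadMonitor:
    """Hypothetical online monitor that re-tunes the checkpoint interval k."""

    def __init__(self, baseline_iter_time, p, k, window=100):
        self.baseline = baseline_iter_time  # iteration time without checkpointing
        self.p = p                          # allowed overhead, as a fraction
        self.k = k                          # current checkpoint interval
        self.times = deque(maxlen=window)   # recent measured iteration times

    def observed_overhead(self):
        mean = sum(self.times) / len(self.times)
        return mean / self.baseline - 1.0

    def record(self, iter_time):
        """Record one iteration; return True if k was recomputed."""
        self.times.append(iter_time)
        if len(self.times) < self.times.maxlen:
            return False  # wait for a full window before judging
        overhead = self.observed_overhead()
        if overhead <= self.p:
            return False
        # Assumed rule: overhead scales roughly as 1/k, so grow k proportionally.
        self.k = math.ceil(self.k * overhead / self.p)
        self.times.clear()  # measurements taken under the old k no longer apply
        return True
```

A training loop would call `record()` once per iteration and read `monitor.k` whenever it decides whether to checkpoint.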
