Closed ganler closed 3 years ago
There is a very interesting term called no sync window
, which means the period that the activation must be cached from being produced to being consumed (updated).
Distributed patterns: data | model | hybrid parallel
Problem: Compute underutilization
So what we can do to increase the pipeline overlap is to analyze the dependency of data flows & make some priorities / orders in sending the parameters and activation;
Sangeetha classified current research on DNN training acceleration into 3 parts:
Her TicTac [MLSys'19] is for PS; Caramel is for Ring Allreduce;
By Sangeetha Abdu Jyothi
https://www.youtube.com/watch?v=K9DIfGmbPu8