How do you compute the gradient projection?

Impressive work on the innovative data selection method! I recently finished reading your paper, and I'm particularly curious about how the gradient projection is computed. The paper mentions using a 125M-parameter model and reducing the gradient dimension to 16384. Does this imply storing a 125M x 16384 projection matrix, i.e. 2048G (about 2 x 10^12) entries? That seems impractical given memory constraints. Even if the random projection matrix were generated on the fly, the computational cost of the projection would still be substantial. However, the paper suggests that the projection cost is only 1% of the forward-backward pass, which I find confusing. Could you explain how the projection is actually implemented? Thank you very much!
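To make the question concrete, here is a minimal NumPy sketch of what I mean by "on-the-fly" projection. It is only my own illustration under stated assumptions, not a claim about how the paper or this repo implements it: the function names (`dense_matrix_bytes`, `project_on_the_fly`), the Gaussian projection, the fp32 storage assumption, and the chunk size are all hypothetical choices of mine. The first function does the storage arithmetic for the sizes above; the second regenerates each block of the random matrix from a seed, so the full matrix is never held in memory.

```python
import numpy as np


def dense_matrix_bytes(n_params, proj_dim, bytes_per_entry=4):
    """Bytes needed to store a dense n_params x proj_dim matrix (fp32 assumed)."""
    return n_params * proj_dim * bytes_per_entry


def project_on_the_fly(grad, proj_dim, seed=0, chunk_size=1024):
    """Project a flat gradient to proj_dim dimensions without storing the matrix.

    Hypothetical sketch: rows of a Gaussian projection matrix are regenerated
    chunk by chunk from (seed, chunk index), so peak extra memory is
    chunk_size x proj_dim instead of len(grad) x proj_dim.
    """
    out = np.zeros(proj_dim, dtype=np.float64)
    for chunk_idx, start in enumerate(range(0, grad.shape[0], chunk_size)):
        g = grad[start:start + chunk_size]
        # Same (seed, chunk_idx) always yields the same block of the matrix.
        rng = np.random.default_rng([seed, chunk_idx])
        block = rng.standard_normal((g.shape[0], proj_dim))
        out += g @ block
    # Scale so squared norms are preserved in expectation.
    return out / np.sqrt(proj_dim)


# Storage for the sizes in the question: 125M parameters, 16384 dimensions.
full_bytes = dense_matrix_bytes(125_000_000, 16384)

# Tiny demo with toy sizes (the real sizes would be far too slow in NumPy).
toy_grad = np.random.default_rng(123).standard_normal(5000)
toy_proj = project_on_the_fly(toy_grad, proj_dim=512, seed=7)
```

This avoids the memory problem, but it still performs as many multiply-adds as the dense product would, which is exactly why the 1% figure surprises me.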

MadryLab / DsDm
