MadryLab / DsDm

How do you compute the gradient projection? #1

Open aztec1900 opened 4 months ago

aztec1900 commented 4 months ago

Impressive work on the data selection method! I recently finished reading your paper, and I'm particularly curious about how the gradient projection is computed. The paper mentions using a 125M-parameter model and reducing the gradient dimension to 16,384. Does this imply storing a 125M × 16,384 projection matrix, i.e. roughly 2,048G (about 2 × 10^12) entries? That seems impractical given memory constraints. Even if the random projection matrix were generated on the fly, the computational cost of the projection would still be substantial. However, the paper suggests that the projection costs only 1% of the forward-backward pass, which I find confusing. Could you explain how this is achieved? Thank you very much!
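For reference, the arithmetic in the question and the on-the-fly alternative it mentions can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the paper or in this repository: the function names `dense_matrix_bytes` and `project_on_the_fly`, the choice of a random ±1 (Rademacher) matrix, fp32 storage, and the chunk size are all hypothetical. The sketch regenerates each block of the projection matrix from a seed, so peak memory is `chunk × proj_dim` entries rather than the full matrix.

```python
import numpy as np


def dense_matrix_bytes(n_params, proj_dim, bytes_per_entry=4):
    """Memory needed to store the full projection matrix (fp32 assumed)."""
    return n_params * proj_dim * bytes_per_entry


def project_on_the_fly(grad, proj_dim, seed=0, chunk=4096):
    """Project a flat gradient to proj_dim dimensions without ever
    materializing the full (n_params x proj_dim) matrix.

    Each block of rows is regenerated from (seed, start), so the same
    implicit matrix is applied on every call with the same seed.
    """
    out = np.zeros(proj_dim, dtype=np.float64)
    for start in range(0, grad.shape[0], chunk):
        block = grad[start:start + chunk]
        rng = np.random.default_rng([seed, start])
        # Random +1/-1 entries for this block of rows only.
        signs = rng.integers(0, 2, size=(block.shape[0], proj_dim),
                             dtype=np.int8) * 2 - 1
        out += block @ signs
    # Scale so that squared norms are preserved in expectation.
    return out / np.sqrt(proj_dim)


# 125M parameters x 16,384 dimensions x 4 bytes: about 8.2 TB in fp32.
print(dense_matrix_bytes(125_000_000, 16_384))  # 8192000000000

# Small-scale usage: the same seed reproduces the same projection.
g = np.random.default_rng(1).standard_normal(10_000)
p = project_on_the_fly(g, proj_dim=256, seed=0)
```

Note that this only removes the memory cost. The arithmetic is still proportional to `n_params × proj_dim` per gradient, so the sketch does not by itself explain the 1% figure asked about above.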

yuzc19 commented 3 months ago

Hi @aztec1900, have you made any progress on this issue? I am also very interested in it, but I couldn't find the code for estimating datamodels in this repo.