duolingo/halflife-regression

Implementation of SGD. #2

Closed · musically-ut closed this issue 7 years ago

musically-ut commented 7 years ago

Thanks for making the great app even better and for uploading the code!

I am working on extending the model to allow each lexeme to have more features, and I have a couple of minor questions about the SGD implementation in the code:

  1. The learning rate in the `hlr` case seems to differ from the update given in the appendix of the paper by a factor of `1 / (1 + inst.p)`. Is there any particular reason for that? (See the sketch after this list for the update step I mean.)

  2. Usually, SGD iterations are repeated until some convergence criterion is met. In the code, however, it seems that only one pass is made over the complete data set. Was that because you empirically observed that the results had converged by then?
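
For concreteness, here is a minimal sketch of the update step I am asking about. It covers only the squared loss on `p` and omits the half-life term, clipping, and regularization; the names (`inst.p`, `inst.t`, `fv`, `fcounts`) mirror the repo, but this is my simplified reconstruction, not the actual code:

```python
import math
from collections import defaultdict, namedtuple

LN2 = math.log(2.)

# One training instance: observed recall rate p, lag time t (days),
# and a sparse feature vector fv as (feature, value) pairs.
Inst = namedtuple('Inst', ['p', 't', 'fv'])

def hlr_update(weights, fcounts, inst, lrate=0.001):
    """One SGD step on the squared p-loss for half-life regression."""
    # predicted half-life h = 2^(theta . x) and recall p = 2^(-t/h)
    h_hat = 2. ** sum(weights[k] * x_k for k, x_k in inst.fv)
    p_hat = 2. ** (-inst.t / h_hat)
    # d/dw_k of (p_hat - p)^2 is dlp_dw * x_k
    dlp_dw = 2. * (p_hat - inst.p) * (LN2 ** 2) * p_hat * (inst.t / h_hat)
    for k, x_k in inst.fv:
        # per-feature AdaGrad-style rate, scaled by the extra
        # 1/(1 + inst.p) factor that question 1 asks about
        rate = (1. / (1. + inst.p)) * lrate / math.sqrt(1 + fcounts[k])
        weights[k] -= rate * dlp_dw * x_k
        fcounts[k] += 1

weights, fcounts = defaultdict(float), defaultdict(int)
hlr_update(weights, fcounts, Inst(p=0.8, t=2.0, fv=[('bias', 1.0), ('lexeme:foo', 1.0)]))
```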

Thanks!

burrsettles commented 7 years ago
  1. This appears to be the case: because the data are biased toward high recall rates, the code (hackily) down-weights high-recall instances when learning (that is the `1 / (1 + inst.p)` factor in your sketch). You can experimentally determine whether this makes any difference (I don't remember whether it does at this point).

  2. We observed no difference in the metrics when taking multiple passes over the data. In my experience, SGD+AdaGrad for linear models on very large samples (like this one) does a very good job of avoiding overfitting while converging to something near-optimal in a single pass (roughly as sketched below).
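
Roughly, the single-pass regime looks like this (a generic sketch for a linear model with squared loss, not the exact code in this repo; per-feature update counts stand in for AdaGrad's accumulated squared gradients as a cheap rate schedule):

```python
import math
from collections import defaultdict

def sgd_adagrad_one_pass(data, lrate=0.1):
    """Single pass of SGD over (sparse feature vector, target) pairs
    for a linear model with squared loss (the factor of 2 is folded
    into the learning rate). Per-feature rates decay with how often
    each feature has been updated."""
    weights = defaultdict(float)
    counts = defaultdict(int)
    for fv, y in data:
        y_hat = sum(weights[k] * x_k for k, x_k in fv)
        err = y_hat - y
        for k, x_k in fv:
            rate = lrate / math.sqrt(1 + counts[k])
            weights[k] -= rate * err * x_k
            counts[k] += 1
    return weights  # no second pass, no convergence check

data = [([('bias', 1.0), ('f1', 2.0)], 3.0),
        ([('bias', 1.0), ('f2', 1.0)], 1.0)]
print(sgd_adagrad_one_pass(data))
```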