duolingo / halflife-regression


negative weight for "right" feature #7

Open garfieldnate opened 5 years ago

garfieldnate commented 5 years ago

@sueyhan previously opened a ticket for this, but then closed it without getting a response.

These are the model weights I get when training without lexical features (`python experiment.py -l settles.acl16.learning_traces.13m.csv.gz`):

    wrong -0.2245
    right -0.0125
    bias   7.5365

I do not see how it can be correct that the right feature has a negative weight. A negative weight causes the half-life to get *shorter* as a user accumulates correct answers, so the model will predict a lower and lower probability of recall the more the user practices.
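To make this concrete, here is a small sketch (my own, not code from this repo) of the paper's prediction rule, $\hat h_\Theta = 2^{\Theta \cdot \mathbf{x}}$, plugged with the weights above. I believe experiment.py actually feeds square-root-scaled counts into the features; I use raw counts here since that doesn't change the direction of the effect:

```python
# Sketch of the HLR half-life rule from the paper (not code from this
# repo): h_hat = 2^(theta . x), in days.
WEIGHTS = {"right": -0.0125, "wrong": -0.2245, "bias": 7.5365}

def predicted_half_life(n_right, n_wrong):
    # Raw counts for simplicity; experiment.py (I believe) uses
    # sqrt-scaled counts, which doesn't change the sign of the effect.
    theta_x = (WEIGHTS["right"] * n_right
               + WEIGHTS["wrong"] * n_wrong
               + WEIGHTS["bias"])
    return 2.0 ** theta_x

# Each extra correct answer multiplies the half-life by
# 2^-0.0125 ~= 0.991, so more practice -> shorter predicted half-life:
print(predicted_half_life(10, 0))   # ~170 days
print(predicted_half_life(100, 0))  # ~78 days
```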

How can this be correct?

fasiha commented 4 years ago

Thanks for asking this interesting question. I'm not affiliated with the paper, but here's a plot that I think helps me understand the problem. It shows the history_seen column of the data (right + wrong, the total number of times a word has been seen) against the p_recall column, for the first million rows. (For other readers as confused as I initially was: a single Duolingo quiz or session can review the same word multiple times, which is why p_recall is a per-session score rather than just 0 or 1.)

[plot: history_seen vs. p_recall, first million rows]

Even assuming that rows with history_seen > 1000 are fake or erroneous (or, who knows, maybe real), note how wide the range of p_recall is at high values of history_seen: even if you've reviewed a word hundreds of times, sometimes you're going to typo, or forget, or mess up somewhere else in the sentence and be penalized for this word. My guess is that the only way the regression can balance this wide range of p_recall is by fitting near-zero weights for the right and wrong features. (A sketch for reproducing the plot follows.)
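For anyone who wants to poke at the data themselves, something like this should reproduce the scatter above (my sketch with pandas/matplotlib, not the code that made the original figure; the column names are from the dataset's CSV header):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rough sketch for reproducing the scatter above; pandas infers gzip
# compression from the .gz extension.
df = pd.read_csv("settles.acl16.learning_traces.13m.csv.gz",
                 nrows=1_000_000)
plt.scatter(df["history_seen"], df["p_recall"], s=1, alpha=0.1)
plt.xlabel("history_seen (total prior exposures of the word)")
plt.ylabel("p_recall (per-session score)")
plt.show()
```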

This would suggest that the bulk of the regression's predictive power comes from the bias term, which the authors allude to in the last row of Table 2: it gives the mean absolute error in per-session score for a dumb predictor that always returns the average score of 0.859 ($\bar p$). That dumb estimator outperforms every other estimator except HLR/HLR-lex, which might help convince us of the above analysis.
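That baseline is easy to check yourself; a sketch, assuming the same dataset file as above:

```python
import pandas as pd

# Sketch: mean absolute error of the constant predictor that always
# guesses the dataset's average per-session score, p_bar.
df = pd.read_csv("settles.acl16.learning_traces.13m.csv.gz")
p_bar = df["p_recall"].mean()  # reported as 0.859 in the paper
mae = (df["p_recall"] - p_bar).abs().mean()
print(p_bar, mae)
```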

If this understanding is correct, it could imply that optimizing for weights that drive the error in both $p$ (the observed recall) and $h$ (the half-life target) to zero, as in the paper's loss $\ell = (p - \hat p_\Theta)^2 + \alpha (h - \hat h_\Theta)^2 + \lambda \lVert \Theta \rVert_2^2$, is problematic.
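A toy illustration of why a rare failure late in a word's history acts as a huge outlier under a squared-error fit (the numbers here are made up):

```python
# Suppose the model has settled near p_hat = 0.97 for a well-practiced
# word. One failed session (p = 0) contributes roughly 1000x the
# squared error of a successful one (p = 1):
p_hat = 0.97
print((0.0 - p_hat) ** 2)  # ~0.94
print((1.0 - p_hat) ** 2)  # ~0.0009
```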

If I were to hazard a guess, you'd really want to be Bayesian about this: after potentially hundreds (thousands?) of exposures to a word, you should have arrived at a confident prior about your future performance. If you then happen to fail, that failure shouldn't automatically become a major outlier for the optimization to deal with; it should be folded more gracefully into a posterior. (Ebisu tries to do something like this; a rough sketch of the idea follows.)
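A minimal sketch of that idea (not Ebisu's actual API): treat the recall probability at some fixed review interval as Beta-distributed, so that one late failure nudges the posterior instead of acting as an outlier:

```python
# Hypothetical prior after ~300 mostly-successful reviews: recall
# probability at a fixed interval ~ Beta(alpha, beta).
alpha, beta = 300.0, 10.0
prior_mean = alpha / (alpha + beta)      # ~0.968

# One observed failure adds a single failure pseudo-count:
beta += 1.0
posterior_mean = alpha / (alpha + beta)  # ~0.965, a graceful update

print(prior_mean, posterior_mean)
```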

You might argue that with sufficient data, that should happen automatically. Given that the bias term is so much larger than the two features' weights in HLR-lex, it would appear that it doesn't. Furthermore, relying on more data to arrive at the "right" answer (in our case, a positive and perhaps sizable weight for right) is itself problematic, because you're praying for a preponderance of data that fits your expectation of what's right.

garfieldnate commented 4 years ago

Thanks for commenting with so many details! I will be looking into Ebisu :)