garfieldnate opened this issue 5 years ago
Thanks for asking this interesting question. I'm not affiliated with the paper, but here's a plot that I think helps me understand the problem: it shows the `history_seen` column of the data (`right + wrong`, the total number of times a word has been seen) versus the `p_recall` column (the per-session score; for other readers as confused as I initially was, note that Duolingo's quizzes can consist of reviewing the same word multiple times in one quiz or session), for the first million rows of the data:
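For anyone who wants to poke at this themselves, here's a minimal sketch of how this kind of plot can be reproduced (assuming the `history_seen` and `p_recall` column names from the released CSV; this isn't necessarily the exact code behind the figure):

```python
# Sketch: scatter of per-session score vs. total prior exposures, first 1M rows.
# Assumes the released settles.acl16.learning_traces.13m.csv.gz with columns
# `history_seen` and `p_recall`.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("settles.acl16.learning_traces.13m.csv.gz", nrows=1_000_000)

plt.scatter(df["history_seen"], df["p_recall"], s=2, alpha=0.1)
plt.xscale("log")
plt.xlabel("history_seen (total prior exposures of the word)")
plt.ylabel("p_recall (per-session score)")
plt.title("Per-session recall vs. prior exposures (first 1M rows)")
plt.show()
```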
Even assuming that rows with `history_seen > 1000` are fake or erroneous (or who knows, maybe real), do note how wide the range of `p_recall` is at high values of `history_seen`: even if you've reviewed a word hundreds of times, sometimes you're going to typo, or forget, or mess up somewhere else in the sentence and be penalized for this word. Now, I'd guess that the only way the regression can balance this wide range of `p_recall` is by fitting near-zero weights for the `right` and `wrong` features.
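For context, my recollection of how these weights enter the HLR model (worth double-checking against the paper) is:

$$\hat h_\Theta = 2^{\Theta \cdot \mathbf{x}}, \qquad \hat p_\Theta = 2^{-\Delta / \hat h_\Theta},$$

where $\mathbf{x}$ holds the feature values (the `right`/`wrong` history counts, possibly transformed, plus a bias, and optionally per-lexeme indicators) and $\Delta$ is the time since the last practice. If the `right` and `wrong` weights are near zero, $\hat h_\Theta$ collapses to roughly $2^{\text{bias}}$ for everyone, regardless of review history.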
This would suggest that the bulk of the regression's predictive power comes from the bias term. The authors allude to this in Table 2's last row, which gives the mean absolute error in per-session score for a dumb predictor that always returns the average score of 0.859 ($\bar p$). This dumb estimator outperforms every other estimator except HLR/HLR-lex, which might help convince us of the above analysis.
If this understanding is correct, it could imply that optimizing for weights that drive the error in both $p$ and $h$ to zero is problematic.
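For reference, the per-instance loss in the paper, as I remember it (again, please verify against the paper; $\alpha$ and $\lambda$ are its hyperparameters), is

$$\ell(\langle p, \Delta, \mathbf{x}\rangle; \Theta) = \left(p - \hat p_\Theta\right)^2 + \alpha\left(h - \hat h_\Theta\right)^2 + \lambda \lVert \Theta \rVert_2^2,$$

where the "observed" half-life is back-computed from the session score as $h = -\Delta / \log_2 p$. Both squared-error terms pull on the same weights, which is what I mean by driving the error in both $p$ and $h$ to zero.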
If I were to hazard a guess, you'd really want to be Bayesian about this: after potentially hundreds (thousands?) of times seeing a word, you have to arrive at some prior about your future performance. If you then happen to fail, that shouldn't automatically become a major outlier for the optimization to deal with, but rather should be folded more gracefully into a posterior. (Ebisu tries to do something like this.)
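To make the intuition concrete, here's a toy Beta-binomial sketch of the idea (just the general flavor; this is not Ebisu's actual API, which also models how recall decays over time):

```python
# Toy Beta-binomial update: a word seen many times has a concentrated prior on
# recall probability, so a single failure barely moves the posterior instead of
# acting like a huge outlier. (Illustrative only; not Ebisu's real API.)

def update(alpha: float, beta: float, successes: int, trials: int):
    """Conjugate update of a Beta(alpha, beta) prior with binomial quiz results."""
    return alpha + successes, beta + (trials - successes)

def mean(alpha: float, beta: float) -> float:
    """Expected recall probability under the Beta posterior."""
    return alpha / (alpha + beta)

# Prior built from, say, 200 prior reviews with 190 correct:
a, b = update(1.0, 1.0, successes=190, trials=200)
print(mean(a, b))   # ~0.946

# One bad session (0 out of 1) nudges the estimate only slightly:
a, b = update(a, b, successes=0, trials=1)
print(mean(a, b))   # ~0.941
```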
You might argue that with sufficient data, that should automatically happen. Given that the bias term is so much larger than the two features' weights in HLR-lex, it would appear that's not happening. And furthermore, relying on more data to get the "right" answer (in our case, a positive and maybe sizable weight for `right`) is also problematic, because you're praying for a preponderance of data to fit your expectation of what's right.
Thanks for commenting with so many details! I will be looking into Ebisu :)
@sueyhan previously opened a ticket for this, but then closed it without getting a response.
These are the model weights I get for training without lexical features (`python experiment.py -l settles.acl16.learning_traces.13m.csv.gz`):

```
wrong  -0.2245
right  -0.0125
bias    7.5365
```
I do not see how it can be correct that the `right` feature has a negative weight. This will cause the half-life to get shorter as a user gets more correct answers, and therefore the model will predict a lower and lower probability of the user answering correctly. How can this be correct?
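To illustrate with the weights above (a rough sketch: I plug raw counts into $h = 2^{\Theta \cdot \mathbf{x}}$, whereas the repo's `experiment.py` may transform the counts, e.g. with square roots, so treat the exact numbers as illustrative rather than what the model actually predicts):

```python
# Illustrative only: plug the fitted weights above into the half-life formula
# h = 2^(theta . x). Feature scaling here (raw counts) may differ from the
# reference implementation, but the sign of the effect is the point.
W_RIGHT, W_WRONG, BIAS = -0.0125, -0.2245, 7.5365

def half_life(right: int, wrong: int) -> float:
    """Predicted half-life (in the paper's units, days I believe)."""
    return 2.0 ** (BIAS + W_RIGHT * right + W_WRONG * wrong)

for right in (0, 10, 50, 100):
    print(right, round(half_life(right, wrong=0), 1))
# With a negative weight on `right`, the predicted half-life *shrinks*
# as a learner accumulates more correct answers:
# 0 185.7
# 10 170.2
# 50 120.4
# 100 78.1
```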