fasiha / ebisu

Public-domain Python library for flashcard quiz scheduling using Bayesian statistics. (JavaScript, Java, Dart, and other ports available!)
https://fasiha.github.io/ebisu
The Unlicense

Why is predicted half-life not monotonic for failures as the elapsed time increases? #1

Closed fasiha closed 5 years ago

fasiha commented 7 years ago

Consider the “new half-life” plot from the Ebisu paper:

[Figure: the original "new half-life" plot, x axis stopping at 30]

If you change the max time (x axis) to go to 100 instead of stopping at 30, you’ll see that the new half-life starts dropping after ~35 days if the student fails the quiz:

[Figure: the same plot with the x axis extended to 100]

When I saw the original plot, I was pleased because the two “fail” curves appeared to be converging to 7, the original half-life—this makes sense because way past the half-life, there’s very little probability for recall and we shouldn’t be surprised that the new half-life is the same as the old half-life.

But evidently, the “fail” curves aren’t converging to the old half-life, they’re decreasing!

Why?

Is this because of the final step of the update, fitting a Beta distribution to the true posterior? Loss of precision in gammaln or logsumexp?

Update: it's worse than I feared. If you wait an extremely long time before reviewing (300 days, or ~43 half-lives) and then fail the quiz, the new post-update half-life goes to 0, so you'll be asked to review this fact again almost immediately.

[Figure: the same plot with an extreme x axis]

fasiha commented 6 years ago

I checked a bunch of things to rule out what the problem is not (precision issues, rounding issues, etc.), and this script zooms in on what the problem is:

"""
Starting with a default model, alpha=beta=12 and half-life of 7 (days), simulate
a failed quiz after N days, where N runs from 1 to 300 days.

Then predict the recall probability 7 days after the quiz failure, using exact
math:

Expectation[(ptd) ^ d'],

where d' = 7 / N and where ptd is the recall probability at N days, distributed
according to its posterior (after the Bernoulli likelihood update).

N<<7 means you failed the quiz long before the half-life, which is surprising,
so we expect Ebisu to reduce the half-life from 7 days: therefore, for N<<7, we
expect predicted recall probability to be lower than 0.5.

But for N>>7, we expect you to fail, so after you do fail, the updated memory
model should still predict a recall probability of about 0.5 after 7 days.

This is indeed what we see when we evaluate the above expectation with
quadrature (numerical) integration, at least until we get too far out, around
280 days, where the numerical integration breaks down.

That limit, 280 days, is far beyond what we see in Ebisu's update->predict
plot, where the predicted recall probability starts decreasing at about 30
days.

"""
from scipy.integrate import quad
from scipy.special import beta as fbeta
import numpy as np

import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.ion()

a = 12.
b = 12.
t = 7.
t2s = np.arange(1, 300.1)

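direct = []
for t2 in t2s:
    # Elapsed time in units of the prior's time interval
    d = t2 / t
    # Normalizing constant: the integral of marginalInt over [0, 1]
    marginal = d * (fbeta(a, b) - fbeta(a + d, b))
    # Unnormalized posterior density on the recall probability p at t2, given a failure
    marginalInt = lambda p: p**((a - d) / d) * (1 - p**(1 / d))**(b - 1) * (1 - p)
    # quad returns (value, estimated absolute error); keep both for the error bars
    e = quad(lambda p: marginalInt(p) / marginal * p**(t / t2), 0, 1)
    direct.append(e)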

integralError = np.array(direct)

plt.figure()
plt.errorbar(t2s, integralError[:, 0], yerr=integralError[:, 1], fmt='o')
plt.title(
    'Predicted recall probability a week after failure (prev half-life: 7 days)'
)
plt.ylabel('Recall probability')
plt.xlabel('Days before initial failure')
plt.savefig('errors.png')

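[Figure: errors.png, predicted recall probability a week after failure versus days before the failed quiz, with quadrature error bars]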

As the script's docstring mentions, directly evaluating the predicted recall probability a week after the failed quiz stays accurate out to about 280 days. That is far longer than calling ebisu.updateRecall and then predictRecall, whose prediction starts decreasing after about 30 days.
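As a cross-check on the quadrature, the same expectation can be written in closed form. Raising the recall probability at N days to the power 7/N gives back the recall probability at 7 days, so the quantity is just the posterior mean of the 7-day recall probability: [B(a+1, b) - B(a+1+d, b)] / [B(a, b) - B(a+d, b)] with d = N/7. As d grows, the B(a+d, b) terms vanish and this tends to a/(a+b) = 0.5. The sketch below is only an illustration: the function names are made up here and none of this is part of Ebisu's API.

```python
from math import exp, lgamma


def beta_fn(x, y):
    """Beta function B(x, y), computed via log-gamma."""
    return exp(lgamma(x) + lgamma(y) - lgamma(x + y))


def exact_recall_after_failure(a, b, t, elapsed):
    """Posterior mean of the recall probability at time t, after a failed
    quiz at time `elapsed`, for a Beta(a, b) prior on recall at time t.

    Hypothetical helper for illustration, not an Ebisu function.
    """
    d = elapsed / t
    numerator = beta_fn(a + 1, b) - beta_fn(a + 1 + d, b)
    denominator = beta_fn(a, b) - beta_fn(a + d, b)
    return numerator / denominator


if __name__ == "__main__":
    for days in (1, 7, 30, 100, 300):
        print(days, exact_recall_after_failure(12.0, 12.0, 7.0, days))
```

Because no numerical integration is involved, this form has no quadrature limit to run into at large N.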

The issue is just that, with this extreme under-review, the Beta fit to the posterior doesn't give us good predictions. There are mathy ways to potentially fix this, but I think the simplest and most robust thing is a hack: if you're updating with a failure after more than, say, three to four times the prior's time interval, skip the math and have the updater return a copy of the prior. It's a hack because it doesn't address the math explicitly, but it gets the job done really well, and it makes total sense given what we understand about the problem.
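To see the effect of the Beta fit in isolation, here is a rough sketch that compares the exact prediction with one made from a moment-matched Beta. It assumes the update fits a Beta to the posterior's mean and variance at the quiz time and then predicts from that fit. That is my reading of the update step, not a copy of Ebisu's code, and all the names are hypothetical.

```python
from math import exp, lgamma


def beta_fn(x, y):
    """Beta function B(x, y), computed via log-gamma."""
    return exp(lgamma(x) + lgamma(y) - lgamma(x + y))


def posterior_moment(a, b, d, k):
    """E[q**k], where q = p**d is the recall probability at the quiz time,
    p ~ Beta(a, b) a priori, and the quiz was failed."""
    numerator = beta_fn(a + k * d, b) - beta_fn(a + (k + 1) * d, b)
    return numerator / (beta_fn(a, b) - beta_fn(a + d, b))


def fit_beta(mean, var):
    """Moment-match Beta(alpha, beta) parameters to a mean and variance."""
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def compare_predictions(a, b, t, elapsed, horizon):
    """Return (exact, fitted) predicted recall `horizon` days after a
    failed quiz taken `elapsed` days after the last review."""
    d = elapsed / t
    delta = horizon / elapsed
    exact = posterior_moment(a, b, d, delta)
    mean = posterior_moment(a, b, d, 1)
    var = posterior_moment(a, b, d, 2) - mean**2
    a2, b2 = fit_beta(mean, var)
    # E[q**delta] for q ~ Beta(a2, b2) is B(a2 + delta, b2) / B(a2, b2)
    fitted = exp(lgamma(a2 + delta) + lgamma(a2 + b2)
                 - lgamma(a2) - lgamma(a2 + b2 + delta))
    return exact, fitted


if __name__ == "__main__":
    for days in (1, 7, 30, 100, 300):
        print(days, compare_predictions(12.0, 12.0, 7.0, days, 7.0))
```

Printing the two columns side by side for increasing quiz delays shows where, under this assumption, the fitted prediction parts ways with the exact one.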

Todo: implement this hack.
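A minimal sketch of what such a guard could look like, wrapped around whatever update function is in use. This is not the code that shipped in 1.0: the name guarded_update, the (alpha, beta, t) tuple layout, the update_fn callback and the default threshold of 4 are all assumptions for illustration.

```python
from typing import Callable, Tuple

# Assumed model layout: (alpha, beta, t), where t is the prior's time interval
Model = Tuple[float, float, float]


def guarded_update(prior: Model,
                   success: bool,
                   elapsed: float,
                   update_fn: Callable[[Model, bool, float], Model],
                   max_ratio: float = 4.0) -> Model:
    """Skip the Bayesian update for a failure that arrives very late.

    If the quiz failed and more than `max_ratio` times the prior's time
    interval has elapsed, return a copy of the prior. Otherwise defer to
    `update_fn`, which stands in for the real updater.
    """
    alpha, beta, t = prior
    if not success and elapsed > max_ratio * t:
        return (alpha, beta, t)
    return update_fn(prior, success, elapsed)


if __name__ == "__main__":
    # Stand-in updater that only marks that it was called
    def fake_update(model: Model, ok: bool, when: float) -> Model:
        return (1.0, 1.0, when)

    prior = (12.0, 12.0, 7.0)
    print(guarded_update(prior, False, 100.0, fake_update))  # late failure
    print(guarded_update(prior, False, 10.0, fake_update))   # normal failure
```

Successes are always passed through to the updater, since the problem described in this thread only shows up for failures.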

eshapard commented 5 years ago

Sounds reasonable to me. 👍

fasiha commented 5 years ago

Fixed in 1.0.