A probability distribution on the halflife itself?

fasiha / ebisu

Public-domain Python library for flashcard quiz scheduling using Bayesian statistics. (JavaScript, Java, Dart, and other ports available!)

https://fasiha.github.io/ebisu

The Unlicense

314 stars 32 forks source link

A probability distribution on the halflife itself? #32

Closed ttencate closed 4 years ago

ttencate commented 4 years ago

First off, thank you for reading my ramblings thus far! I'm learning something new every day.

This isn't really an issue with Ebisu, more of a discussion topic, but I'm opening it as an issue so it can be preserved for posterity and easily linked to if this topic comes up again.

My thought was: instead of assuming a beta distribution at a particular time t, could we assume a distribution on the halflife itself? I haven't yet tried to do the math, because maybe you've tried this already and it leads nowhere, so it seemed best to discuss it first.

The advantage would be that predictRecall could simply take the expected value of the halflife distribution and boil down to a very cheap return 0.5**(tnow / h), which can even be done in SQL if needed.

What would such a distribution look like? We need something whose "mass peak" can be shifted around arbitrarily to represent the expected halflife, and whose "width" can be tuned to represent the certainty. We can do both of these things with a normal (Gaussian) distribution, as in h ~ N(mu, sigma), but the problem is that it has a nonzero tail to the left of the y axis. Really our distribution should only exist for t >= 0 and be zero at t = 0.

So how about log(h) ~ N(log(mu), sigma)? Technically it's undefined for h = 0 but it converges to 0 when approached from h > 0, so I can live with that. It seems to have the desired properties:

2020-06-16-101553_1151x538_scrot

This is how far I got. Here are the things I'm concerned about, which an experienced statistician could probably answer easily:

Can we analytically transform this into a distribution on h instead of log(h)?
The expected value of log(h) is obviously log(mu), but the expected value of h is probably not mu, as the tail is skewed to the right. So maybe predictRecall still needs to do some more work to compute the actual 0.5**(tnow / h) from the parameters mu, sigma. Or maybe we can re-parameterize the distribution so that one of the parameters is equal to the expected value.
I have no idea how feasible it is to calculate the posterior in updateRecall given some observation (quiz result) using Bayes' rule. For all I know, the reason people use all these well-known distributions is that they're the only ones that can be dealt with analytically.

ttencate commented 4 years ago

Tried this law to compute E[h] but my integration skills are sorely lacking, and Wolfram Alpha just gives me timeouts. It's of the form integral e^(ax^2 + bx + c) dx.

fasiha commented 4 years ago

Nice work! Explicitly tracking the halflife is what Duolingo's halflife regression does! Read their paper here: https://github.com/duolingo/halflife-regression/blob/master/settles.acl16.pdf and they also have a blog post about it. It needs machine learning to update halflives (and you'll see in that repository's issues where I, along with others, have described serious concerns with their training algorithm to do that).

When I began work on Ebisu I wanted to see if we could do something like halflife regression purely Bayesianly (I talk a bit about this in my blog post about Ebisu—I mention considering the Gamma distribution for halflife), and it turned out that it was easy enough to just put a prior on recall at some time, and then deterministically put that through the exponential to get the prior at any given time.

Now I'm intrigued though: log(h) when h is normally distributed has a well-known distribution: https://en.wikipedia.org/wiki/Log-normal_distribution. I'd have to sit down with some pens and paper to see what the posterior (after a quiz) would look like if you model its Bernoulli odds as the output of exp(-tnow/h), that might turn out to be tractable—and even if not, we should make a note of it somewhere so others know we looked 😊.

A concern here is, would different variances on h lead to different predictRecall? That is, if I was confident in h for one quiz but tentative for another, would the uncertainty change my prediction of recall for the two quizzes? (It does in Ebisu, which is kind of nice—compare predictRecall((5,5, 1), 5.) vs predictRecall((2,2, 1), 5.))

A side note though. I do somewhat disparage this in the Ebisu doc but if we store model halflife, then you could indeed replace predictRecall with exp as you suggested, that's likely a totally reasonable approximation for your true goal, which is to rank the quizzes. You could also do something related: calculate the output of predictRecall over ten logarithmically-spaced times into the future (one hour, 6 hours, one day, 2 days, one week, two weeks, one month, etc.) and then use SQL to interpolate between them.

I'll look more later.

ttencate commented 4 years ago

And here I am, discussing statistics while I could (and should) be working on my quiz app instead... but it's just way too dang interesting :smile:

Thank you for the (another) in-depth reply, and especially the link to your old blog post, which I hadn't seen before. So this ground has indeed been trodden before... although not with a log-normal distribution perhaps? I played around with a gamma distribution just now too; at first I thought we couldn't position and scale it the way we needed, but we can; the usual parameters are just a bit unintuitive to work with.

Also I didn't even know that a log-normal distribution was an official thing, with a Wikipedia page and all. The value of the mean, which I failed to derive, has kindly been provided already (and is simple to compute), and maybe some of the other equations are useful too. A log-normal would be nice because it doesn't need a complicated, expensive and numerically tricky gamma (or gammaln) function.

A concern here is, would different variances on h lead to different predictRecall? That is, if I was confident in h for one quiz but tentative for another, would the uncertainty change my prediction of recall for the two quizzes? (It does in Ebisu, which is kind of nice—compare predictRecall((5,5, 1), 5.) vs predictRecall((2,2, 1), 5.))

I can see how this arises, but it's an inconvenient property because it makes predictRecall more expensive. My intuition says that there might be a probability distribution on the halflife that doesn't have this property, i.e. whose predictRecall would depend only on mu and not on sigma. Of course now I'm hoping that it'll turn out to be the gamma or log-normal... but probably not. Maybe one could work backwards from this requirement and derive what such a distribution would have to look like?

fasiha commented 4 years ago

A comment—Monte Carlo can help you quickly get a sense of how different priors on different variables behave. For example, here's a snippet to simulate a log-normal and shows you what happens for various parameters, tnow, and quiz result:

import numpy as np
from scipy.stats import norm

def weighted(samples, likelihood=[1]):
  sum = np.sum(likelihood)
  mean = np.sum(likelihood * samples) / sum
  var = np.sum(likelihood * (samples - mean)**2) / sum
  return dict(mean=mean, std=np.sqrt(var))

def monteCarlo(priorHalflifeMean,
               priorLogHalflifeStd,
               tnow,
               result,
               size=1_000_000,
               random_state=1):

  # generate some Monte Carlo samples
  logHalflife1 = norm.rvs(
      np.log(priorHalflifeMean), priorLogHalflifeStd, size=size, random_state=random_state)
  halflife1 = np.exp(logHalflife1)  # also Monte Carlo
  prior = 2**(-tnow / halflife1)  # also Monte Carlo! Probability of recall at `tnow`

  # This is the Bernoulli likelihood: a *probability*
  likelihood = prior if result else (1 - prior)

  print('### prior halflife mean={}, prior log-halflife std={}, tnow={}, result={}'.format(
      priorHalflifeMean, priorLogHalflifeStd, tnow, result))
  print('- probability of recall before quiz', np.mean(prior))
  # print('- prior halflife', np.mean(halflife1))
  # print('- posterior halflife moments', weighted(halflife1, likelihood))
  print('- prior LOG-halflife', np.mean(logHalflife1))
  print('- posterior LOG-halflife moments', weighted(logHalflife1, likelihood))

priorHalflifeMean = 1.0
priorLogHalflifeStd = 0.5
monteCarlo(priorHalflifeMean, priorLogHalflifeStd, 1, True)
monteCarlo(priorHalflifeMean, priorLogHalflifeStd, 10., True)
monteCarlo(priorHalflifeMean, priorLogHalflifeStd, 10., False)
monteCarlo(priorHalflifeMean, priorLogHalflifeStd * 3, 10., False)

This prints out:

prior halflife mean=1.0, prior log-halflife std=0.5, tnow=1, result=True

probability of recall before quiz 0.48983648967076804
prior LOG-halflife 0.0003259021540096171
posterior LOG-halflife moments {'mean': 0.16380952656830788, 'std': 0.4637500150864052}
prior halflife mean=1.0, prior log-halflife std=0.5, tnow=10.0, result=True
probability of recall before quiz 0.009212625441209887
prior LOG-halflife 0.0003259021540096171
posterior LOG-halflife moments {'mean': 0.817012768842593, 'std': 0.3720359298225571}
prior halflife mean=1.0, prior log-halflife std=0.5, tnow=10.0, result=False
probability of recall before quiz 0.009212625441209887
prior LOG-halflife 0.0003259021540096171
posterior LOG-halflife moments {'mean': -0.007267886784719726, 'std': 0.4944904379006638}
prior halflife mean=1.0, prior log-halflife std=1.5, tnow=10.0, result=False
probability of recall before quiz 0.09298619984884467
prior LOG-halflife 0.0009777064620288501
posterior LOG-halflife moments {'mean': -0.23152059543537987, 'std': 1.3477729379648018}

Which shows us that the scheme has nice properties, it responds as expected to tnow much higher (and presumably lower) than the expected halflife, that the uncertainty about the halflife changes things in the correct way, etc.

it's an inconvenient property because it makes predictRecall more expensive. My intuition says that there might be a probability distribution on the halflife that doesn't have this property, i.e. whose predictRecall would depend only on mu and not on sigma

Hmm, if this is the goal, I'm not sure if a Bayesian approach would be fruitful. The expected recall probability E[f(h)] = integral[f(h) * p(h), all h], where f is exponential and p is the probability density, will under normal circumstances be a function of the width of the density p(h). Though I haven't proved it, it seems unlikely that you could find a general p(h) that allows the expectation to be the same even as you widen or narrow the density.

You can use the above Monte Carlo snippet to check that indeed as you change the sigma of log(h), you get different recall probability.

However, what you mentioned in your earlier issue,

I was trying a simpler model: exponential decay where right answers multiply the halflife by a hardcoded factor >1, like 1.5, and wrong answers by a different factor <1, like 0.5. But I was unable to find a combination of factors that (subjectively) did a better job than Ebisu

it seems like this should work without bringing in Bayesian machinery. You could store (halflife, # of quizzes) as your model; probability of recall is just 2**(-tnow / halflife); then the update could aggressively scale the halflife if # of quizzes is low, and lightly if high. In this scheme, # of quizzes takes the place that the prior would have had in updating.

If the goal is just to have blazing fast predictRecall, then I've always thought that the way to do this would be to store the output of predictRecall at ten or so future timestamps (either during update, or periodically), and then when you want to find the probability of recall now, just find the nearest timestamp to now and use that one's probability.

If you're concerned about accuracy,

you can use this simple scheme to reduce the number of quizzes to run a full predictRecall on, and
furthermore, I usually don't quiz on the item with the absolute lowest recall, because it kind of leads to predictability—if you review once a day, say, sometimes you can tell what the first quiz is going to be (because it was the most recent one learned), and that's annoying. I often either (a) find the 5% of quizzes with lowest recall probability and pick one at random to quiz or (2) find all quizzes below a threshold (.01%, 0.1%, 1%, etc.) and pick one at random.

Nevertheless, I do think it's worth digging into how to do updateRecall after putting a prior on halflife!

fasiha commented 4 years ago

Edit Oops, I stopped too soon in the derivation of the posterior when you model log-halflife as normal—here's a corrected version: it's still not clear how to analytically find the mean and variance of the posterior on log-halflife in this case, but maybe someone can point out a solution:

lognormal

Markdown/LaTeX for the above

(code to convert the following to the above, in PDF: `pandoc lognormal.md -o lognormal.pdf --pdf-engine=xelatex`) ``` Recall how to transform random variables: if we have a deterministic function $g$ such that $y = g(x)$, and we have a distribution over $P_X(x)$, then we first compute the inverse $$ x = g^{-1}(x) $$ and then we have $$ P_Y(y) = P_X(g^{-1}(y)) \frac{∂}{∂y} g^{-1}(y). $$ Let's start with $$y = g(x) = 2^{-t / \exp(x)}.$$ This happens when $X ∼ Normal(μ, σ^2)$ is log-halflife, then $\exp(x)$ is halflife, and $y$ is probability of recall at time $t$. Then, the inverse $$ x = \log\left( \frac{-t}{\log_2 y} \right) = g^{-1}(y) $$ and surprisingly, via Wolfram Alpha, $$ \frac{∂}{∂y} g^{-1}(y) = \frac{-1}{y \log y}. $$ The transformation gives us the distribution of the probability of recall at time $t$: $$ P_Y(y) ∝ \exp \left( -\frac{1}{2σ^2} \left[ -μ + \log\frac{-t}{\log_2 y} \right]^2 \right) \frac{1}{y \log y} $$ where we used the normal density for $P_X(x)$. Let's consider updating this prior with the successful Bernoulli quiz's likelihood to give us the posterior: $$ P_{Y,post1}(y) \propto y P_Y(y). $$ The equivalent for the negative quiz would be $P_{Y,post0} \propto (1-y)P_Y(y)$. Just as we transformed the original distribution on log-halflives $x$ to recall probability $y$ through $y=g(x)$, we can take the posterior on recall probability $y$ back to a posterior on log-halflife $x$. Given that $$ \frac{∂}{∂x} 2^{-t/\exp(x)} = 2^{-t/\exp(x)} \log\left( 2^{+t/\exp(x)} \right), $$ we have the posterior on log-halflife $$P_{X,post1}(x) ∝ P_{Y,post1}(2^{-t/\exp(x)}) ⋅ \frac{∂}{∂x} 2^{-t/\exp(x)}$$ which after some simplification becomes $$ P_{X,post1}(x) ∝ 2^{-t/\exp(x)} \exp\left( -\frac{(x-\mu)^2}{2σ^2} \right). $$ Finding the mean and variance of this distribution analytically appears to be difficult. ```

It's probably worth checking with a few other priors on the halflife or log-halflife, to see if the posterior's form is less fearsome.

fasiha commented 4 years ago

Oops, my previous post was incomplete, I updated it to be more correct.

ttencate commented 4 years ago

Thank you for looking into this! Pity it didn't work out as nicely as I hoped. Exponentials-in-exponentials... that looks daunting indeed, although I was never much good at integration.

The expected recall probability E[f(h)] = integral[f(h) * p(h), all h], where f is exponential and p is the probability density, will under normal circumstances be a function of the width of the density p(h). Though I haven't proved it, it seems unlikely that you could find a general p(h) that allows the expectation to be the same even as you widen or narrow the density.

I'm not at all convinced that this is impossible. In fact I feel that the overall shape of p(h) might be completely determined by this restriction. The degenerate case where the distribution has zero width already meets the requirement, so all we need to do is widen that to be not infinitisemally small :wink: If I have some time, I might take a shot at it.

fasiha commented 4 years ago

Good news! I was able to simplify aggressively and found that with a Gamma prior on halflife directly, the posterior turns out to have tractable moments, and can be fit back to a Gamma! It uses the modified Bessel function of the 2nd kind, which thank goodness Scipy provides and will be fun to find JS/Java/Dart versions of 😇.

I'm putting the finishing touches on the negative boolean update and will post an update. I haven't looked at how noisy-binary and binomial quizzes will work with it, though I'm tentatively confident they should be tractable too.

There's an elegant way to reframe the posterior when you're working with halflife and log-halflife, which could help us evaluate other priors (I tried log-normal, logistic, chi-square, and Gamma distributions).

ttencate commented 4 years ago

That's good news indeed!

... turns out to have tractable moments, and can be fit back to a Gamma!

Does that mean that the result is actually only an approximation of the posterior, and not the real thing? In fact I've been wondering the same about the original Ebisu algorithm. Not that it really matters, as the whole Ebbinghaus curve is already a very simplified approximation anyway :smile:

It uses the modified Bessel function of the 2nd kind, which thank goodness Scipy provides and will be fun to find JS/Java/Dart versions of :innocent:.

There are beta, gamma and Bessel of the first kind functions in Apache Commons, if that helps. In fact their gammaln was initially the basis for my port... but it was pretty complicated and didn't pass one of the unit tests. Whether that was an issue with that implementation, with yours, or with my port, I daren't say; I ended up just porting your much simpler version.

So what are your thoughts on this model compared to the original beta distribution?

fasiha commented 4 years ago

Does that mean that the result is actually only an approximation of the posterior, and not the real thing? In fact I've been wondering the same about the original Ebisu algorithm

Correct, in general none of these these posteriors are conjugate, meaning they're some different distribution than the prior after the prior goes through the binary update. For both Beta-on-recall and Gamma-on-halflife, we have an analytical expression for the posterior, but in order to return a new model that's still Beta or Gamma, we have to moment-match, which is the one and only approximation.

(The one exception was, with Beta-on-recall, the time-traveled prior (a GB1 distribution) is conjugate with the binary update if the quiz was successful, but not otherwise. So not really super-useful.)

fasiha commented 4 years ago

Edit corrected the code below after initially posting.

I think we may have to give up on the Gamma prior on halflife. While my attention was distracted making sure that updates after binary quizzes are doable, I failed to notice that predictRecall is actually more complicated than the Beta-on-probability prior 😢!

To see this, consider—

import numpy as np
from scipy.stats import gamma as gammarv
from scipy.special import gamma
from scipy.special import kv

a = 2.
b = 1.5
t = (a / b)  # mean halflife!

halflifes = gammarv.rvs(a, scale=1 / b, size=1_000_000)
prob = np.exp(-t / halflifes)

predictRecall = lambda a, b, t: 2 * (t * b)**(a / 2) * kv(a, 2 * np.sqrt(t * b)) / gamma(a)
predicted = predictRecall(a, b, t)

print('predictRecall @ time={}: analytical={}, Monte Carlo={}'.format(t, predicted, np.mean(prob)))
# predictRecall @ time=1.3333333333333333: analytical=0.3092345700088991, Monte Carlo=0.30936420184691676

(I use exp(-time / halflife) instead of 2**(-time / halflife) because the math was more straightforward. So although I refer to it as "halflife" in the code snippet, it's actually just a "time constant". But this doesn't change anything. Also, I use the alpha/beta parameterization of the Gamma distribution.)

Anyway, the point is, it's not analytically straightforward to calculate what the predicted recall is for a given t, nor which t gives rise to a predicted recall of, say, 0.5. t equal to the mean a/b or the mode (a-1)/b don't result in any special predicted recall. This is a case of Jensen's inequality, which I allude to in the README—basically, E[f(x)] ≥ f(E[x]) when f is convex, with equality only when f is linear. In our case, x is the halflife and f(x) = exp(-t / x), which is convex, so predictRecall, i.e., E[f(x)], which we want, will be greater than the straightforward application of the Ebbinghaus exponential decay function on the mean halflife. Which is bad because usually we don't know how big this difference is.

Actually, it's helpful to see this visually:

ts = np.linspace(0, 10)
import matplotlib.pylab as plt
plt.ion()
plt.figure()
plt.plot(ts, np.exp(-ts / (a / b)), label='approx', linewidth=4)
plt.plot(ts, np.vectorize(lambda t: predictRecall(a, b, t))(ts), label='pred1', linewidth=3)
plt.plot(ts, np.vectorize(lambda t: predictRecall(a * 3, b * 3, t))(ts), label='pred2', linewidth=2)
plt.plot(ts, np.vectorize(lambda t: predictRecall(a / 3, b / 3, t))(ts), label='pred3', linewidth=1)
plt.legend()
plt.xlabel('time')
plt.ylabel('recall probability')
plt.savefig('gamma.png', dpi=300)

produces the following: gamma and while this disproves what I said about Jensen's inequality and the convexity of exp(), I think it's helpful to see the range of values that predictRecall can output for Gamma distributions with the same mean (all three "pred1"–"pred3" have the same mean).

So since the Gamma-on-halflife doesn't really buy us anything over the Beta-on-probability (other than possibly simplifying the update step by getting rid of rebalancing), and might make it more complicated to do noisy-binary and binomial quizzes, I'm going to try and focus on quantifying the risk of just replacing predictRecall in the current Ebisu case with the approximation 2**(-t / halflife) or equivalently t / halflife (since 2**(x) is monotonic in x, so both will sort the same set of models the same way). halflife would be computed in the previous update, per #31.

If we can show that that approximation is reasonable, then quiz apps can benefit from having a super-fast alternative to predictRecall.

What do you think?

ttencate commented 4 years ago

I use exp(-time / halflife) instead of 2**(-time / halflife) because the math was more straightforward.

Good call. Maybe we should call it "elflife"? It's fitting, because elves do have a longer lifespan than halflings ;)

Thinking about Jensen's inequality: this seems to dash my hopes of finding any distribution d(h, alpha) on the elflife, with parameters h (elflife) and alpha (uncertainty), satisfying E(d(h, alpha)) = h, and such that predictRecall is a simple exponential, in other words

E[exp(-t / d(h, alpha)] = exp(-t / h)

because this is essentially (setting f(h) = exp(-t / h)):

E[f(d(h, alpha))] = f(h)

but Jensen gives, for nonlinear f,

E[f(d(h, alpha))] > f(E[d(h, alpha)]) = f(h)

meaning it can never be equal. That seems to be pretty much what you said above as well.

I also thought, at first glance, that your plot does seem to contradict that conclusion: approx which represents f(E[x]) goes both above and below the "real" value pred1 which represents E[f(x)]. But that's not what Jensen's equation says: it's only about the expected value of these distributions, not about the distributions themselves. And the expected value of approx is clearly lower (more to the left) than that of pred1.

But maybe we can quantify the effect that changes to alpha / beta have on the shape of this curve? I was hoping it might turn out to be something simple, but a first experiment doesn't seem encouraging. Building upon your code, I plotted the quotient of approxRecall over predictRecall, the latter being our simple exponential:

def approxRecall(a, b, t):
    h = a / b
    return np.exp(-t / h)
ts = np.linspace(0, 5, 100)
plt.ion()
plt.figure(figsize=(20, 10))
for i in (0.1, 0.3, 1, 3, 10, 30):
    #plt.plot(ts, np.vectorize(lambda t: predictRecall(a * i, b * i, t))(ts), label='pred%.1f' % i, linewidth=4)
    #plt.plot(ts, np.vectorize(lambda t: approxRecall(a * i, b * i, t))(ts), label='approx%.1f' % i, linewidth=2)
    plt.plot(ts, np.vectorize(lambda t: approxRecall(a * i, b * i, t) / predictRecall(a * i, b * i, t))(ts), label='error%.1f' % i, linewidth=2)
plt.legend()
plt.xlabel('time')
plt.ylabel('error in approx recall probability')
plt.savefig('error.png', dpi=96)

error

Those don't look like any basic function that I know of. I thought that they would at least agree at t = e but apparently not! Or would that be a numerical issue?

Edit: inverted the graph because it makes more sense this way.

fasiha commented 4 years ago

I also thought, at first glance, that your plot does seem to contradict that conclusion: approx which represents f(E[x]) goes both above and below the "real" value pred1 which represents E[f(x)]. But that's not what Jensen's equation says: it's only about the expected value of these distributions, not about the distributions themselves. And the expected value of approx is clearly lower (more to the left) than that of pred1.

I'm not following—as you say, the biggest blue "approx" curve represents f(E[x]), i.e., exp(-t / E[x]) whereas the other curves represent E[f(x_i)] = E[exp(-t / x_i)] for various x_i. The latter seem to be less than the former on the left part of the plot but greater than the former on the right part of the plot, which seems like it contradicts Jensen, since exp is strictly convex right?

We can show the same thing with the Monte Carlo samples—after running my first code snippet, I can do the following in IPython:

In [7]: jensen = lambda t: [np.mean(np.exp(-t / halflifes)), np.exp(-t / np.mean(halflifes))]

In [8]: jensen(0.5)
Out[8]: [0.5855561840096577, 0.6874954822547535]

In [9]: jensen(10)
Out[9]: [0.007355087627151486, 0.0005564126211527937]

which show [ E[f(x)], f(E[x]) ] for two ts. I'm confused but am sure it's something very stupid I'm doing…

Those don't look like any basic function that I know of.

We know analytically what those are though, because we know analytically what E[f(x)] truly is (the scary predictRecall expression with kv Bessel function) 😆.

I'll write up the math for predictRecall and updateRecall for Gamma halflife (or time constant), it's not too hard since the heavy lifting is done via Wolfram Alpha symbolic integration—it could help someone investigate other possible choices for halflife (Rice distribution? Weibull?).

But as I mentioned earlier, since we don't have a predictRecall that's simpler than the Beta-on-recall case, I think we're going to be better-served by just evaluating #31, and people who want fast predictRecall over lots of models should use the approximation -tnow / halflife, where tnow = (now - lastReviewed) and halflife comes from rebalancing the model.

fasiha commented 4 years ago

Errata in the below, the following corrections have to be made (I don't want to edit the LaTeX and upload a new image, forgive me)

I accidentally use a lowercase n in h^n below, it should be h^N.
I claim the moments follow "via straightforward application of the expectation above", but actually you need a little bit more Wolfram Alpha to get these results. In the first expectation, we have (t * b)^(alpha/2), whereas in the posterior moments, we need to split up t and b. If there's confusion about this step, let me know and I'll update the details.

Original:

Here's a quick writeup of the math for Gamma-on-time-constant:

gamma

See the Markdown+LaTeX source

```markdown --- geometry: "left=3cm,right=3cm,top=2cm,bottom=2cm" --- Suppose we have a flashcard with time constant (in arbitrary units) $h ∼ Γ(α, β)$, that is, drawn from a Gamma distribution with known parameters $α$ and $β$ (we use "h" here to allude to "halflife" though that requires the exponent to be 2, not $e$ as we have below); and $t$ the time since last review, in units consistent with $h$. Then the expected recall probability of this flaschard at time $t$ is $$ E[\exp(-t/h)] = \frac{β^α}{Γ(α)} \int_0^∞ \exp(-t/h) ⋅ h^{α - 1} \exp(-β h) \, dh = 2 (t β)^{α/2} K_α(2 \sqrt{t β}) / Γ(α), $$ where we use $E[f(h)] = \int_0^∞ f(h) P(h) \, dh$ (since $h$ positive), and where, apologies, $Γ$ is reused as the gamma function (the generalization of the factorial to the reals), and where $K_α$ denotes the modified Bessel function of the second kind of order $α$. I found this via Wolfram Alpha: `Integrate[exp(-3 / h) * h^(a - 1) * exp(-b * h) * b^a / Gamma[a], h, 0, inf]` (it doesn't work if you leave it as `t`, but does work when you use `3=t`). Next, after a quiz $x ∼ Bernoulli(\exp(-t/h))$, i.e., a coin toss, we can ask for the posterior on the halflife $h$ in light of the quiz $x$: $$ P(h | x) = \frac{P(x | h) P(h)}{\int_0^∞ P(x | h) P(h) \, dh} $$ where the denominator is simply a normalizing constant equal to the integral of the numerator (so the resulting posterior has integral 1); the numerator is the likelihood $P(x | h)$ and the prior $P(h)$. We can break this up into the two cases of $x$: $$ P(h|x=1) = \frac{ \exp(-t/h) ⋅ h^{α - 1} \exp(-β h) }{\int_0^∞ \exp(-t/h) ⋅ h^{α - 1} \exp(-β h) \, dh}. $$ The $N$th moment of this distribution, i.e., $E[(h|x=1)^N] = \int_0^∞ h^n P(h|x=1) \, dh$ turns out to be, via straightforward application of the expectation above, equal to $$ m_{N, x=1} = \frac{ t^{(α + N)/2} β^{-(a+N)/2} K_{α + N}(2 \sqrt{t b}) }{ t^{α/2} β^{-a/2} K_{α}(2 \sqrt{t b}) } $$ The first moment $m_1$ is the posterior *mean*, and the posterior variance is simply $\sigma^2 = m_2 - m_1^2$. If you moment-match these to a new Gamma distribution, your new $(α', β') = (m_1^2 / σ^2, m_1 / σ^2)$. For lightness, I've omitted the $x=1$ subscript here. When $x=0$, i.e., a quiz failure, the likelihood is $P(x=0|h) = (1-\exp(-t/h))$. Repeating the exercise above, if we haven't flubbed the bookkeeping, the moments of this posterior are $$ m_{N,x=0} = \frac{ m'_N Γ(α) β^{-α} - t^{(α + N)/2} β^{-(a+N)/2} K_{α + N}(2 \sqrt{t b}) }{ Γ(α) β^{-α} - t^{α/2} β^{-a/2} K_{α}(2 \sqrt{t b}) } $$ where $m'_N$ are the moments of the original Gamma distribution: via Wikipedia etc., $m'_1 = α / β$ and $m'_2 = α / β^2 + {m'_1}^2$. In this numerator and denominator, you can see the numerator and denominator of the previous $m_{N,x=1}$ case, and we may be able to simplify further. As before, one can moment-match a new Gamma to these. ``` I rendered the above via `pandoc gamma.md -o gamma.pdf --pdf-engine=xelatex`.

I implemented this and verified it via numerical integration in https://github.com/fasiha/ebisu/blob/889588a5774a24a0820b46517c22ea2fbfa1ca07/gam.py.

ttencate commented 4 years ago

Nice work! But I agree that it seems to have no advantage over the beta-on-recall model with rebalancing. Still, it was interesting and educational. Thank you for investigating!

jwtnb commented 3 years ago

hi, thank you very much for the amazing library. @fasiha

If the goal is just to have blazing fast predictRecall, then I've always thought that the way to do this would be to store the output of predictRecall at ten or so future timestamps (either during update, or periodically), and then when you want to find the probability of recall now, just find the nearest timestamp to now and use that one's probability.

If you're concerned about accuracy,

you can use this simple scheme to reduce the number of quizzes to run a full predictRecall on, and

furthermore, I usually don't quiz on the item with the absolute lowest recall, because it kind of leads to predictability—if you review once a day, say, sometimes you can tell what the first quiz is going to be (because it was the most recent one learned), and that's annoying. I often either (a) find the 5% of quizzes with lowest recall probability and pick one at random to quiz or (2) find all quizzes below a threshold (.01%, 0.1%, 1%, etc.) and pick one at random.

Nevertheless, I do think it's worth digging into how to do updateRecall after putting a prior on halflife!

Could you please explain a little bit more on the technique of calculating 10 predict recalls at different time intervals (are they fixed timestamps for all terms like 1/1/21, 2/1/21 etc or sort of review_time + delta(t=1 day, 1 week etc)? ) and then when we need to select the terms to study, how should we. go about it?

From my understanding of your reply, for each term after a review, we will update the term model with predictRecall of 10 different time intervals (review_time + delta(t)) at T0 we want to find which terms to review, we would look sort terms by (predict_recall_timestamp, probability) say lower 10% and pick a random term to review?

Thanks

fasiha commented 3 years ago

@jwtnb thanks for the questions!

First thing I should say is, as I mention in #53, I'm working on a rewrite of Ebisu and it's going to embrace @ttencate's idea in this issue: the Bayesian prior is on halflife, and even though it's an approximation, predictRecall is going to be a very simple expression: -(now - lastSeen) / halflife (that's the log-probability-of-recall; take the exp() of that to get the actual recall probability). You can do this in SQL 😄 !

The really simple scheme I proposed,

store the output of predictRecall at ten or so future timestamps (either during update, or periodically), and then when you want to find the probability of recall now, just find the nearest timestamp to now and use that one's probability

works just like you describe. The idea is, you run predictRecall

six hours in the future after last quiz,
24 hours in the future after last quiz,
4 days in the future after last quiz,
16 days in the future after last quiz,
etc.,

or some similar exponentially-increasing list of times and store that in a database. Then at some point in the future, you look for the two (maybe one) time slots that bracket the current time, and do a linear interpolation. … That is probably very confusing, sorry, hopefully this code snippet is more clear 😅:

import ebisu

# Courtesy of https://stackoverflow.com/a/62314804
def lerp(x1: float, x2: float, y1: float, y2: float, x: float):
  """Perform linear interpolation for x between (x1,y1) and (x2,y2) """
  return ((y2 - y1) * x + x2 * y1 - x1 * y2) / (x2 - x1)

# at quiz time, put these in the database
halflife = 1.0
model = (3., 3., halflife)
# equivalent to 4**numpy.arange(-1, 3.6, 0.5)
futures = [0] + [4**(-1 + i * 0.5) for i in range(10)]
pRecalls = [ebisu.predictRecall(model, future) for future in futures]

# now let's figure out the lerp step using these lists
def predictRecallCached(model, tnow: float, futures: list[float], pRecalls: list[float]) -> float:
  """Try to linearly interpolate `predictRecall`"""
  rightidx = next(filter(lambda idx: futures[idx] > tnow, range(len(futures))), None)
  if rightidx is not None:
    # tnow <= futures[-1], we can do linear interpolation!
    leftidx = rightidx - 1
    # We know leftidx >= 0 because `futures[0]==0`!
    return lerp(futures[leftidx], futures[rightidx], pRecalls[leftidx], pRecalls[rightidx], tnow)
  else:
    # tnow > futures[-1]
    return ebisu.predictRecall(model, tnow)

# some time later
tnow = 50 # time since last quiz
pRecall = predictRecallCached(model, tnow, futures, pRecalls)
print(f'pRecall = {pRecall:0.2g}')

Hopefully that makes sense! You can tweak how many samples of predictRecall you want to cache, maybe you don't even bother with lerp, you just pick the closest one, etc.

(Ebisu v3 that I mentioned above is going to remove all this headache because everything will be in terms of halflife.)

Regarding the other tweak I mentioned, you asked:

we would look sort terms by (predict_recall_timestamp, probability) say lower 10% and pick a random term to review

this is an app-level decision beyond Ebisu but yes, I think it's better to (slightly) randomize quizzes with low probability of recall. So you can

do a partial sort, find the five quizzes, or 5% of quizzes, with the lowest predicted recall and then pick one randomly. But,
suppose your two weakest cards have recall 0.01 vs 0.5, it might be better to ask the 0.01 one first. So maybe you can first bucket cards (cards with recall < 0.001, < 0.01, < 0.1, etc.), and then from the first bucket that has any cards, sort them and randomly pick from among the weakest cards in that bucket? Let me know if this is confusing, I can sketch it in Python too.

But again, these issues have more to do with your quiz app, Ebisu doesn't care 😄.

fasiha / ebisu

A probability distribution on the halflife itself? #32

prior halflife mean=1.0, prior log-halflife std=0.5, tnow=1, result=True

prior halflife mean=1.0, prior log-halflife std=0.5, tnow=10.0, result=True

prior halflife mean=1.0, prior log-halflife std=0.5, tnow=10.0, result=False

prior halflife mean=1.0, prior log-halflife std=1.5, tnow=10.0, result=False