fasiha / ebisu

Public-domain Python library for flashcard quiz scheduling using Bayesian statistics. (JavaScript, Java, Dart, and other ports available!)
https://fasiha.github.io/ebisu
The Unlicense
313 stars 32 forks source link

Simplification of update #5

Closed rkern closed 5 years ago

rkern commented 5 years ago

Playing with your formulae, I noticed that the update algorithm could be simplified significantly, at least in the positive recall case. First, I noticed that time-shifting the Beta distribution by delta gives a Generalized Beta Distribution of the First Kind (with a=1/delta; b=1 in the terms of that page). Unfortunately, the literature seems to be sparse on that particular formulation (or I'm not using the right search terms), but the formula for moments provided on the page is quite useful. It does look like the GB1 distribution is conjugate for a Bernoulli trial, with an easy analytical formula for the positive-recall update and at least a good numerical moment-matching fit for the negative-recall case.

I've made a notebook over here to explore the possibilities.

Notice that when you multiply the likelihood (p) to the prior, you can fold it into the term (p^((alpha-delta)/delta)*p = p^((alpha+delta-delta)/delta) which is formally the same as if you had alpha+delta instead of alpha. So at least for the positive recall case, you can leave the time part of the model alone and just increment alpha += delta and leave the time part of the model alone. Of course, when delta==1.0, where our prior is a pure Beta distribution, that collapses to the well-known Bayesian update for a Beta distribution and a Bernoulli trial. You can also work through the moment formula given on the Wikipedia page to verify that it reproduces your moments when you use alpha + delta in the Wikipedia formula (basically, the nth moment is gamma[n+1]/gamma[1], in your shorthand). Basically, when you evolve the posterior distribution back to the original time constant, you get back another pure Beta distribution. Qualitatively, this update makes some sense. When delta is low, the prior is more heavily weighted towards 1, so a positive recall is less surprising and therefore updates less. When delta is high, the prior is more heavily weighted towards 0, so a positive recall is more surprising and updates more.

The negative recall case is more tricky, and I haven't made any headway on getting an analytical form. However, it is relatively easy to match the moments using numerical optimization by adjusting both alpha and beta in the GB1 distribution. Again, like the positive-recall update, the time constant is left alone. I do wonder, however, if maybe adjusting beta and delta (e.g. by way of adjusting the time constant) would lead to a concise analytical form. That would make a better formal correspondence to the negative-recall update in the pure Beta case.

If the time constant (i.e. the time at which the prior is a pure Beta) is kept fixed, that might mean that when the actual half-life has increased to many times the original value, you might get numerical instabilities. It might be a good idea to always move out the time constant to the estimated half-life. One scheme could be to do the GB1 update where the time constant remains fixed, then project that distribution out to the estimated half-life, find the moments there, and then moment-match a pure Beta distribution there.

What originally led me to investigate this is when I was considering what would happen if I happened to review twice very quickly (my Anki history has some of these due to Anki's lapse algorithm). With Ebisu's current scheme of using a time constant corresponding to the last review interval, that created large numerical problems in my tests. The pure Beta approximation for such a small delta is likely not great. t would be very tiny, so a normal review after that would project that not-great approximation out to quite a high delta. I think that using the GB1 update, either alone or with my suggestion of telescoping out the time constants to match the estimated half-lives, would probably alleviate some of those issues.

Thank you for your attention and an interesting, well-thought-out algorithm.

fasiha commented 5 years ago

Wow, thank you so much Robert, this is gold! I'm awestruck that you recognized the pdf of the time-decayed prior as this GB1 random variable's pdf!

This is superb—I have often wished we could fix a Beta posterior at some time other than the time since last quiz, and now we have that!, and what's more with a very elegant posterior update for the success quizzes, and the possibility of an analytical posterior for failed quizzes too: your hint about optimizing beta and delta for failed quizzes was inspired, because (for a failed quiz) if you fix alpha1 to be alpha0 and initialize the search for beta1 to beta + 1 and delta1 to delta, you see that in high-alpha or high-beta situations, the optimization finds GB1 random variables very close to the pure-Beta cases. In low-alpha and low-beta cases, studying the tables of [beta, delta] results shows tantalizing patterns that I hope we can analytically track down.

I'm hopeful that we can let the time parameter of the prior can remain the original half-life even after many successful reviews (when the halflife becomes very large), because (1) predicting the recall probability and (2) updating upon success ought to be very stable. Just we'll have to confirm that (3) updating upon failure remains stable when over- and under-reviewing long-half-lived models (and short-half-lived ones too).

I will certainly look at this more! Thank you so much for taking the time to read through the derivation, and thank you so much more for finding these wonderful simplifications and for thinking about them so deeply and describing them so clearly. I am beyond grateful.


Messy code below:

def match_moments(m1, m2, prior, tnow, obs, algo='orig'):
    alpha0, beta0, t0 = prior
    delta = tnow / t0
    ab0 = np.array([alpha0, beta0])
    if obs:
        ab0[0] += delta
    else:
        # from looking at results
        ab0[1] += 1

    if algo == 'orig':
        def f(ab):
            alpha, beta = ab
            mom1, mom2, _ = genbeta1_moments(alpha, beta, delta)
            return np.array([mom1, mom2]) - np.array([m1, m2])
        return optimize.least_squares(f, ab0, bounds=((1.1, 1.1), (np.inf, np.inf)))
    if algo=='bd': # estimate beta, delta
        ab0 = np.append(ab0, delta)
        def f(x):
            beta, d = x
            mom1, mom2, _ = genbeta1_moments(ab0[0], beta, d)
            return np.array([mom1, mom2]) - np.array([m1, m2])
        ret = optimize.least_squares(f, [ab0[1], ab0[2]] , bounds=((1.1, 0), (np.inf, np.inf)))
        ret['origx'] = ret['x'].copy()
        ret['x'] = np.array([ab0[0], ret['x'][0], ret['x'][1]])
        return ret

rows = []
for alpha0 in [2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 40.0, 100.0, 200.0, 400.0]:
    for beta0 in [2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 40.0, 100.0, 200.0, 400.0]:
        for delta in [0.1, 0.2, 0.5, 0.9, 1.1, 1.5, 2.0, 3.0, 4.0, 5.0, 10.0]:
            prior = (alpha0, beta0, 1.0)
            m1, m2, _ = posterior_moments(prior, delta, False)
            res = match_moments(m1, m2, prior, delta, False)
            alpha1, beta1 = res.x
            res2 = match_moments(m1, m2, prior, delta, False, algo='bd')
            alpha2, beta2, delta2 = res2.x
            gb1_m1, gb1_m2, _ = genbeta1_moments(alpha1, beta1, delta)
            rows.append([alpha0, beta0, delta, m1, m2, alpha1, beta1, gb1_m1, gb1_m2, res.cost, alpha2, beta2, delta2, res2.cost])
df = pandas.DataFrame(data=rows, columns=['alpha0', 'beta0', 'delta', 'm1', 'm2', 'alpha1', 'beta1', 'gb1_m1', 'gb1_m2', 'cost', 'alpha2', 'beta2', 'delta2', 'cost2'])

print(df.cost.max())
print(df.cost2.max())

print(df[['alpha0','beta0','delta','alpha2','beta2','delta2','cost2']].to_string())

This prints out the entire table of original and final parameters, and the patterns seem very interesting.

For example,

      alpha0  beta0  delta  alpha2       beta2     delta2         cost2
176      3.0   40.0    0.1     3.0   40.201294   0.103797  8.567468e-24
177      3.0   40.0    0.2     3.0   40.553474   0.205814  3.520064e-25
178      3.0   40.0    0.5     3.0   41.059491   0.505768  1.068200e-21
179      3.0   40.0    0.9     3.0   41.052325   0.901126  2.343093e-19
180      3.0   40.0    1.1     3.0   40.939938   1.098986  2.966735e-17
181      3.0   40.0    1.5     3.0   40.683969   1.496215  1.516459e-23
182      3.0   40.0    2.0     3.0   40.425251   1.995119  2.551649e-21
183      3.0   40.0    3.0     3.0   40.706871   2.977962  6.524657e-13
184      3.0   40.0    4.0     3.0   41.000000   4.000000  3.001290e-11
185      3.0   40.0    5.0     3.0   41.000000   5.000000  1.011160e-12
186      3.0   40.0   10.0     3.0   41.000000  10.000000  3.097877e-19

Note how the beta2 column, for the final rows, is very close to the beta0 column plus one, and the delta2 column is very close to the delta column—matching the pure-Beta case of a failed review.

For the high-alpha low-beta case:

      alpha0  beta0  delta  alpha2       beta2     delta2         cost2
550     20.0    2.0    0.1    20.0    3.002139   0.102049  3.092525e-24
551     20.0    2.0    0.2    20.0    3.002060   0.203611  1.493193e-22
552     20.0    2.0    0.5    20.0    3.001568   0.505502  5.174227e-16
553     20.0    2.0    0.9    20.0    3.000381   0.901917  8.476947e-22
554     20.0    2.0    1.1    20.0    2.999588   1.097693  1.582252e-19
555     20.0    2.0    1.5    20.0    2.997659   1.484749  1.068512e-23
556     20.0    2.0    2.0    20.0    2.994698   1.960808  9.316126e-22
557     20.0    2.0    3.0    20.0    2.987389   2.890322  1.623160e-19
558     20.0    2.0    4.0    20.0    2.978818   3.794316  3.779063e-18
559     20.0    2.0    5.0    20.0    2.969484   4.677117  2.594782e-17
560     20.0    2.0   10.0    20.0    2.920202   8.867487  7.869726e-16

Again, beta2 column is very close to beta+1. The delta2 column is lower than the original delta column. Hmm!

fasiha commented 5 years ago

Also, I'm very glad that my implementation of ebisu.alternate.predictRecallMedian was terrible, because it led you to investigating this and finding the GB1 connection!, but I improved it, per #6. It will work for your test case.

fasiha commented 5 years ago

For clarification, I fixed a problem just in predictRecallMedian so that first plot in your notebook will now work. I haven't added all the GB1 magic.

rkern commented 5 years ago

Ah, great! My first attempt at fitting the beta-deltas didn't fit out as well in the worst case scenarios, but this seems to work well.

I pushed a new version of my notebook. Not much more than what we have already, but I did explore the idea of telescoping out to follow the half-life. In the meantime, I made a more robust half-life estimation routine that doesn't require you to specify a min/max times; it will scan for an appropriate bracketing interval and should Just Work.

Shifting the time constant out to the half-life and approximating out there with a pure Beta distribution has some appeal. The distribution out at the half-life is pretty close to a pure Beta distribution in most cases, only starting to diverge noticeably (but not significantly, in a practical sense) in the case of beta >> alpha. A nice feature of the HalfLifeBeta scheme is that at the half-life, the mean is 0.5, which implies that alpha==beta in the approximate pure Beta distribution there. You would only need to track two parameters: alpha and the time constant. At the half-life, max(alpha, beta) of a pure-Beta approximation is minimized, which means that numerical blow-ups will probably happen less. Nonetheless, I do agree with your claim that if one keeps the time constant fixed (-ish in the case of the beta-delta moment-matching fit) near the original time constant, we probably won't have many numerical problems even predicting out far past it, now that predictRecallMedian is fixed.

fasiha commented 5 years ago

I sought a more detailed look at how the two schemes we're weighing compared in over/under-review and pass/fail quizzes, and found that they agreed quite well. (Code dump at bottom.)

(I also found a negative result. Given a GB1 distribution, I tried searching for another GB1 distribution with the same moments but with a different a (time constant) parameter and where p=q (alpha=beta), as a way to maybe jointly estimate the halflife and the shape of the distribution there, and it works under benign cases but produces some very strange results in extreme cases. I include it here in case anyone else thinks it might be a good idea (or if they can find a flaw in my reasoning).)

I'm satisfied that staying at (roughly) the same time-constant is close enough in accuracy and behavior compared to telescoping to the new halflife.

I was hoping the below exercise, testing behavior at extreme cases, would suggest which of the two approaches are better, but they're both very good! I'm hoping to roll this into the library and push it in the next couple of days! The bulk of the work will be rewriting the math section 😂!

# -*- coding: utf-8 -*-
import numpy as np
import scipy.special as special
import scipy.optimize as optimize
import ebisu
from typing import Tuple

def failureMoments(model: Tuple[float, float, float], result: bool, tnow: float, num: int = 4):
  """Moments of the posterior on recall at time tnow upon quiz failure"""
  a, b, t0 = model
  t = tnow / t0
  from scipy.special import gammaln, logsumexp
  from numpy import log, exp
  s = [gammaln(a + n * t) - gammaln(a + b + n * t) for n in range(num + 2)]
  logt = log(t)
  marginal = logsumexp([s[0], s[1]], b=[1, -1])
  return [exp(logsumexp([s[n], s[n + 1]], b=[1, -1]) - marginal) for n in range(1, num + 1)]

def gb1Moments(a: float, b: float, p: float, q: float, num: int = 2):
  """Raw moments of GB1 via Wikipedia"""
  from scipy.special import betaln
  bpq = betaln(p, q)
  logb = np.log(b)
  return [np.exp(h * logb + betaln(p + h / a, q) - bpq) for h in np.arange(1.0, num + 1)]

def gb1ToBeta(gb1: Tuple[float, float, float, float, float]):
  """Convert a GB1 model (four GB1 parameters, and time) to a Beta model"""
  return (gb1[2], gb1[3], gb1[4] * gb1[0])

def updateViaGb1(prior: Tuple[float, float, float], result: bool, tnow: float):
  """Alternate to ebisu.updateRecall that returns several posterior Beta models"""
  (alpha, beta, t) = prior
  delta = tnow / t
  if result:
    gb1 = (1.0 / delta, 1.0, alpha + delta, beta, tnow)
  else:
    mom = np.array(failureMoments(prior, result, tnow, num=2))

    def f(bd):
      b, d = bd
      this = np.array(gb1Moments(1 / d, 1., alpha, b, num=2))
      return this - mom

    from scipy.optimize import least_squares
    res = least_squares(f, [beta, delta], bounds=((1.01, 0), (np.inf, np.inf)))
    # print("optimization cost", res.cost)
    newBeta, newDelta = res.x
    gb1 = (1 / newDelta, 1, alpha, newBeta, tnow)
  return dict(
      simple=gb1ToBeta(gb1),
      moved=gb1ToBeta(moveGb1(gb1, prior[2])),
      moved2=moveBeta(gb1ToBeta(gb1)))

def moveGb1(gb1: Tuple[float, float, float, float, float], priorT):
  """Given a GB1 model (4 parameters and time), find the closest GB1 distribution where alpha=beta

  This will be at the halflife. Can produce terrible results when stressed. Why?"""
  mom = np.array(gb1Moments(*gb1[:-1], num=4))

  def f(aa):
    newA, newAlpha = np.exp(aa[0]), np.exp(aa[1])
    this = np.array(gb1Moments(newA, 1., newAlpha, newAlpha, num=4))
    return this - mom

  from scipy.optimize import least_squares
  res = least_squares(f, [np.log(gb1[0]), np.log((gb1[2] + gb1[3]) / 2)])
  # print("optimization cost", res.cost)
  print('movegb1res', np.exp(res.x), 'cost', res.cost)
  expx = np.exp(res.x)
  return [expx[0], gb1[1], expx[1], expx[1], gb1[4]]

def estimate_half_life(model, quantile=0.5):
  """Robust half-life (or quantile-life) estimation.

    Use a root-finding routine in log-delta space to find the delta that
    will cause the GB1 distribution to have a mean of the requested quantile.
    Because we are using well-behaved normalized deltas instead of times, and
    owing to the monotonicity of the expectation with respect to delta, we can
    quickly scan for a rough estimate of the scale of delta, then do a finishing
    optimization to get the right value.
    """
  alpha, beta, t0 = model
  Bab = special.beta(alpha, beta)

  def f(lndelta):
    mean = special.beta(alpha + np.exp(lndelta), beta) / Bab
    return mean - quantile

  # Scan for a bracket.
  bracket_width = 6.0
  blow = -bracket_width / 2.0
  bhigh = bracket_width / 2.0
  flow = f(blow)
  fhigh = f(bhigh)
  while flow > 0 and fhigh > 0:
    # Move the bracket up.
    blow = bhigh
    flow = fhigh
    bhigh += bracket_width
    fhigh = f(bhigh)
  while flow < 0 and fhigh < 0:
    # Move the bracket down.
    bhigh = blow
    fhigh = flow
    blow -= bracket_width
    flow = f(blow)

  assert flow > 0 and fhigh < 0

  sol = optimize.root_scalar(f, bracket=[blow, bhigh])
  t1 = np.exp(sol.root) * t0
  return t1

def moveBeta(model: Tuple[float, float, float]):
  """Given a Beta model (2 parameters and time), find the closest Beta model at the halflife

  Works bmuch better than `moveGb1`"""
  th = estimate_half_life(model, 0.5)
  m = ebisu.predictRecall(model, th)
  v = ebisu.predictRecallVar(model, th)
  alpha1, beta1 = ebisu._meanVarToBeta(m, v)
  return (alpha1, beta1, th)

if __name__ == '__main__':
  model = (4., 30., 1.)
  model = (40., 6., 1.)
  model = (4., 6., 1.)

  def simulation(model, result, tnow):
    gold = ebisu.updateRecall(model, result, tnow)

    newModel = updateViaGb1(model, result, tnow)
    print(newModel)
    t = np.linspace(.1, 25, 200)

    def trace(model):
      return np.vectorize(lambda t: ebisu.predictRecall(model, t))(t)

    import pylab as plt
    plt.style.use('ggplot')
    plt.ion()
    plt.figure()
    plt.semilogy(t, trace(model), linewidth=5, label='prior')
    plt.semilogy(t, trace(gold), linewidth=4, label='post')
    plt.semilogy(t, trace(newModel['simple']), '--', linewidth=3, label='via GB1')
    plt.semilogy(t, trace(newModel['moved']), linewidth=2, label='via GB1@HL')
    plt.semilogy(t, trace(newModel['moved2']), '--', linewidth=1, label='Beta@HL')
    plt.legend()
    plt.title('Quiz={}, T={}'.format(result, tnow))

  simulation(model, True, 100.)
  simulation(model, False, 100.)

  simulation(model, True, 30.)
  simulation(model, False, 30.)

  simulation(model, True, 3.)
  simulation(model, False, 3.)

  simulation(model, True, .01)
  simulation(model, False, .01)
"""
For `model = (4., 6., 1.)`, 
- the "via GB1" model, which doesn't move the prior halflife (much) and
- the Beta@HL model, which telescopes the "via GB1" model to the halflife,
both agree very well for severe over/under-review, pass/fail quizzes.

Similarly for `model = (40, 6, 1)` and `(4., 30., 1.)`.

Telescoping is slightly less accurate extremely far in the future but...
"""
rkern commented 5 years ago

(I also found a negative result. Given a GB1 distribution, I tried searching for another GB1 distribution with the same moments but with a different a (time constant) parameter and where p=q (alpha=beta), as a way to maybe jointly estimate the halflife and the shape of the distribution there, and it works under benign cases but produces some very strange results in extreme cases. I include it here in case anyone else thinks it might be a good idea (or if they can find a flaw in my reasoning).)

Yeah, I don't think there's any benefit in that, per se. If you have a GB1 distribution, you can just evaluate its moments at its half-life and approximate the symmetric Beta there (i.e. moved2) with analytical formulae. The thing that you might want to do is to approximate the negative-recall posterior by estimating the symmetric beta (e.g. the two variables alpha(=beta) and delta) such that the moments match the negative-recall-posterior moments. That way, there's only one approximation rather than approximating the negative-recall posterior as a GB1 and then propagating it out to the approximation's half-life and re-approximating again as a symmetric Beta.

I suspect that moment-matching becomes inaccurate (at least via least-squares numerical optimization) in the low-/high-delta regimes as all the moments get very close to 1/0.

Playing around with the formulae, I'm increasingly convinced that in the negative-recall case, GB1 is not formally conjugate with a Bernoulli trial. I think the best we can do is find a numerically approximate GB1 distribution.

fasiha commented 5 years ago

Robert, would you be willing to releasing your estimate_half_life into the public domain? I'd like to use it as is in the next release of Ebisu, which is Unlicense'd.

Yesterday I did experiment with just what you suggested: fitting a symmetric Beta to the posterior after a failed quiz. Most of the time it produced very similar results as the existing telescoping half-life approach that you'd already investigated.

I've actually decided to not telescope the posterior to the half-life because of results like this: the plot below supposes you start with a high-beta model, and fail it, and look at the updated model's recall probabilities far in the future, for a variety of updates (no update, the basic GB1 update without telescoping, then telescoping to a Beta, and telescoping the GB1 posterior fit to a symmetric beta). Far into the future, the telescoped models give a higher recall probability than the prior after a failure, whereas the basic GB1 update behaves correctly (predicting lower recall probability than the prior). Admittedly this difference would make no practical difference since the probabilities involved are vanishingly small, but I'm getting heartburn trying to explain away this anomaly in the README 😝.

Recall probabilities after updating a model various ways

Stay tuned for the release drop!

rkern commented 5 years ago

Yes, unless otherwise noted, any code I have posted here or in my fork can be considered released under the same terms as fasiha/ebisu itself at the time of posting.

As for explaining it, as far as I'd be concerned, any absolute difference in probabilities less than 1% has no practical effect, especially when we are so close to 0% anyways. The difference between these ideal models and our "true" memory models is likely much, much larger.

fasiha commented 5 years ago

Robert!, it took me a month but I hope you'll agree that the result was worth it. It was the most serendipitous thing but as I was about to push the version matching what we'd last talked about, I was regenerating plots and saw that half-life after updating a failed quiz was higher than the previous half-life, which I didn't expect.

I pulled on that string and it turns out that you can put the failed quiz's posterior density function through another exponential decay to any other time horizon, not just the original time from the prior. Not only is that transformed posterior analytic, so are its moments, and they're gorgeous: I updated the doc but here's a screenshot:

Screen Shot 2019-06-28 at 22 17 19

I also added a version of your telescoping idea because I was encountering, under very extreme under-review (which can happen because I use 15 minutes as the initial half-life for newly learned items and I sometimes don't have a chance to review them for a couple of days), some numerical instabilities can pop up when keeping the updated models at the same time horizon as the prior. I also didn't want users to have to be sophisticated about picking a time for the updated models. So I tweaked the half-life-finding function you wrote to optionally return early with the result of the coarse window, and if the proposed posterior (back at the prior's time horizon) is too lopsided, I rerun the update for this approximate half-life.

I called this “rebalancing” in the code to emphasize that we’re balancing alpha and beta, and maybe I was a bit foolish in using the coarse result instead of using the finishing optimization (the rebalanacing idea is literally two hours old)…

A huge, huge thank you for this! I took a break to work on other projects after we last spoke above, but for the last few nights I've been working on this full-time and am happy to publish Ebisu 1.0 to PyPI, acknowledging my debt to your keen insights and time, not just code!

fasiha commented 5 years ago

PS. The failed-quiz posterior's half-life being greater than the prior's I tracked down to the fact that even the best-matching GB1 (found via numerical optimization) doesn't match the posterior that well. Using Monte Carlo, I was able to get very nice posterior Beta fits (whose half-lives were strictly less than or equal to the prior's) but couldn't get GB1s with that property. That's when I started playing with the posterior density itself and noticed I could transform it back to the prior's time horizon analytically, and then to any other time horizon I wanted.