fasiha / ebisu

Public-domain Python library for flashcard quiz scheduling using Bayesian statistics. (JavaScript, Java, Dart, and other ports available!)
https://fasiha.github.io/ebisu
The Unlicense
314 stars 32 forks source link

Request for comment: a new Ebisu v3 API, the Gamma ensemble #62

Open fasiha opened 1 year ago

fasiha commented 1 year ago

Ebisu v3 request for comment (RFC): Gamma ensemble

This RFC supersedes https://github.com/fasiha/ebisu/issues/58 because it is simpler and faster in terms of code, mathematically more compact and principled, and just as accurate if not more accurate.

Background

This RFC describes a proposal for Ebisu version 3 informally called "Gamma ensemble".

The current Ebisu version 2 posits a Beta distribution on the recall probability of a flashcard after a fixed time has elapsed since last study (so we can call this "Beta-on-recall"). So each flashcard is represented by a 3-tuple, (a, b, t) where a and b represent parameters of the Beta distribution and t is the time elapsed after which that Beta is believed to match the recall probability.

At any given time in the future, you can call ebisu.predictRecall to get the predicted recall probability for this flashcard. If you present this flashcard to the student, you then call ebisu.updateRecall with the quiz results, e.g., boolean (pass/fail); binomial (got 2 out of 3 right); or noisy-binary (got it right with a 99% probability of the student getting it right if they really know it, and 25% probability of the student "getting it right" when they've actually forgotten it).

The problem with this Beta-on-recall (see #43 and references therein, many thanks to @cyphar and others) is that its predicted recall is too pessimistic, due to a very specific modeling flaw: Ebisu v2 treated the probability of recall as a unknown but static random variable whose parameters it updated after each quiz. However, we know that memory strengthens as students review—memory is more dynamic than Ebisu v2 allowed for.

If you build flashcard apps that used Ebisu v2 to just sort flashcards from most-to-least likely to forget, I believe Ebisu v2 would yield reasonable flashchard schedules. However, many flashcard app developers still prefer scheduling a review when the predicted recall probability drops below some threshold (e.g., 80% or 50%, etc.), because this is very simple, very familiar (Anki and SuperMemo and Leitner and many other techniques do this), and because their apps don't model inter-card confusion/interference.

Work began on Ebisu v3 to fix this shortcoming. I wanted to keep the mathematical elegance, analytical correctness, implementation simplicity, and intuitive clarity while more accurately estimating recall probabilities as time passed and as memories strengthened.

My first attempt https://github.com/fasiha/ebisu/issues/58 (which maybe can be called the boosted-Gamma model) I rejected because, aside from being relatively slow (Monte Carlo) and needing a number of carefully-tuned parameters, it exhibited complex and unintuitive behavior where for example failing a review might increase the expected halflife of a flashcard, because of the interaction between modeling halflife and boost jointly.

The new v3 algorithm: Gamma ensemble

The new proposal is simple: whereas v2 kept track of a single random variable, let's track a weighted ensemble of random variables.

To make it easier to approximate the recall probability (in SQL, etc.), let's have the random variables be on halflives (in https://github.com/fasiha/ebisu/issues/32 via discussion with @ttencate we derived the math for Gamma-distributed halflives, and that's what I propose here).

That's about it. No prizes for novelty or originality! This is about as straightforward a solution as one can imagine. There are a lot of details to iron out, but in a nutshell, instead of just saying "A priori I guess that this fact's halflife has mean 24 hours, with this much variance" like you did with Ebisu v2, now you get to say, "A priori I guess that this fact's halflife has mean 24 hours and this variance, but let's just assign 90% weight to that, and then distribute the rest of the weight between 240 hours (ten days), 2400 hours (three months), 24,000 hours (2.7 years), and 240,000 hours (27 years)". That is, we have an ensemble of five random variables, with logarithmically-increasing halflives.

By logarithmically spacing out halflives like this, each with a decreasing weight, we can now model the possibility that this flashcard, after a little practice, might last a good year before being forgotten. That is, this ensemble allows power law decay, which considerable psychology research has shown governs memory.

Details

This v3 proposal definitely has more ad hoc parts of it than v2 did. I still think a lot of this remains quite principled and analytically sound, but there are a few places where I decided to do something because it seems to work and I don't have much justification for it. This means that in the future, we might find better ways to do some of these things. I'll call out the ad hoc bits as we go.

So the core of Ebisu stays the same. We still have binary/binomial quizzes (number of points out of total points allowed) and noisy-binary quizzes (what's the probability the student got this right-vs-wrong assuming they truly know the fact). We still have two phases, a predictRecall function that tells you the recall probability of a fact right now and and updateRecall function that modifies the model's parameters after a quiz.

Let's review how these two phases work in an ensemble.

At any given time, the ensemble's recall probability is a weighted power mean of the recall probability of each atom—each component of the ensemble. Each atom's probability of recall is just 2**(-currentTime / halflife), which is an approximation to the true Bayesian probability but it's close enough, and this is easily doable in SQL, etc.

Technical sidebar. We use a weighted power mean (see definition: "for a sequence of positive weights…") instead of a simpler weighted average because in practice, when some atoms predict a very low recall probability but other atoms predict higher recall probability, a simple weighted average often can't overcome the very low predictions. I love using a power mean here because it's more like a "weighted maximum": in practice, it's able to ignore the atoms predicting near-zero recall probability and allow the atoms predicting higher recall probability to express themselves.

Now let's talk about updates. When a quiz happens, each atom's Gamma distribution is updated, assuming its weight is large enough. (By only updating the atoms with some weight, we avoid situations like, you failing this flashcard a week after learning it should intuitively not affect that low-weight 27 year atom right? You don't want that Gamma distribution with mean 27 years to be reduced to 2 years because of this failure that it thinks is very unexpected. This is ad hoc! But it works well.)

Updating each atom is quite similar to Ebisu v2 except instead of updating a Beta random variable on recall probability like in v2, we update a Gamma random variable on halflife (we talked about this in 2020! https://github.com/fasiha/ebisu/issues/32). As in v2, the Bayesian posterior's moments can be computed analytically, and we match the first two moments to a new Gamma random variable. This is pretty efficient: it does require computing the modified Bessel function of the second kind, which is available in Scipy and can readily be ported to WebAssembly via f2c and Emscripten (see bessel-via-amos.js).

However, whether or not you update each atom's Gamma random variable, you do update each atom's weight on every quiz. We use a pretty obvious weight update: just like particle filters, the old weight is just scaled by the likelihood of observing this result. This gives us some nice properties like, failures strictly decrease the ensemble's halflife and successes strictly increase it.

In this way, v3 circumvents the rigidity of v2. As your memory strengthens or weakens over time, the weights update seamlessly in response to the probability of each quiz result under that atom's model. We always have low-weight atoms with long halflives ready to step up and prop up the recall probability as your memory improves and you go longer and longer between quizzes. We don't have to explicitly track interval factors like Anki, we don't move flashcards between bins like Leitner, the power law memory model is tracked largely within this ensemble-Bayesian approach.

Summary

In Ebisu v2, you only had to make two decisions: (1) what's your a priori guess of halflife and (2) how uncertain are you.

With this proposed v3 you have more decisions to make:

  1. What's your halflife guess and,
  2. how uncertain are you about it?
  3. How many atoms do you want and how spaced out do you want to make them? By default, Ebisu will make five atoms, i.e., five Gamma random variables, each with mean each 10× bigger than the previous one, and the same amount of uncertainty (standard deviation, per unit mean). You can ask it for more or fewer and specify a different logarithmic spacing.
  4. How much weight do you assign to your first atom? By default Ebisu will default to 90%, but whatever you pick, it'll apportion the remaining weight (by default 10%) logarithmically among the remaining atoms, so each successive atom with successively longer halflife gets logarithmically less weight. All weights will sum to one so the initial weights are fully determined by the weight you give to the first atom.
  5. Technically you can tweak what power to use in the weighted power mean. Ebisu by default picks 20, because that seems to be a nice round number and seems to work well.
  6. What minimum weight must an atom have before you apply a quiz update? Ebisu defaults to 0.05, again because it's a nice round number and seems to work well.

What comes next

The code for this is largely written. The atomic Gamma updates are highly well-tested—the math and code are reliable. The ensemble extension is something I've spent a lot of time tweaking and reworking, and it needs some more tests but seems to work well.

When I say "works well", I mean I've been comparing the previous v3 proposal (https://github.com/fasiha/ebisu/issues/58) and this proposal with different values for parameters against a few hundred Anki flashcards' quiz data I have over a couple of years. The quiz times follow Anki's SM2 scheduling but I'm able to benchmark algorithms via likelihood: for a given algorithm and parameters set, I can evaluate the probability it assigned to each quiz outcome and sum those up. A perfect oracle will always predict 100% recall probability whenever a quiz passed and 0% recall probability on each failure. A real algorithm will vary, so you can compare algorithms and parameter sets objectively this way over lots of flashcards and see what works well over a diverse variety of flashcards (hard ones, easy ones, etc.). This validation framework with Anki history will be made avaiable for you to play with. The default parameters described above are the result of experimentation and comparing what probabilities each parameter set assigned to each quiz result over a variety of flashcards.

I know everyone has given up on me keeping any kind of timeline with this effort, but my hope is that in the coming weeks I'll finish tests, update the README, and prepare a 3.0-rc1 release candidate for Python.

If anyone is interested, the code is still in this branch: start with https://github.com/fasiha/ebisu/blob/v3-leaky-integrators/ankiCompare.py and go from there.

As mentioned above, the Gamma Bayesian machinery does require a weird mathematical function, the modified Bessel function of the second kind (kv and kve). This is pretty niche unfortunately. Scipy implements this by wrapping a Fortran library called AMOS, and this library can be successfully compiled to WebAssembly for JavaScript. However, for JVM and Go and Dart and other runtimes, we'll have to either find a library or somehow port the intricate Fortran code to them. You might argue that this is enough reason to just use the Beta-on-recall model and extend it to the ensemble. It might be (sorry if this results in more delays!).

There's at least a couple of nice mathematical extensions that are within our grasp. Right now after each quiz, we analytically compute the moments of the posterior and then moment-match to the nearest Gamma distribution. There's a loss of accuracy each time we do this. However, we can actually combine posteriors of multiple quizzes and compute the moments of them all. This could result in a nice boost in accuracy—it's unclear whether this would materially affect predicted recall probabilities or updates, but even if it didn't, we'd have confidence in the algorithm.

I also did a lot of work to allow, instead of Gamma random variables, Weibull random variables (see here and linked posts). The Weibull distribution allows power-law-like fat tails for certain parameters which might be a useful extension. Using an ensemble of Gammas with very long tails doesn't seem to help algorithm performance, so I stopped working on this.

Thanks to everyone for bearing with me on this long journey. Hopefully it's coming to an end soon.

ambersz commented 1 year ago

You've said the problem with v2 was that memory strength is treated as some unknown static value. I might be misreading something here, but doesn't the atomic/ensemble model still fit that description? Does it just get around the bad effects that that had on halflife estimation in v2 because it can react to changing memory strength faster?

fasiha commented 1 year ago

doesn't the atomic/ensemble model still fit that description? Does it just get around the bad effects that that had on halflife estimation in v2 because it can react to changing memory strength faster?

You got it! Yes, you’re 100% right. The ensemble method just combines several strategically-placed static models to give the appearance/performance of a dynamic-memory model, but doesn’t have anything dynamic built into it like #58‘s boost model.

I like the way you put it, “it can react to changing memory strength faster” because indeed another way to fix v2 Ebisu’s problem is to use a probability distribution that’s far more adaptive to quizzes than the Beta-on-recall model. I experimented with a single atom on half-life with a fat-tailed prior like the bounded Pareto and the Weibull and power law distributions, and they work pretty well in terms of likelihood (the downside is they’re largely Monte Carlo-based, with posteriors with hard-to-find moments 😔)—these priors allow for a lot of density at the tails (at very long half-lives) and could as you say react faster to quizzes of increasing intervals.

I don’t think it’s a problem that the ensemble method achieves the illusion of dynamicity? The predicted recalls are reasonable, apps that want to schedule due dates based on probability of recall can use 80% as the recall threshold, etc. Is there something I might be overlooking, that might suggest we need a truly dynamic model?

ambersz commented 1 year ago

Ohh I see, thanks for the explanation! One more question then.... Since we want to have the models adapt faster in reaction to quizzes, then could I "patch" usage of v2 by just decreasing the relative weight past reviews have on the model? Something like: On the nth review, update the model as if you answered 2^n independent reviews all with a success/failure. Or maybe before updating the model, scale the Beta parameters down in a way that holds the mean constant but increases the variance?

fasiha commented 1 year ago

could I "patch" usage of v2 by just decreasing the relative weight past reviews have on the model?

Riiight, I don't think it'd help to deweight old reviews but your approach to increasing the weight of new quizzes by scaling their counts (number of successes) feels valid, as far as ad hoc customizations go. It'll be a bit of a challenge to decide how to balance the quiz number n versus calendar time between quizzes, but you can probably come up with something creative? I think a mixture of Betas (or Gammas, or the boosted-Gamma model of #58 😅) is more principled than this though?

I'm delaying releasing a v3 release candidate because I'm now investigating what happens if we use a mixture of Betas (the very very straightforward extension of v2). The advantage of a mixture of Betas with increasing halflives is, like v2, we just need the Γ function, whereas a Gamma-on-halflife (as proposed above) needs the modified Bessel function of the second kind 😳 which seems non-trivial to port to Go or Rust or whatever. Hoping to wrap this analysis up soon and release v3 with the mixture-of-Gammas proposal above or adapted to Betas!

fasiha commented 1 year ago

I'm now investigating what happens if we use a mixture of Betas (the very very straightforward extension of v2)

Some news! The mixture-of-Betas-on-recall (where each atom is a v2 model) is pretty much equivalent to the mixture-of-Gammas-on-halflife (where each atom is a Gamma distribution), assuming you keep all the other parameters the same (like starting halflife, atoms' weights, weight threshold for update, etc.). The predicted halflives are quite similar, as are the individual quiz predictions so in log-likelihood terms the two appear to be interchangeable.

This is a really nice theoretical result! It's somewhat unexpected: I thought it'd be a lot of work to align the variances of the Beta atoms with the Gamma atoms to get this kind of parity but I just use sensible defaults and both algorithms behave quite similarly.

So we have a choice: should v3 ship with a mixture of Betas or Gammas?

Comparing approximation vs exact Bayesian predict recall for mixture-of-Betas versus mixture-of-Gammas, for models of various overall maturity

This chart shows the ratio of the SQL-friendly approximation to predictRecall versus the full-Bayesian predictRecall, for the Betas and Gammas approaches, for a sequence of models (from the same flashcard's review history). Each line corresponds to a certain model (output of initModel or updateModel).

What this shows is for mixture-of-Betas, the approximation is really good till up to 2x the due date ("due date" = time till approximate predictRecall drops to 80%), then degrades from the true value quite dramatically, dropping to 0.8x the true value by roughly 10x to 20x the due date, and then continuing to break down.

Meanwhile, for mixture-of-Gammas, the approximation is 1 to 1.2x the real value, over the entire range of the model support (past 10^5 hours). So in this case, the approximation is bad up front (for most of the time we expect to call predictRecall) and stays "just bad" most of the time.

I'm very much leaning towards releasing v3 as a mixture-of-Betas, even though it means I'd be throwing away all the math we did for Gamma priors on halflives 😆