fasiha / ebisu

Public-domain Python library for flashcard quiz scheduling using Bayesian statistics. (JavaScript, Java, Dart, and other ports available!)
https://fasiha.github.io/ebisu
The Unlicense

Ebisu assumes that half-lives do not change after reviews #43

Open · cyphar opened this issue 3 years ago

cyphar commented 3 years ago

This is a summary of this reddit discussion we had some time ago, and I'm mostly posting this here so that:

  1. Folks who look into ebisu can see that this is a known aspect of ebisu and can get some information on what this means for using it.
  2. The discussion we had on Reddit doesn't disappear down the internet memory hole.

(You mentioned you'd open an issue about it, but I guess other things got in the way. :smile_cat:)

The main concern I have with ebisu at the moment is its implicit assumption that the half-life of a card is a fundamental property of that card: independent of how many times you've reviewed a card, that card will be forgotten at approximately the same rate. (Because ebisu uses Bayes, this half-life does grow with each review, but the fundamental assumption is still there.) The net effect is that you do far more reviews than necessary -- at least if you use ebisu in an Anki-style application that quizzes cards once they fall below a specific expected recall probability; I'm not sure if ebisu used in its intended application would show you a card you know over a card you don't.

To use a practical metric: if you take a real Anki deck (with a historical recall rate above 80%) and replay its review history through ebisu, ebisu will predict that the vast majority of cards are already past their half-life or have a predicted recall below 50%. In addition, if you construct a fake review history where the card is always passed, ebisu will only grow the interval by ~1.3x each review. This is a problem because we know that Anki's (flawed) method of multiplying the interval by 2.5x works (even for cards without perfect recall), so ebisu is clearly systematically underestimating how the half-life of a card changes after a quiz.
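
For anyone who wants to reproduce the always-pass experiment, here's a minimal sketch against ebisu v2's Python API (`defaultModel`, `updateRecall`, `modelToPercentileDecay`); the 24-hour initial halflife and the alpha=beta=2 prior are just illustrative choices:

```python
import ebisu

model = ebisu.defaultModel(24.0, 2.0)  # illustrative prior: 24-hour halflife, alpha=beta=2
halflife = ebisu.modelToPercentileDecay(model)
for review in range(10):
    # quiz exactly at the current halflife and always pass
    model = ebisu.updateRecall(model, 1, 1, halflife)
    newHalflife = ebisu.modelToPercentileDecay(model)
    print(f"review {review + 1}: halflife grew {newHalflife / halflife:.2f}x")
    halflife = newHalflife
```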

In my view this is a flaw in what ebisu is trying to model: by basing the model on a fundamental half-life quantity, ebisu treats a second-order effect, which varies with each review, as a constant. As discussed on Reddit, you had the idea that we should model the derivative of the half-life explicitly (which you called the velocity); in Anki terminology this would be equivalent to modelling the ease factor explicitly. I completely agree this would be a far more accurate model, since the ease factor of a card seems to be a much more stable, intrinsic property of the card (the ease factor might evolve as a card moves into long-term memory, but at the least it should be a slowly-varying quantity).

This was your comment on how we might do this:

> I'm seeing if we can adapt the Beta/GB1 Bayesian framework developed for Ebisu so far to this more dynamic model using Kalman filters: the probability of recall still decays exponentially but now has these extra parameters governing it that we're interested in estimating. This will properly get us away from the magic SM-2 numbers that you mention.
>
> (Sci-fi goal: if we get this working for a single card, we can do Bayesian clustering using Dirichlet process priors on all the cards in a deck to group together cards that kind of age in a similar manner.)
>
> I'll be creating an issue in the Ebisu repo and tagging you as this progresses. Once again, many thanks for your hard thinking and patience with me!

(I am completely clueless about Kalman filters, and I honestly struggled to understand the Beta/GB1 framework so sadly I'm not sure I can be too much of a help here. Maybe I should've taken more stats courses.)

fasiha commented 3 years ago

Thanks for opening an issue, and for raising the initial issue and chatting about it on Reddit! I haven't forgotten about this! I've been trying a few different ways to achieve the goal of evolving the halflife or the probability of recall, with a straightforward statistical model and compact predict/update equations, but haven't been able to make much progress yet.

> if you take a real Anki deck (with a historical recall rate above 80%) and replay its review history through ebisu, ebisu will predict that the vast majority of cards are already past their half-life or have a predicted recall below 50%.

To elaborate on this for future readers: another way to see it is to take a representative flashcard from a real Anki deck, one where you've passed a bunch of quizzes and failed a few, and fit a maximum likelihood estimate for the initial model—either the full 2D case (find the alpha=beta and initial-halflife variables that maximize the likelihood) or the simpler 1D case (fix alpha=beta=2, say, and find the initial halflife that maximizes the likelihood). You'll find that the maximum-likelihood initial halflife is something like 10,000 hours, which is obviously wrong.
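
Here's a minimal sketch of that 1D fit, assuming ebisu v2's Python API and a hypothetical `history` of (elapsed hours, passed) pairs; it replays the whole review history for each candidate initial halflife and keeps the likelihood-maximizing one:

```python
import math
import ebisu

def logLikelihood(initialHalflife, history, alpha=2.0):
    """history: list of (elapsedHours, passed) in review order."""
    model = ebisu.defaultModel(initialHalflife, alpha)
    ll = 0.0
    for elapsed, passed in history:
        logp = ebisu.predictRecall(model, elapsed)  # log-probability of recall
        ll += logp if passed else math.log(-math.expm1(logp))  # log(1 - p) on failure
        model = ebisu.updateRecall(model, int(passed), 1, elapsed)
    return ll

history = [(20.0, True), (50.0, True), (150.0, True), (400.0, False)]  # hypothetical
candidates = [2.0 ** n for n in range(0, 16)]  # 1 hour to ~32,000 hours
best = max(candidates, key=lambda h: logLikelihood(h, history))
print("maximum-likelihood initial halflife:", best, "hours")
```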

The math isn't wrong; the underlying model is broken. The easiest way to see why: ignore the exponential decay of memory for now and assume you quiz exactly at each halflife. Then the model simplifies to a Beta random variable that just accumulates the number of successes and failures—the classic conjugate prior for Bernoulli trials. If quizzes at the halflife were flips of a weighted coin, Ebisu would estimate the weight of that coin; but memory isn't a fixed coin flip: the weight of the coin (the strength of recall just after each quiz) changes with the number of quizzes.
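
A toy illustration of that simplification, with no ebisu calls at all: quizzing exactly at the halflife reduces to the textbook Beta-Bernoulli update, where successes and failures just accumulate and the halflife itself never moves:

```python
# Beta prior on recall-at-halflife; each quiz at the halflife is a Bernoulli trial
alpha, beta = 2.0, 2.0
for passed in [True, True, False, True, True]:
    if passed:
        alpha += 1
    else:
        beta += 1
# posterior mean of the "coin weight", i.e., recall probability at the halflife
print("posterior mean recall at halflife:", alpha / (alpha + beta))
```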

The goal for an improved model is to explicitly track the strength of the memory over time. It's unclear whether the Beta/Bernoulli model has a place in this. You can imagine a random process (e.g., a white-noise Gaussian process) trying to estimate this hidden memory strength as reviews come in, but I'm struggling to separate the natural evolution of the memory strength (independent of reviews) from the exponential decay after each review. I will update this thread when I have something solid! I'm happy to receive proposals for algorithms too!


For Ebisu users, this explains why you can't just schedule reviews for when predicted recall drops to 80% or 50%: in the past I thought those predictions weren't realistic because we were very handwavy about our initial models, but actually it's because of the issue above.

If you are using Ebisu now, you most likely already have a workaround, perhaps reviewing when recall drops to 10% or reporting recall probability with a visual meter (see here), and that will continue to work!

If you are evaluating Ebisu, know that the output of predictRecall will not match the real-world empirical pass rate except for "mature" cards (whose memory strength has plateaued).

I am still using Ebisu to power my apps, but my workflow is quite different from Anki's: I don't schedule flashcards at all. I use recall probability to rank cards and find the ones most at risk of being forgotten during my study sessions. So while predictRecall's output isn't reliable in absolute terms, relative to other cards it's fine.
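
Concretely, that ranking workflow can look something like the sketch below. The `cards` list is hypothetical; the ebisu calls are real, and since predictRecall returns a log-probability by default and log is monotonic, you can sort on it directly without the more expensive exact computation:

```python
import ebisu

# hypothetical card store: (name, ebisu model, hours since last review)
cards = [
    ("apple", ebisu.defaultModel(24.0), 100.0),
    ("banana", ebisu.defaultModel(24.0), 10.0),
    ("cherry", ebisu.defaultModel(48.0), 300.0),
]

# lowest predicted recall first: these are most at risk of being forgotten
atRisk = sorted(cards, key=lambda c: ebisu.predictRecall(c[1], c[2]))
for name, model, hours in atRisk:
    print(name, ebisu.predictRecall(model, hours, exact=True))
```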

fasiha commented 2 years ago

Posted an interim detailed update at https://github.com/fasiha/ebisu/issues/35#issuecomment-899252582.

fasiha commented 2 years ago

A quick update—

I've basically found a solution: a Bayesian implementation of what @cyphar above calls "ease", i.e., the growth of the halflife as a function of reviews. In a nutshell, the idea is that, as part of the updateRecall step after each quiz, Ebisu can bump the card's halflife by a certain factor. Since we like to be Bayesian, the ease is modeled as a Gamma random variable and updated along with recall.
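
To make that concrete, here's a hypothetical weighted-samples (Monte Carlo) sketch of the boost idea—a reading of the description above, not anything from the Ebisu codebase, with made-up Gamma prior parameters: each sample is a (halflife, boost) pair, a quiz reweights samples by the likelihood of the observed result, and a pass multiplies each sample's halflife by its boost.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
halflife = rng.gamma(shape=2.0, scale=24.0 / 2.0, size=N)  # prior mean: 24 hours
boost = rng.gamma(shape=10.0, scale=1.4 / 10.0, size=N)    # prior mean: 1.4x per pass
weights = np.full(N, 1.0 / N)

def quiz(halflife, boost, weights, elapsedHours, passed):
    pRecall = 2.0 ** (-elapsedHours / halflife)  # exponential forgetting, per sample
    weights = weights * (pRecall if passed else 1.0 - pRecall)
    weights = weights / weights.sum()
    if passed:
        halflife = halflife * boost  # the "ease"/boost bump on success
    return halflife, boost, weights

halflife, boost, weights = quiz(halflife, boost, weights, 20.0, True)
print("posterior mean halflife:", weights @ halflife, "hours")
print("posterior mean boost:", weights @ boost)
```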

I'm really disappointed that I might have to change the Ebisu update process to use Monte Carlo—a big part of what Ebisu meant to me was fully analytical updates. But I'm slowly making peace with this—moving to Monte Carlo actually makes everything a lot simpler and may give us a lot of powerful tools to do things we've wanted to do.

For example—one of those things is how we model correlations between flashcards. We know that there are pairs of cards where, if you review one, then the other is (very) likely to be a success. How do we detect those?

We also know that flashcards can have a lot of metadata associated with them—Duolingo's halflife regression paper got all of us excited about the possibilities here. Often this metadata lets us predict recall for cards that the user hasn't even studied.

Instead of rushing out a big Ebisu version that explicitly models each card's ease (I was calling it "boost") as a Bayesian parameter, I'm spending some time experimenting with how to make Ebisu more general to accommodate things like correlations and metadata, which are now a lot easier to handle since we're using Monte Carlo. I'm hoping to timebox that effort though, so if I don't make any concrete breakthroughs there, I'll try to release that new Ebisu version that accounts for ease so folks can start using it.

fasiha commented 2 years ago

Another quick update. I think I've mostly finished the math and the code for the new version of Ebisu, and I'm working on tests, which are exposing various bugs—hopefully there are no more unknown unknowns.

I'm hoping to post an RFC with the new API in a few weeks.

When I posted my last comment in September, I was afraid we'd have to use Monte Carlo, but luckily there's a simpler, less computationally intensive way to handle updates: MAP (maximum a posteriori) estimates of the halflife and the boost factor (Anki's ease factor).
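
As a hedged sketch of what such a MAP update could look like under the halflife-plus-boost model (illustrative Gamma priors, scipy for the optimization; this is one reading of the comment, not the actual new-version code):

```python
import numpy as np
from scipy.optimize import minimize

def negLogPosterior(params, history, hPrior=(2.0, 12.0), bPrior=(10.0, 0.14)):
    """params = (log halflife, log boost); priors are (shape, scale) Gammas."""
    logH, logB = params
    h, b = np.exp(logH), np.exp(logB)
    # Gamma log-densities in log-space (includes the change-of-variables Jacobian)
    lp = hPrior[0] * logH - h / hPrior[1] + bPrior[0] * logB - b / bPrior[1]
    for elapsed, passed in history:
        p = 2.0 ** (-elapsed / h)  # exponential forgetting
        lp += np.log(p if passed else 1.0 - p)
        if passed:
            h *= b  # boost the halflife after each success
    return -lp

history = [(20.0, True), (50.0, True), (150.0, True), (400.0, False)]  # hypothetical
res = minimize(negLogPosterior, x0=[np.log(24.0), np.log(1.4)], args=(history,))
hMap, bMap = np.exp(res.x)
print(f"MAP halflife ≈ {hMap:.1f} hours, MAP boost ≈ {bMap:.2f}x")
```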

As a reminder of where the new version stands, here's a sample card's quiz history run through the new model; each row shows the quiz result (1 = fail, marked 🔥; 2–4 = increasingly confident passes) and the hours elapsed since the previous quiz, followed by the model's halflife and predicted recall (pRecall):

| result | elapsed (hours) | halflife (hours) | pRecall |
|-------:|----------------:|-----------------:|--------:|
| 3 | 23.38 | 15.6 | 0.22 |
| 3 | 118.24 | 30.8 | 0.02 |
| 2 | 22.42 | 60.8 | 0.69 |
| 2 | 76.28 | 66.6 | 0.32 |
| 2 | 192.44 | 131.4 | 0.23 |
| 2 | 119.20 | 259.3 | 0.63 |
| 3 | 347.87 | 316.8 | 0.33 |
| 2 | 338.48 | 625.1 | 0.58 |
| 2 | 420.03 | 834.9 | 0.60 |
| 2 | 841.76 | 1070.6 | 0.46 |
| 1 🔥 | 964.46 | 1794.1 | 0.58 |
| 4 | 15.33 | 1794.1 | 0.99 |
| 4 | 118.73 | 1794.1 | 0.94 |
| 3 | 314.67 | 1794.1 | 0.84 |
| 1 🔥 | 668.11 | 1794.1 | 0.69 |
| 4 | 26.48 | 1794.1 | 0.99 |
| 3 | 108.05 | 1794.1 | 0.94 |
| 4 | 128.55 | 1794.1 | 0.93 |
| 4 | 411.86 | 1794.1 | 0.79 |
| 4 | 1391.14 | 1794.1 | 0.46 |
| 3 | 4992.25 | 2979.6 | 0.19 |

Timeline: RFC in a few weeks, release in another few weeks (documentation takes me as long to write as doing mathematical analysis…).

Edit: the progress is happening in another repo: https://github.com/fasiha/ebisu-likelihood-analysis/commits/main

eshapard commented 2 years ago

This thread cleared some things up for me. I never did understand how ebisu was modeling the increase in half-lives due to reviews. I thought I just wasn't understanding the algorithm well enough. I also don't schedule flashcards at all; ebisu tells me what to study, but not when to study it. So I guess I'm not that affected :-)

I'm looking forward to seeing your new version.