fasiha / ebisu

Public-domain Python library for flashcard quiz scheduling using Bayesian statistics. (JavaScript, Java, Dart, and other ports available!)
https://fasiha.github.io/ebisu
The Unlicense

Ebisu assumes that half-lives do not change after reviews #43

Open · cyphar opened this issue 3 years ago

cyphar commented 3 years ago

This is a summary of this reddit discussion we had some time ago, and I'm mostly posting this here so that:

  1. Folks who look into ebisu can see that this is a known aspect of ebisu and can get some information on what this means for using it.
  2. The discussion we had on Reddit doesn't disappear down the internet memory hole.

(You mentioned you'd open an issue about it, but I guess other things got in the way. :smile_cat:)

The main concern I have with ebisu at the moment is its implicit assumption that the half-life of a card is a fundamental property of that card: independent of how many times you've reviewed a card, that card will be forgotten at approximately the same rate. (Because ebisu uses Bayes, this half-life does grow with each review, but the fundamental assumption is still there.) The net effect is that you do far more reviews than necessary -- at least if you use ebisu in an Anki-style application that quizzes cards once they fall below a specific expected recall probability; I'm not sure if ebisu used in its intended application would show you a card you know over a card you don't.

To use a practical metric: if you take a real Anki deck (with a historical recall rate above 80%) and replay its review history through ebisu, ebisu will predict that the vast majority of cards are already past their half-life or have a predicted recall below 50%. In addition, if you construct a fake review history where the card is always passed, ebisu will only grow the interval by ~1.3x each review. This is a problem because we know that Anki's (flawed) method of multiplying the interval by 2.5x works (even for cards without perfect recall), so ebisu is clearly systematically underestimating how the half-life of a card changes after a quiz.
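
For anyone who wants to reproduce the always-pass experiment, here's a minimal sketch against ebisu v2's Python API (`defaultModel`, `updateRecall`, `modelToPercentileDecay`); the 24-hour initial halflife and the alpha=beta=2 prior are just illustrative choices:

```python
import ebisu

model = ebisu.defaultModel(24.0, 2.0)  # illustrative prior: 24-hour halflife, alpha=beta=2
halflife = ebisu.modelToPercentileDecay(model)
for review in range(10):
    # quiz exactly at the current halflife and always pass
    model = ebisu.updateRecall(model, 1, 1, halflife)
    newHalflife = ebisu.modelToPercentileDecay(model)
    print(f"review {review + 1}: halflife grew {newHalflife / halflife:.2f}x")
    halflife = newHalflife
```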

In my view this is a flaw in what ebisu is trying to model: by basing the model on a fundamental half-life quantity, ebisu treats a second-order effect, which varies with each review, as a constant. As discussed on Reddit, you had the idea that we should model the derivative of the half-life explicitly (which you called the velocity); in Anki terminology this would be equivalent to modelling the ease factor explicitly. I completely agree this would be a far more accurate model, since the ease factor of a card seems to be a much more stable, intrinsic property of the card (the ease factor might evolve as a card moves into long-term memory, but at the least it should be a slowly-varying quantity).

This was your comment on how we might do this:

> I'm seeing if we can adapt the Beta/GB1 Bayesian framework developed for Ebisu so far to this more dynamic model using Kalman filters: the probability of recall still decays exponentially but now has these extra parameters governing it that we're interested in estimating. This will properly get us away from the magic SM-2 numbers that you mention.
>
> (Sci-fi goal: if we get this working for a single card, we can do Bayesian clustering using Dirichlet process priors on all the cards in a deck to group together cards that kind of age in a similar manner.)
>
> I'll be creating an issue in the Ebisu repo and tagging you as this progresses. Once again, many thanks for your hard thinking and patience with me!

(I am completely clueless about Kalman filters, and I honestly struggled to understand the Beta/GB1 framework so sadly I'm not sure I can be too much of a help here. Maybe I should've taken more stats courses.)

fasiha commented 3 years ago

Thanks for opening an issue, and for raising the initial issue and chatting about it on Reddit! I haven't forgotten about this! I've been trying a few different ways to achieve the goal of evolving the halflife or the probability of recall, with a straightforward statistical model and compact predict/update equations, but haven't been able to make much progress yet.

> if you take a real Anki deck (with a historical recall rate above 80%) and replay its review history through ebisu, ebisu will predict that the vast majority of cards are already past their half-life or have a predicted recall below 50%.

To elaborate on this for future readers: another way to see it is to take a representative flashcard from a real Anki deck, one where you've passed a bunch of quizzes and failed a few, and fit a maximum likelihood estimate for the initial model—either the full 2D case (find the alpha=beta and initial-halflife variables that maximize the likelihood) or the simpler 1D case (fix alpha=beta=2, say, and find the initial halflife that maximizes the likelihood). You'll find that the maximum-likelihood initial halflife is something like 10,000 hours, which is obviously wrong.
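
Here's a minimal sketch of that 1D fit, assuming ebisu v2's Python API and a hypothetical `history` of (elapsed hours, passed) pairs; it replays the whole review history for each candidate initial halflife and keeps the likelihood-maximizing one:

```python
import math
import ebisu

def logLikelihood(initialHalflife, history, alpha=2.0):
    """history: list of (elapsedHours, passed) in review order."""
    model = ebisu.defaultModel(initialHalflife, alpha)
    ll = 0.0
    for elapsed, passed in history:
        logp = ebisu.predictRecall(model, elapsed)  # log-probability of recall
        ll += logp if passed else math.log(-math.expm1(logp))  # log(1 - p) on failure
        model = ebisu.updateRecall(model, int(passed), 1, elapsed)
    return ll

history = [(20.0, True), (50.0, True), (150.0, True), (400.0, False)]  # hypothetical
candidates = [2.0 ** n for n in range(0, 16)]  # 1 hour to ~32,000 hours
best = max(candidates, key=lambda h: logLikelihood(h, history))
print("maximum-likelihood initial halflife:", best, "hours")
```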

The math isn't wrong; the underlying model is broken. The easiest way to see why: ignore the exponential decay of memory for now and assume you quiz exactly at each halflife. Then the model simplifies to a Beta random variable that just accumulates the number of successes and failures—the classic conjugate prior for Bernoulli trials. If quizzes at the halflife were flips of a weighted coin, Ebisu would estimate the weight of that coin; but memory isn't a fixed coin flip: the weight of the coin (the strength of recall just after each quiz) changes with the number of quizzes.
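
A toy illustration of that simplification, with no ebisu calls at all: quizzing exactly at the halflife reduces to the textbook Beta-Bernoulli update, where successes and failures just accumulate and the halflife itself never moves:

```python
# Beta prior on recall-at-halflife; each quiz at the halflife is a Bernoulli trial
alpha, beta = 2.0, 2.0
for passed in [True, True, False, True, True]:
    if passed:
        alpha += 1
    else:
        beta += 1
# posterior mean of the "coin weight", i.e., recall probability at the halflife
print("posterior mean recall at halflife:", alpha / (alpha + beta))
```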

The goal for an improved model is to explicitly track the strength of the memory over time. It's unclear whether the Beta/Bernoulli model has a place in this. You can imagine a random process (e.g., a white-noise Gaussian process) trying to estimate this hidden memory strength as reviews come in, but I'm struggling to separate the natural evolution of the memory strength (independent of reviews) from the exponential decay after each review. I will update this thread when I have something solid! I'm happy to receive proposals for algorithms too!


For Ebisu users, this explains why you can't just schedule reviews for when predicted recall drops to 80% or 50%: in the past I thought those predictions weren't realistic because we were very handwavy about our initial models, but actually it's because of the issue above.

If you are using Ebisu now, you most likely already have a workaround, perhaps reviewing when recall drops to 10% or reporting recall probability with a visual meter (see here), and that will continue to work!

If you are evaluating Ebisu, know that the output of predictRecall will not match the real-world empirical pass rate except for "mature" cards (whose memory strength has plateaued).

I am still using Ebisu to power my apps, but my workflow is quite different from Anki's: I don't schedule flashcards at all. I use recall probability to rank cards and find the ones most at risk of being forgotten during my study sessions. So while predictRecall's output isn't reliable in absolute terms, relative to other cards it's fine.
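
Concretely, that ranking workflow can look something like the sketch below. The `cards` list is hypothetical; the ebisu calls are real, and since predictRecall returns a log-probability by default and log is monotonic, you can sort on it directly without the more expensive exact computation:

```python
import ebisu

# hypothetical card store: (name, ebisu model, hours since last review)
cards = [
    ("apple", ebisu.defaultModel(24.0), 100.0),
    ("banana", ebisu.defaultModel(24.0), 10.0),
    ("cherry", ebisu.defaultModel(48.0), 300.0),
]

# lowest predicted recall first: these are most at risk of being forgotten
atRisk = sorted(cards, key=lambda c: ebisu.predictRecall(c[1], c[2]))
for name, model, hours in atRisk:
    print(name, ebisu.predictRecall(model, hours, exact=True))
```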

fasiha commented 2 years ago

Posted an interim detailed update at https://github.com/fasiha/ebisu/issues/35#issuecomment-899252582.

fasiha commented 2 years ago

A quick update—

I've basically found a solution: a Bayesian implementation of what @cyphar above calls "ease", i.e., the growth of the halflife as a function of reviews. In a nutshell, the idea is that, as part of the updateRecall step after each quiz, Ebisu can bump the card's halflife by a certain factor. Since we like to be Bayesian, the ease is modeled as a Gamma random variable and updated along with recall.
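
To make that concrete, here's a hypothetical weighted-samples (Monte Carlo) sketch of the boost idea—a reading of the description above, not anything from the Ebisu codebase, with made-up Gamma prior parameters: each sample is a (halflife, boost) pair, a quiz reweights samples by the likelihood of the observed result, and a pass multiplies each sample's halflife by its boost.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
halflife = rng.gamma(shape=2.0, scale=24.0 / 2.0, size=N)  # prior mean: 24 hours
boost = rng.gamma(shape=10.0, scale=1.4 / 10.0, size=N)    # prior mean: 1.4x per pass
weights = np.full(N, 1.0 / N)

def quiz(halflife, boost, weights, elapsedHours, passed):
    pRecall = 2.0 ** (-elapsedHours / halflife)  # exponential forgetting, per sample
    weights = weights * (pRecall if passed else 1.0 - pRecall)
    weights = weights / weights.sum()
    if passed:
        halflife = halflife * boost  # the "ease"/boost bump on success
    return halflife, boost, weights

halflife, boost, weights = quiz(halflife, boost, weights, 20.0, True)
print("posterior mean halflife:", weights @ halflife, "hours")
print("posterior mean boost:", weights @ boost)
```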

I'm really disappointed that I might have to change the Ebisu update process to use Monte Carlo—a big part of what Ebisu meant to me was fully analytical updates. But I'm slowly making peace with this—moving to Monte Carlo actually makes everything a lot simpler and may give us a lot of powerful tools to do things we've wanted to do.

For example—one of those things is how we model correlations between flashcards. We know that there are pairs of cards where, if you review one, then the other is (very) likely to be a success. How do we detect those?

We also know that flashcards can have a lot of metadata associated with them—Duolingo's halflife regression paper got all of us excited about the possibilities here. Often this metadata lets us predict recall for cards that the user hasn't even studied.

Instead of rushing out a big Ebisu version that explicitly models each card's ease (I was calling it "boost") as a Bayesian parameter, I'm spending some time experimenting with how to make Ebisu more general to accommodate things like correlations and metadata, which are now a lot easier to handle since we're using Monte Carlo. I'm hoping to timebox that effort though, so if I don't make any concrete breakthroughs there, I'll try to release that new Ebisu version that accounts for ease so folks can start using it.

fasiha commented 2 years ago

Another quick update. I think I've mostly finished the math and the code for the new version of Ebisu, and I'm working on tests, which are exposing various bugs—hopefully there are no more unknown unknowns.

I'm hoping to post an RFC with the new API in a few weeks.

When I posted my last comment in September, I was afraid we'd have to use Monte Carlo, but luckily there's a simpler, less computationally intensive way to handle updates: MAP (maximum a posteriori) estimates of the halflife and the boost factor (Anki's ease factor).
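
As a hedged sketch of what such a MAP update could look like under the halflife-plus-boost model (illustrative Gamma priors, scipy for the optimization; this is one reading of the comment, not the actual new-version code):

```python
import numpy as np
from scipy.optimize import minimize

def negLogPosterior(params, history, hPrior=(2.0, 12.0), bPrior=(10.0, 0.14)):
    """params = (log halflife, log boost); priors are (shape, scale) Gammas."""
    logH, logB = params
    h, b = np.exp(logH), np.exp(logB)
    # Gamma log-densities in log-space (includes the change-of-variables Jacobian)
    lp = hPrior[0] * logH - h / hPrior[1] + bPrior[0] * logB - b / bPrior[1]
    for elapsed, passed in history:
        p = 2.0 ** (-elapsed / h)  # exponential forgetting
        lp += np.log(p if passed else 1.0 - p)
        if passed:
            h *= b  # boost the halflife after each success
    return -lp

history = [(20.0, True), (50.0, True), (150.0, True), (400.0, False)]  # hypothetical
res = minimize(negLogPosterior, x0=[np.log(24.0), np.log(1.4)], args=(history,))
hMap, bMap = np.exp(res.x)
print(f"MAP halflife ≈ {hMap:.1f} hours, MAP boost ≈ {bMap:.2f}x")
```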

As a reminder of where the new version stands, here's a sample card's quiz history run through the new model; each row shows the quiz result (1 = fail, marked 🔥; 2–4 = increasingly confident passes) and the hours elapsed since the previous quiz, followed by the model's halflife and predicted recall (pRecall):

| result | elapsed (hours) | halflife (hours) | pRecall |
|-------:|----------------:|-----------------:|--------:|
| 3 | 23.38 | 15.6 | 0.22 |
| 3 | 118.24 | 30.8 | 0.02 |
| 2 | 22.42 | 60.8 | 0.69 |
| 2 | 76.28 | 66.6 | 0.32 |
| 2 | 192.44 | 131.4 | 0.23 |
| 2 | 119.20 | 259.3 | 0.63 |
| 3 | 347.87 | 316.8 | 0.33 |
| 2 | 338.48 | 625.1 | 0.58 |
| 2 | 420.03 | 834.9 | 0.60 |
| 2 | 841.76 | 1070.6 | 0.46 |
| 1 🔥 | 964.46 | 1794.1 | 0.58 |
| 4 | 15.33 | 1794.1 | 0.99 |
| 4 | 118.73 | 1794.1 | 0.94 |
| 3 | 314.67 | 1794.1 | 0.84 |
| 1 🔥 | 668.11 | 1794.1 | 0.69 |
| 4 | 26.48 | 1794.1 | 0.99 |
| 3 | 108.05 | 1794.1 | 0.94 |
| 4 | 128.55 | 1794.1 | 0.93 |
| 4 | 411.86 | 1794.1 | 0.79 |
| 4 | 1391.14 | 1794.1 | 0.46 |
| 3 | 4992.25 | 2979.6 | 0.19 |

Timeline: RFC in a few weeks, release in another few weeks (documentation takes me as long to write as doing mathematical analysis…).

Edit: the progress is happening in another repo: https://github.com/fasiha/ebisu-likelihood-analysis/commits/main

eshapard commented 2 years ago

This thread cleared some things up for me. I never did understand how ebisu was modeling the increase in half-lives due to reviews. I thought I just wasn't understanding the algorithm well enough. I also don't schedule flashcards at all; ebisu tells me what to study, but not when to study it. So I guess I'm not that affected :-)

I'm looking forward to seeing your new version.