fasiha / ebisu

Public-domain Python library for flashcard quiz scheduling using Bayesian statistics. (JavaScript, Java, Dart, and other ports available!)
https://fasiha.github.io/ebisu
The Unlicense

Dev diary: split 3-atom model #66

Open fasiha opened 9 months ago

fasiha commented 9 months ago

This dev diary is the third open proposal for Ebisu v3:

Both the above techniques share two nice desiderata:

  1. realistic predictions on recall probability (Ebisu v2 was always too conservative, often predicting <1% probability of a successful quiz)
  2. realistic strengthening of a memory model after successful quiz (Ebisu v2's post-quiz halflife update was also very conservative)

But there's another desideratum:

  3. Repeatedly quizzing at the same interval shouldn't dramatically increase the halflife. That is, following Mozer et al. (DOI), taking the same quiz one week apart over and over again should not result in a memory halflife of months or years. The student has only demonstrated (very clearly!) that they can remember the fact a week apart; their memory over longer periods hasn't yet been tested.

Unfortunately, both the ensemble and the Beta-power-law approaches mentioned above fail miserably on this third requirement.

*Code to generate the table below*

Starting with https://github.com/fasiha/ebisu/tree/v3-release-candidate, run this in the top-level directory:

```py
import ebisu

m1 = ebisu.initModel(100)
for i in range(20):
  print(i, ',', ebisu.modelToPercentileDecay(m1 := ebisu.updateRecall(m1, 1, 1, 100)))
```

and this in the `scripts/` directory to access the `betapowerlaw.py` script:

```py
import betapowerlaw as bp

m2 = [1.25, 1.25, 100]
for i in range(20):
  print(bp.modelToPercentileDecay(m2 := bp.updateRecall(m2, 1, 1, 100)))
```

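| index | Ensemble halflife (hours) | Beta powerlaw halflife (hours) |
|---|---|---|
| 0 | 185 | 236 |
| 1 | 267 | 463 |
| 2 | 359 | 842 |
| 3 | 458 | 1476 |
| 4 | 562 | 2535 |
| 5 | 668 | 4307 |
| 6 | 773 | 7269 |
| 7 | 877 | 12220 |
| 8 | 980 | 20497 |
| 9 | 1081 | 34332 |
| 10 | 1180 | 57459 |
| 11 | 1278 | 96119 |
| 12 | 1375 | 160741 |
| 13 | 1470 | 268762 |
| 14 | 1565 | 449324 |
| 15 | 1660 | 751140 |
| 16 | 1753 | 1255635 |
| 17 | 1847 | 2098912 |
| 18 | 1941 | 3508465 |
| 19 | 2034 | 5864550 |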

I can explain why both models have this flaw:

These two failure modes are independent and made me think about ways to circumvent both while keeping the other two desiderata listed above.

Here's where I ended up.

Consider a simple 3-atom ensemble with fixed weights (i.e., the weights don't change, so it's quite a stretch to call it an "ensemble"):

  1. a primary Ebisu v2 atom (a Beta distribution on recall at some halflife, with exponential decay)
  2. a strengthening atom, also Ebisu v2, with a halflife 2× (or N×) the primary; this model is never updated
  3. a long-term atom, a power-law that for now can be the betapowerlaw model proposed in the previous dev diary; this model is also never updated

Here's the idea: the primary atom is just an Ebisu v2 atom, so it's conservative: it evolves slowly and therefore is less vulnerable to the halflife growing dramatically after repeated quizzes on the same time interval. The second atom allows this model to circumvent the conservativeness of Ebisu v2: it explicitly posits that memory can strengthen organically and its halflife is pegged to twice (or N×) the first atom's halflife: this meets our second desideratum of realistic halflife growth after quizzes, and that's why it never needs updating. Finally, the third atom (the power law) makes explicit the chance that exponential decay is just wrong for this memory and captures the odds that without study the student will remember this fact for a year. This achieves the first desideratum of respectable predicted recall probabilities, and similarly doesn't need updating: it just exists to prop up the recall probability at long intervals.
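As a rough illustration of how the three atoms combine, the sketch below mixes three point-estimate recall curves with fixed weights and solves numerically for the overall halflife. It is a simplification, not the code in `scripts/split3.py`: each atom is reduced to a deterministic curve (no Beta distribution on recall), and the power-law shape `1 / (1 + t/h)`, the one-year long-term halflife, and the function names are all assumptions made for illustration.

```python
def predict_recall(t, primary_halflife, scale=2.0,
                   longterm_halflife=24 * 365, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Fixed-weight mixture of three recall curves at elapsed time t (hours)."""
    w_primary, w_strength, w_longterm = weights
    primary = 2.0 ** (-t / primary_halflife)                  # exponential decay
    strengthening = 2.0 ** (-t / (scale * primary_halflife))  # pegged to N x primary
    longterm = 1.0 / (1.0 + t / longterm_halflife)            # assumed power-law shape
    return w_primary * primary + w_strength * strengthening + w_longterm * longterm


def mixture_halflife(primary_halflife, **kwargs):
    """Bisect for the time at which the mixture's predicted recall drops to 50%."""
    lo, hi = 0.0, 1e7
    for _ in range(200):
        mid = (lo + hi) / 2
        if predict_recall(mid, primary_halflife, **kwargs) > 0.5:
            lo = mid  # recall still above 50%, so the halflife is later
        else:
            hi = mid
    return (lo + hi) / 2


h = mixture_halflife(100)
print(f"primary halflife 100 h -> mixture halflife {h:.0f} h")
```

In this sketch only the primary halflife would change after a quiz; the second curve follows from it and the third stays fixed, mirroring the "never updated" atoms described above.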

Here are the halflives for the three proposals after each of twenty successful quizzes, 100 hours apart, along with (in parentheses) how much bigger each halflife is than that model's starting halflife. The last column, the split approach, still shows unbounded growth of the halflife, but much slower: after twenty iterations it's only about 7× the starting halflife, versus about 17× (ensemble) and more than 600× (Beta power-law):

| Ensemble halflife | Powerlaw halflife | Split halflife |
|---|---|---|
| 185.8 (1.59x) | 168.4 (1.68x) | 426.8 (1.38x) |
| 267.2 (2.28x) | 259.1 (2.59x) | 538.8 (1.75x) |
| 359.2 (3.07x) | 379.7 (3.80x) | 645.5 (2.09x) |
| 458.8 (3.92x) | 540.3 (5.40x) | 747.5 (2.43x) |
| 562.6 (4.80x) | 754.3 (7.54x) | 845.7 (2.74x) |
| 668.0 (5.70x) | 1039.6 (10.40x) | 940.4 (3.05x) |
| 773.3 (6.60x) | 1419.7 (14.20x) | 1032.0 (3.35x) |
| 877.5 (7.49x) | 1926.4 (19.26x) | 1121.0 (3.64x) |
| 980.2 (8.37x) | 2601.6 (26.02x) | 1207.5 (3.92x) |
| 1081.1 (9.23x) | 3501.7 (35.02x) | 1291.8 (4.19x) |
| 1180.5 (10.08x) | 4701.2 (47.01x) | 1374.0 (4.46x) |
| 1278.4 (10.92x) | 6300.0 (63.00x) | 1454.4 (4.72x) |
| 1375.1 (11.74x) | 8430.8 (84.31x) | 1533.1 (4.98x) |
| 1470.8 (12.56x) | 11270.6 (112.71x) | 1610.3 (5.23x) |
| 1565.8 (13.37x) | 15055.4 (150.55x) | 1685.9 (5.47x) |
| 1660.1 (14.18x) | 20099.5 (200.99x) | 1760.2 (5.71x) |
| 1754.0 (14.98x) | 26821.8 (268.22x) | 1833.2 (5.95x) |
| 1847.6 (15.78x) | 35780.7 (357.81x) | 1905.1 (6.18x) |
| 1941.1 (16.58x) | 47720.3 (477.20x) | 1975.7 (6.41x) |
| 2034.6 (17.37x) | 63632.1 (636.32x) | 2045.4 (6.64x) |

(The absolute values in the third column look similar to those in the first column, but that's because the split-3-atom model started out at a higher halflife: its primary atom has a halflife of 100, so between the strengthening and the long-term atoms, the overall halflife is much higher than 100. That's why you want to pay attention to the parenthetical number: how much bigger this halflife is than the starting halflife.)

After some tweaking of the parameters of this model, we find that it's very competitive with the ensemble and the Beta-power-law approaches:

split-compare

*Dev instructions to generate this plot*

To obtain this plot,

  1. create a venv or Conda env,
  2. install dependencies: `python -m pip install numpy scipy pandas matplotlib tqdm ipython "git+https://github.com/fasiha/ebisu@v3-release-candidate"`,
  3. then clone this repo and check out the release candidate branch: `git clone https://github.com/fasiha/ebisu.git && cd ebisu && git fetch -a && git checkout v3-release-candidate`,
  4. download my Anki reviews database: [collection-no-fields.anki2.zip](https://github.com/fasiha/ebisu/files/13405477/collection-no-fields.anki2.zip), unzip it, and place `collection-no-fields.anki2` in the `scripts` folder so the script can find it,
  5. start ipython: `ipython`,
  6. run the script: `%run scripts/split3.py`. This will produce some text/figures.

Compare to the ensemble approach:

ensemble-compare

and the Beta-power-law results:

beta-powerlaw-compare

Indeed, for the first half of the graphs above (the flashcards for which I had a lot of failed quizzes), this "split-3-atom" model outperforms the two alternatives.

When I initially sketched this split-3-atom model, I thought the first atom would have a lot of weight, like 80%, and the next two atoms would have 10% each. Turns out that an equal split works the best, one-third weight for each. There also appears to be some advantage to scaling the second atom to 5x the first atom's halflife instead of 2x in terms of focal loss performance, but we'll have to see if that's "real" or just the loss function being weird.

As usual, I'm going to stew over this and poke around the text file generated by the script above, which delves into the predictions each model made for individual quizzes per flashcard. But I'm tentatively excited about this model. It lacks the mathematical elegance of the Beta power-law model and needs more parameters (specifically, the weights and the halflife scalar for the second atom), but so far I like its behavior a lot.

fasiha commented 9 months ago

As in the previous dev diary for the Beta power-law model, the script has a GRID_MODE flag that iterates over initial α=β as well as initial halflife, and for each tuple, sums the focal loss over all quizzes, all flashcards. That's what suggested the 24 hour halflife for the equal-weighted case:
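A grid search of this kind can be sketched as below. This is not the `GRID_MODE` code from the script: the toy quiz history, the focal-loss exponent `gamma=2`, the grid values, and the helper names are assumptions for illustration, and the stand-in model is a single Ebisu v2-style Beta prior whose expected recall is evaluated without any update between quizzes.

```python
import itertools
import math


def betaln(a, b):
    """Log of the Beta function, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def predict_recall(alpha, beta, halflife, elapsed):
    """Expected recall after `elapsed` hours for a Beta(alpha, beta) prior on recall at `halflife`."""
    delta = elapsed / halflife
    return math.exp(betaln(alpha + delta, beta) - betaln(alpha, beta))


def focal_loss(p, passed, gamma=2.0):
    """Focal loss for one quiz; gamma=0 reduces to the usual log loss."""
    p_true = p if passed else 1.0 - p
    p_true = min(max(p_true, 1e-12), 1.0)  # guard against log(0)
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Toy history: (hours since last review, passed?) pairs, made up for illustration.
quizzes = [(20, True), (50, True), (120, True), (300, False), (90, True)]

# Grid over (alpha=beta, initial halflife in hours).
grid = itertools.product([1.25, 2.0, 3.0], [12, 24, 48, 100])
totals = {
    (ab, h): sum(focal_loss(predict_recall(ab, ab, h, t), ok) for t, ok in quizzes)
    for ab, h in grid
}
best = min(totals, key=totals.get)
print("lowest total focal loss at alpha=beta=%g, halflife=%g h" % best)
```

The real script would also update each model after every quiz and sum over all flashcards; the sketch only shows the shape of the loop.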

focal-split

zxl777 commented 9 months ago

@fasiha I have developed a free online version of Flashcards, available at https://itoytoy.com/anki. I plan to use ebisu 3.0 and will regularly sync the review data from users' cards with you for further optimization of ebisu.

I have previously used ebisu 2.1 in my product, but feel that its potential has not been fully utilized in practical applications. After integrating ebisu 3.0, should any issues arise, I will consult with you for guidance.

Thank you.

fasiha commented 2 months ago

My friends at https://github.com/open-spaced-repetition/srs-benchmark/pull/112 introduced me to the idea of using AUC (area under the curve, where the curve is the ROC (receiver operating characteristic) curve) for quiz scheduling. This is a fantastic application of AUC/ROC!

Background: AUC/ROC arises naturally in binary classification when classifiers output a real number that is then quantized to give the final prediction class. The question inevitably arises: what's the threshold for the quantizer? If the threshold is very low, then the classifier will always say "True", leading to low missed detection rate (yay!) but high false alarm rate (sad). If the threshold is very high, then the classifier will say "True" very rarely, leading to tons of missed detections (sad) but very few false alarms (yay!). As you sweep the threshold from -∞ to +∞, it traces a curve which looks like this:

split-auc

Above image generated by https://github.com/fasiha/ebisu/blob/4644b4d9b4cbe55732d6c5c936d3e3dc884ba205/scripts/split3.py. To run this, check out the v3-release-candidate branch, grab my database of flashcards and unzip it in the scripts/ directory, and run scripts/split3.py.

ROC curves have this characteristic shape. A totally random classifier's ROC curve is the red dotted line, the 45 degree line from (0, 0) to (1, 1). Each curve plots the false positive rate against the true positive rate as the threshold sweeps from +∞ to -∞. You can integrate the area under the curve (AUC) to collapse each line into a single number.
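To make the threshold sweep concrete, here is a small sketch that computes ROC points and the trapezoidal AUC from predicted recall probabilities and pass/fail outcomes. The data and function names are invented for illustration; this is not how `scripts/split3.py` computes its curves.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called "True"
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points  # ends at (1.0, 1.0) once the threshold reaches the lowest score


def auc(points):
    """Integrate the ROC curve with the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up predicted recall probabilities and quiz outcomes (True = passed).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [True, True, False, True, True, False, False]
curve = roc_points(scores, labels)
print(curve)
print("AUC =", auc(curve))
```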

On this chart there are four lines that fall fairly naturally into two groups.

  1. red: a 3-atom model with primary halflife of one day
  2. blue: another 3-atom model with primary halflife of 100 hours (about four days), and other parameters similar to the first
  3. purple: an Ebisu v2 model (Beta distribution on recall) with low α=β=1.24 and one day halflife
  4. gray: another Ebisu v2 model with very low α=β=1 and one week (168 hour) halflife, which is what the benchmark in https://github.com/open-spaced-repetition/srs-benchmark/pull/112 uses

Intriguingly, the ROC and AUC for (1) and (3) are quite similar (0.69) while those of (2) and (4) are quite similar (0.64). The 3-atom models do much better in terms of focal loss (which I prefer over log-likelihood because it handles the imbalance between successful reviews (very common) and failed reviews (very uncommon) better):

split-compare

This is fascinating because, as described in various issues here as well as in https://github.com/open-spaced-repetition/srs-benchmark/pull/112#issuecomment-2321567400, Ebisu v2 is grotesquely pessimistic about the probability of recall (as shown by its terrible performance in focal loss in the second chart), but I had always guessed/hoped that it'd handle relative ranking between cards better. There was no real reason for this hope, other than, when I use Ebisu in my quiz apps, I was satisfied with which card it picked as most likely to be forgotten. The AUC is actually a metric that can potentially make this concrete: Wikipedia is surprisingly lucid here—

> [AUC] is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (permalink)

That is, AUC tells us how often Ebisu's predictRecall will give a higher number to a quiz that the student passed versus a quiz that the student failed. I think this means that AUC is a good metric for relative ranking?
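The ranking interpretation can be checked directly by counting pairs. The sketch below, using made-up numbers and a hypothetical helper name, computes the fraction of (passed, failed) quiz pairs in which the passed quiz received the higher predicted recall, counting ties as half a win.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """Probability that a random passed quiz outscores a random failed quiz."""
    passed = [s for s, y in zip(scores, labels) if y]
    failed = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > f else 0.5 if p == f else 0.0
               for p, f in product(passed, failed))
    return wins / (len(passed) * len(failed))


# Made-up predicted recall probabilities and quiz outcomes (True = passed).
scores = [0.9, 0.8, 0.75, 0.6, 0.5]
labels = [True, True, False, True, False]
print(pairwise_auc(scores, labels))

# A pessimistic model that cubes every prediction keeps the same ordering.
pessimistic = [s ** 3 for s in scores]
print(pairwise_auc(pessimistic, labels))
```

Because only the ordering of the scores matters, any strictly increasing transformation of the predictions leaves this number unchanged, which is how a model can be badly calibrated yet still rank cards well.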

Assuming this is the case (higher AUC → better at relative ranking of cards according to recall probability), the charts above show that at least one Ebisu v2 initial parameterization is competitive with the 3-atom model. This doesn't necessarily mean that Ebisu v2 (or the 3-atom model) is good at relative ranking (i.e., saying "card A is more likely to be forgotten than card B")! Both might be bad! Both might be good? The analysis in https://github.com/open-spaced-repetition/srs-benchmark/pull/112 suggests Ebisu v2 with initial α=β=1 and initial halflife of 7 days has very bad AUC. I need to run Ebisu v2 against that dataset, and after confirming I get the same value for AUC, I'll be able to say whether the split-3-atom model is better or worse at relative ranking, and whether AUC actually measures this.