fasiha / ebisu

Public-domain Python library for flashcard quiz scheduling using Bayesian statistics. (JavaScript, Java, Dart, and other ports available!)
https://fasiha.github.io/ebisu
The Unlicense

non-boolean quiz results #23

Closed (garfieldnate closed this 3 years ago)

garfieldnate commented 4 years ago

Many quiz systems do not assign a simple pass-fail to study events. Some systems, like Anki, simply ask the user how well they think they know the answer. Others, like Duolingo, assign a score based on performance in a study session with several exercises. It would be great if Ebisu could be extended to handle this case; so `updateRecall(prior: tuple, result: bool, tnow: float)` would be changed to `updateRecall(prior: tuple, result: float, tnow: float)`.

This would also enable comparison with the other systems evaluated in the half-life regression paper from Duolingo, meaning this ticket may be a prerequisite for #22.

fasiha commented 4 years ago

Thanks for commenting! I talked about this in another issue: https://github.com/fasiha/ebisu/issues/19#issuecomment-581235049

In a nutshell, yes, I think Ebisu can handle the binomial quiz case, like Duolingo, instead of a Bernoulli quiz case like it currently does. I'm just not sure when I can promise to make time to look into this, but it's definitely on my radar.

fasiha commented 4 years ago

Closed by version 2.0.0: https://github.com/fasiha/ebisu/blob/gh-pages/CHANGELOG.md

garfieldnate commented 4 years ago

Wow. That was fast! 😄

So the binomial case. This models the quiz as a bunch of boolean results, right? Is there any model mismatch between that and the Anki case? Anki doesn't give the result of a series of boolean quizzes. It gives the user's confidence score about how well they think they know the answer. I'm working on a quiz app that uses this mechanism.

fasiha commented 4 years ago

Your understanding is accurate: there's not an exact match between binomial quizzing and Anki's confidence ratings (in contrast, the binomial quiz corresponds well to Duolingo's quizzes 😁).

Nonetheless I am tentatively confident that you can use the new API to fake Anki-style confidence; we just don't yet have guidelines on how best to do so (whereas we've built up a lot of confidence in Ebisu's handling of binary quizzes).

For example you could do this:
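(A rough sketch; the helper name and the exact successes/total pairs below are illustrative, not a recommendation.)

import ebisu

def updateWithConfidence(prior, rating, tnow):
  # Hypothetical mapping from an Anki-style self-rating onto Ebisu v2's
  # binomial (successes, total) arguments; tune the pairs to taste.
  successes, total = {
      'again': (0, 1),  # plain failure
      'hard': (1, 2),   # partial credit: one success out of two trials
      'good': (1, 1),   # plain success
      'easy': (2, 2),   # two successes: a more aggressive halflife boost
  }[rating]
  return ebisu.updateRecall(prior, successes, total, tnow)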

But do note that there are risks to using a large number of successes:

  1. the unit tests cover total=5 and below for a range of other parameters, but you can still encounter numerical instability and assertion failures in extreme edge cases.
  2. Moreover, the math is very aggressive when successes is either close to 0 or close to total. I will update this thread with examples, but qualitatively: if a successful quiz increased the halflife by 50% for the total=1 case, it might increase it by 2x for total=2, and 4x for `total=3`. There's some kind of exponential factor that I noticed during testing, and this is what I mean when I say we don't yet have guidelines on how to hack this use case into the framework; your experimentation will be a very useful part of that 😁.

One thing I'll add: so far I've built quiz apps with automatically-graded binary quizzes and really liked that. I personally will continue designing quiz apps like that, only allowing the true binomial case (with total>1) when the student feels very strongly about going back and changing the rating of their quiz, like "I really, really got that flashcard, go back and mark it as super-easy" or "crap, that was really hard to remember, and yes, I know that exercising near-forgotten memories strengthens them, but please mark that as a near-miss".

Does all this make sense? I'm happy to elaborate. I'll update this with some numbers in roughly twelve hours.

fasiha commented 4 years ago

the binomial case. This models the quiz as a bunch of boolean results, right?

A hyperfine quibble here: that is only somewhat the case. A binary (Bernoulli) experiment is a coin flip. A binomial quiz is N simultaneous coin flips (or actually, independent coin flips; they don't have to be simultaneous). As long as you keep this in mind when calling them "a bunch of booleans" you are good to go.

I mention this only because, if you actually have a sequence of boolean quizzes (where you show the user the correct answer after each quiz), then that's not a binomial quiz; that's an entirely normal sequence of binary quizzes 😆. Let me know if I should explain more.

garfieldnate commented 4 years ago

One thing I'll add: so far I've built quiz apps with automatically-graded binary quizzes and really liked that.

I personally have enjoyed the slight interactivity that comes with grading yourself on the strength of the memory (like Anki, or my favorite, CleverDeck), but admittedly I don't know if it's actually beneficial to my learning. As a user it feels a little bit motivating to be able to declare to the app, "I almost got that perfectly!", and the added interactivity may contribute to using the app more often, which would mean more learning in the long run. But I don't know if this is actually useful information for scheduling reviews, especially since the currently-available apps leave it up to the user to decide how to grade themselves. Doing this with a proper Bayesian statistical model would require the app to learn this on a per-user basis.

when the student feels very strongly about going back and changing the rating of their quiz

So this is like if the user accidentally clicks the wrong button or something, right? Is this feature really the right tool for that? It seems like the app author should provide some kind of rewind/undo functionality that, in the end, just recalculates using total=1. Wouldn't total=2, successes=1 lead to different model parameters?

if you actually have a sequence of boolean quizzes (where you show the user the correct answer after each quiz), then that's not a binomial quiz; that's an entirely normal sequence of binary quizzes

The second one is actually the Duolingo case, then; although the data they provide has floating point numbers, in actuality their app presents several quizzes in a row, and what they report in the dataset is the final score over these individual quizzes. Now that I write that (out loud? XD), that actually seems like an important problem with their dataset.

But if that's the Duolingo case, what is the actual case that's being modeled by the new feature? Or is the new feature even designed for the modeling of a specific case? For the binomial model to apply, the user must be quizzed several times with no dependence between the answers on each quiz. I know it's common to assume independence for modeling purposes, but this really seems unlikely in a quiz setting like this: apps give feedback to correct a user between quizzes, or the user consistently gets a fact right or wrong.

fasiha commented 4 years ago

I personally have enjoyed the slight interactivity that comes with grading yourself on the strength of the memory (like Anki, or my favorite, CleverDeck)

Excellent, this is good to know; it prevents author tunnel vision 😇! For me, the extra cognitive burden of interrupting my reviews to make the meta-decision is draining, so thank you for the counterexample.

I don't know if this is actually useful information for scheduling reviews,

Intuitively, if something was easy, then its updated halflife should be scaled more aggressively than if it was very difficult to remember, right?

But I wonder how accurate that intuition is, and if the reality is more complicated than that. Occasionally I have the experience where I see a review and think to myself, "Wuh, when did the app teach me this?" I know the answer but can't recall learning it.

Other times I contrast the experience of (1) having to review something I didn't create a strong enough mnemonic for, so I'm floundering for the correct answer, versus (2) a similar situation (a relatively young flashcard with a hastily-made mnemonic) where I remember the answer easily. Does this mean case 1 was harder than case 2? Or just that brain performance, like physical performance, is related to various factors like time of day, food, sleep, stress, etc.?

Thinking about these things tempers my rush to convert all my apps from binary quizzes to faking ease with binomial quizzes, except when I feel strongly about it, which brings us to…

So this is like if the user accidentally clicks the wrong button or something, right?

No, more like, "I'm annoyed at how often I've been asked this, seriously, increase its halflife aggressively" (set successes=total=2) or "O m g, I really struggled to remember this, please ask me to review this sooner than otherwise" (set successes=1, total=2). In case of a typo, yes, the app should provide a simpler mechanism to correct it.
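
In code, those two adjustments might look like this (a sketch; the model and elapsed time are made up):

import ebisu

model = (3.3, 3.3, 1.0)  # example model with a halflife of 1 time-unit
# "seriously, increase its halflife aggressively":
easier = ebisu.updateRecall(model, 2, 2, 1.0)
# "please ask me to review this sooner than otherwise":
harder = ebisu.updateRecall(model, 1, 2, 1.0)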

The second one is actually the Duolingo case, then; although the data they provide has floating point numbers, in actuality their app presents several quizzes in a row, and what they report in the dataset is the final score over these individual quizzes. Now that I write that (out loud? XD), that actually seems like an important problem with their dataset.

I forget if the Duolingo app gives you feedback for each sentence? I hate it so much I don't want to reinstall it to find out, forgive me. When I first wrote this, I thought it maybe didn't give detailed feedback on what you missed until the end, but I am likely misremembering.

But if that's the Duolingo case, what is the actual case that's being modeled by the new feature? Or is the new feature even designed for the modeling of a specific case?

The quiz style that would closest match the binomial quizzes Ebisu now handles would be where you're asked to recall the same flashcard multiple times in a single review session in a context where you don't particularly focus on that flashcard itself, and are instead occupied with the broader context of the review task at hand.

For example: if Duolingo asked you to translate a few sentences into French and then, at the end of the review session, reviewed what you'd missed, showing you that, e.g., you'd misconjugated écrire once but got it right once, then that'd be a great match for binomial quizzes.

As you say, if you give feedback on a per-flashcard level after each quiz, then you probably have a series of binary quizzes. Such an app would also be very boring, since you'd be over-reviewing things too frequently (assuming you got them right, Ebisu would barely change the model after the first quiz, because so little time elapsed since the previous quiz).

So supporting that was one goal for allowing binomial quizzes. The other was to see if it turned out to be a good way to hack user-reported ease.

ttencate commented 4 years ago

Hi, I'm also interested in this topic. If it all works out, I'll send my Dart port of Ebisu your way in due time :) [Update 2020-09-05: see #36.]

I forget if the Duolingo app gives you feedback for each sentence? I hate it so much I don't want to reinstall it to find out, forgive me. When I first wrote this, I thought it maybe didn't give detailed feedback on what you missed until the end, but I am likely misremembering.

Duolingo gives you immediate feedback on your answers; it doesn't wait until the end of the session. So @garfieldnate is right: it's just like a regular repeat, albeit with a very short interval.

Is that also a good way to treat it in Ebisu though? The second repeat happens only minutes after the first, so the recall probability will be close to 1. If the right answer is given the second time, the algorithm won't infer much from this; if the wrong answer is given, half-life is severely penalized. OK, I guess this is what we want actually :)
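
To put rough numbers on that intuition (a sketch with made-up times, using Ebisu's Python API and a model whose time unit is days):

import ebisu

model = (3.3, 3.3, 1.0)  # halflife of 1 day, say
tnow = 5 / (60 * 24)     # a second repeat five minutes later
print(ebisu.predictRecall(model, tnow, exact=True))  # ≈ 1: recall near-certain
print(ebisu.updateRecall(model, 1, 1, tnow))  # success: model barely moves
print(ebisu.updateRecall(model, 0, 1, tnow))  # failure: halflife drops hard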

fasiha commented 4 years ago

I'll send my Dart port of Ebisu your way in due time :)

Wow, most kind, thank you @ttencate!

Is that also a good way to treat it in Ebisu though? The second repeat happens only minutes after the first, so the recall probability will be close to 1. If the right answer is given the second time, the algorithm won't infer much from this; if the wrong answer is given, half-life is severely penalized. OK, I guess this is what we want actually :)

I agree with your analysis. As the quiz app author, you'll have to decide how to structure your quizzes, if you want to have closely-spaced repeats, and how to handle failures soon after successes. You should feel free to cheat Ebisu for the greater good.

I might add here that I'm working on a solution to the numerical instability problem for low successes and high total cases after very little time has elapsed. It's a pathological edge case that shouldn't happen often in real life but I'm hoping to find a proper workaround, maybe it will help experimentation with binomial quizzes.

ttencate commented 4 years ago

Cool, thanks for confirming.

Back on the topic of non-binary results: how hard would it be to implement this? Would the model need to be changed drastically, or could we come up with a way to interpret a fractional result as "partially forgotten"?

Faking it with the successes and total arguments might achieve the desired effect to some extent, but is it theoretically correct to just substitute x% by a successes/total ratio that approximates x%? If that is indeed correct, then a real-valued correctness score would be strictly more general, and thus better from an API point of view.

fasiha commented 4 years ago

Hmm, good questions. A couple of things.

The binary-to-binomial extension was straightforward mathematically (even if a chore to get the derivation and implementation right!), since a binary quiz is a binomial quiz, with integer successes and total.

But I will certainly think about how to do updates with a "soft-binary" result, i.e., a percent. Likely we can keep the same form of the model. Maybe we can make quizzes be noisy-binary… will think!

Edit: I think https://stats.stackexchange.com/questions/419197/bayesian-inference-for-beta-distribution-after-an-uncertain-outcome is very relevant, though there's no guarantee we can incorporate it into our GB1 prior (Beta prior after exponential forgetting) cleanly.

fasiha commented 4 years ago

I found a tidy way to allow quiz results to be a float between 0 and 1 (inclusive). Code is on a branch: https://github.com/fasiha/ebisu/commit/abecdbb813a12083ccb9c8dce24417659591f81a#diff-35ff1833326ddbe951882636c4cbc678R131

It uses the noisy binary model linked earlier and I describe it in detail below to invite comments on the API to design around it, because the mathematical model is more flexible than I want to expose programmatically.

Mathematically, a noisy-binary quiz consists of a normal binary quiz (drawn from the prior probability distribution on recall probability; in the README I call this x) that goes through a scrambling process, giving you the observed noisy-binary quiz result, which we can call z. You don't get to see the original x, only z. Two parameters govern this scrambling process: q_1, the probability of observing a pass (z=1) given the true unscrambled result was a success (x=1), and q_0, the probability of observing a pass (z=1) given the true result was a failure (x=0).

So a noisy-binary quiz result consists of:

  1. the boolean result, True or False (this is z),
  2. q_1 between 0.5 and 1, and
  3. q_0 between 0 and 0.5.
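
These combine by the law of total probability: if p is the recall probability, then P(z = 1 | p) = q_1 · p + q_0 · (1 - p). (At q_0 = q_1 = 0.5, z becomes independent of p, which is why the noisiest rows of the table below leave the model unchanged.)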

In Ebisu's terms, I'm thinking of the original quiz result (original as in, before scrambling) as the result you would have gotten if your recall was purely a function of the strength of the memory in your mind (which we have a probability distribution over). Meanwhile, the scrambled noisy-binary quiz result is whether you actually passed the quiz or not, and is related to any number of other factors beyond the strength of your memory (sleep, hunger, motivation, focus, etc.).

I admit it's tricky to carefully map probability into the real world, and this explanation might be forced; feel free to weigh in.

As mentioned above, the math gives us two knobs to independently turn, q_0 and q_1, which, again, are the probabilities that you actually passed the quiz (z=1) given that the true unscrambled quiz result x=0 or x=1 respectively. If you don't want to think too much about this, I think it's reasonable to link both so q_0 = 1 - q_1. The table below shows the halflife after a noisy-binary quiz where the "Noise" column = q_0 = 1 - q_1. Noise of 0 means the no-noise binary case that we've known this whole time. The initial model is (3.3, 3.3, 1), i.e., an initial halflife of 1. The table also shows different quiz times (tnow) to help gauge the algorithm's behavior:

| Noise | Observed result | Quiz time | New halflife |
|-------|-----------------|-----------|--------------|
| 0     | True            | 0.25      | 1.061        |
| 0     | True            | 1.0       | 1.241        |
| 0     | True            | 3.0       | 1.718        |
| 0     | False           | 0.25      | 0.762        |
| 0     | False           | 1.0       | 0.816        |
| 0     | False           | 3.0       | 0.902        |
| 0.1   | True            | 0.25      | 1.052        |
| 0.1   | True            | 1.0       | 1.188        |
| 0.1   | True            | 3.0       | 1.362        |
| 0.1   | False           | 0.25      | 0.852        |
| 0.1   | False           | 1.0       | 0.849        |
| 0.1   | False           | 3.0       | 0.914        |
| 0.25  | True            | 0.25      | 1.037        |
| 0.25  | True            | 1.0       | 1.113        |
| 0.25  | True            | 3.0       | 1.143        |
| 0.25  | False           | 0.25      | 0.931        |
| 0.25  | False           | 1.0       | 0.901        |
| 0.25  | False           | 3.0       | 0.937        |
| 0.5   | True            | 0.25      | 1.000        |
| 0.5   | True            | 1.0       | 1.000        |
| 0.5   | True            | 3.0       | 1.000        |
| 0.5   | False           | 0.25      | 1.000        |
| 0.5   | False           | 1.0       | 1.000        |
| 0.5   | False           | 3.0       | 1.000        |

The noise dial serves to dampen the impact of the review. When the noise level is 0, you get the normal Ebisu behavior. At noise level 0.5 (the highest noise level, meaning z is a pure coin flip, without any dependence on x), the quiz is completely uninformative and gives you a totally unchanged updated model. In between, you get an updated model whose halflife is between these two.

I am considering updating updateRecall's API to take a single float between 0 and 1 for noisy-binary results, and parsing it as follows:

if noisyResult > 0.5:
  # an observed (noisy) pass: the float is the confidence in the pass
  result = True
  q_1 = noisyResult
  q_0 = 1 - noisyResult
else:
  # an observed (noisy) failure: mirrored, so q_1 stays in [0.5, 1]
  result = False
  q_1 = 1 - noisyResult
  q_0 = noisyResult
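
Under that parsing, a single float would reproduce the noisy rows of the table above; for example (a sketch, assuming the extended updateRecall keeps its current argument order and accepts a float successes with total=1):

import ebisu

model = (3.3, 3.3, 1.0)
# noisyResult = 0.9, i.e., noise = 0.1, observed result True, tnow = 1.0:
newModel = ebisu.updateRecall(model, 0.9, 1, 1.0)
print(ebisu.modelToPercentileDecay(newModel))  # ≈ 1.188 per the table above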

Might I propose the following matches to Anki's levels:

I will add another function in the API to allow you to do the equivalent of Anki's "easy" and its inverse, "epic fail": it will take a model and a number to scale the halflife with, and return a new model with the same spread as the original but calibrated to a new halflife. This new function totally side-steps the Bayesian update process, and is intended to be used sparingly for flashcards you really want to delay reviewing (e.g., scale halflife by 2x, say) or that you want to review more frequently (scale halflife by 0.5x).

Comments welcome. Related #19.

garfieldnate commented 4 years ago

Thanks for doing all this! This is such great work using a pretty unique skill. Plenty of us can hack, but I don't know many that can apply this kind of mathematical skill while doing it.

In the code block you provided, I didn't understand the meaning of setting result to True or False. Shouldn't the value be determined by input to the API?

The API you present makes sense. Result plus a noise parameter. I do have some comments on the suggested usage, though.

Specifically for the Anki case, I think a better strategy would be 0/0.1/0.9/1 for the user judgement inputs. Since Anki presents the "Again" and "Easy" buttons like they are regular judgements and not special cases, I don't think it would make sense to step outside of the model for those inputs. Perhaps if the user were presented with three choices ("got it", "almost got it", "don't got it"), then the 0/0.9/1 input would make sense. Then as a separate feature we could allow the user to manually change the review times when they think to themselves, "I'm sick of this one, please stop showing it!" or "I don't remember ever seeing this before, better show me that again soon", using UI that makes it clear that this is an exceptional case.
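
That four-button mapping, spelled out (a sketch; the keys are Anki's button names, the floats would feed the proposed updateRecall):

ANKI_TO_RESULT = {'again': 0.0, 'hard': 0.1, 'good': 0.9, 'easy': 1.0}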

Also, ideally it may be worth learning the noise parameter (outside of Ebisu, not within it) for these judgements for each user, as they are quite subjective and can be interpreted differently by different users.

fasiha commented 4 years ago

In the code block you provided, I didn't understand the meaning of setting result to True or False. Shouldn't the value be determined by input to the API?

Ah I should clarify, the code snippet in my comment above, starting with if noisyResult > 0.5, would be inside updateRecall: you would only need to provide a float between 0 and 1. So you are free to construct any mapping between user responses and Ebisu for your app. I have only very vague memories of Anki so your instinct about matching Anki's approach to Ebisu would doubtless be better than mine.

One thought I did have is, there's no way to use the noisy-binary update to dramatically change the halflife: it might have been nice if you could give 2.0 to the noisy-binary update to indicate "easy", but that won't work. Noisy-binary can only dial between the binary 0 and 1 cases.

A side note though: the Ebisu version 2 binomial quiz model, with integer successes and total, does provide this. Giving successes of 0 or 2 when total=2 serves to exponentially decrease or increase the halflife, since it models multiple independent trials of memory, and can push the halflife beyond what you'd get with total=1 (the binary quiz case). I'm hesitant to officially recommend using this to achieve this effect, though, since it is a modeling error… But at the same time, it's more principled than just rescaling the halflife (more below).
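
For a concrete feel of that compounding effect (a sketch; the model and elapsed time are made up, and modelToPercentileDecay reads off the halflife):

import ebisu

model = (3.3, 3.3, 1.0)
for successes, total in [(1, 1), (2, 2), (3, 3)]:
  updated = ebisu.updateRecall(model, successes, total, 1.0)
  print(total, ebisu.modelToPercentileDecay(updated))
# each extra success compounds the halflife boost roughly multiplicatively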

I'm still sorting out my feelings about offering three ways to change models:

  1. noisy-binary quiz, where you provide a single float to updateRecall,
  2. binomial quiz session, where you provide integer successes and total to updateRecall (though having a large total minus successes with tnow much lower than the halflife can cause numerical instability 😣 I'm working on it),
  3. quiz rescaling (more below)

Ebisu used to offer one method, now it will offer three. I'm somewhat concerned that this makes the API harder to learn to use effectively, and constrains the future evolution of the code. But I think it's reasonable to offer this menu to quiz app authors.

Here's the tentative API and docstring for rescaleHalflife (full source):

def rescaleHalflife(prior, scale=1.):
  """Given any model, return a new model with the original's halflife scaled.

  Use this function to adjust the halflife of a model.

  Perhaps you want to see this flashcard far less, because you *really* know it.
  `newModel = rescaleHalflife(model, 5)` to shift its memory model out to five
  times the old halflife.

  Or if there's a flashcard that suddenly you want to review more frequently,
  perhaps because you've recently learned a confuser flashcard that interferes
  with your memory of the first, `newModel = rescaleHalflife(model, 0.1)` will
  reduce its halflife by a factor of one-tenth.

  Useful tip: the returned model will have matching α = β, where `alpha, beta,
  newHalflife = newModel`. This happens because we first find the old model's
  halflife, then we time-shift its probability density to that halflife. That's
  the distribution this function returns, except at the *scaled* halflife.
  """

Comments and questions and thrown tomatoes welcome.

garfieldnate commented 4 years ago

I don't have all of the math expertise you have, but it seems like the three methods of updating are fairly generalizable to new models, if you choose to evolve it. I can understand the hesitation, though, given you have several different language implementations to maintain.

garfieldnate commented 3 years ago

Hi Ahmed, I don't mean to bother you, but I want to express my continued interest in this topic :) I'm working on a quiz app and would love to be able to specify floats for the recall update.

fasiha commented 3 years ago

Thanks for pinging @garfieldnate! I will aim to package up the changes we talked about in this thread this week.

We also have #41 with a better way to initialize rebalance and #31 to always rebalance, which I'd like to push out, but those are behind-the-scenes changes I can work on whenever. I'd like to avoid delaying the release of fuzzy reviews and the rescaling API!

(Part of the reason I delayed releasing these was because of #43, which raised a crucial modeling issue that made me go back to the drawing board for much of Ebisu 😅. That issue, though only recently filed, was first raised on Reddit in late November 2020, so apologies: I still delayed almost six months on this issue!)

garfieldnate commented 3 years ago

No need to apologize! It's volunteer work 😄 I really appreciate what you have been sharing here and also how responsive you are.

fasiha commented 3 years ago

@garfieldnate I haven't forgotten about this, please expect this to land within a couple of days, and please feel free to ping if it doesn't and you get tired of waiting. It's the usual thing: prototyping something is often the easy part; productionizing it with unit tests and the insane standard of documentation I'm holding myself to for this repo, etc., means things take 10x to 100x longer.

fasiha commented 3 years ago

@garfieldnate thanks for your patience! Pushed to PyPI 😄!

ernesto-butto commented 3 years ago

Hello @fasiha! I was wondering if you were planning to publish a changelog for version 2.1.0? Thank you!

fasiha commented 3 years ago

@poolebu yes! It's at https://github.com/fasiha/ebisu/blob/gh-pages/CHANGELOG.md

fasiha commented 3 years ago

@poolebu in short, and maybe I should highlight this more in the changelog: no breaking changes, hence the 2.0 -> 2.1, just more functionality. But now that I think about it, the underlying behavior of updateRecall changed, so calling it now at 2.1 with the same arguments will result in different numbers than calling it before at 2.0. Does that mean it should have been a major version update? 🤔

ernesto-butto commented 3 years ago

Hello @fasiha! Thank you for all the changes and documentation. I will be trying rescaleHalflife in the upcoming months. Probably my app will have a beta user group and we will get users feedback and stats.

I do not think the new updateRecall behavior should require a major version update if the differences are statistically very minor.

Thank you again, congrats on the new updates, and I wish you a great day!