garfieldnate closed this issue 3 years ago.
Thanks for commenting! I talked about this in another issue: https://github.com/fasiha/ebisu/issues/19#issuecomment-581235049
In a nutshell, yes, I think Ebisu can handle the binomial quiz case, like Duolingo, instead of a Bernoulli quiz case like it currently does. I'm just not sure when I can promise to make time to look into this, but it's definitely on my radar.
Closed by version 2.0.0: https://github.com/fasiha/ebisu/blob/gh-pages/CHANGELOG.md
Wow. That was fast!
So the binomial case. This models the quiz as a bunch of boolean results, right? Is there any model mismatch between that and the Anki case? Anki doesn't give the result of a series of boolean quizzes; it gives the user's confidence score about how well they think they know the answer. I'm working on a quiz app that uses this mechanism.
Your understanding is accurate: there's not an exact match between binomial quizzing and Anki's confidence ratings (in contrast, the binomial quiz corresponds well to Duolingo's quizzes).
Nonetheless I am tentatively confident that you can use the new API to fake Anki-style confidence; we just don't yet have guidelines on how to best do so (whereas we've built up a lot of confidence in Ebisu's handling of binary quizzes).
For example you could do this:
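One possibility is to map self-reported grades onto binomial arguments. This is a hypothetical sketch of my own: the grade names and the `(successes, total)` pairs are illustrative choices, not an official recommendation, and `ebisu.updateRecall(model, successes, total, tnow)` is the v2 binomial API discussed in this thread.

```python
# Hypothetical mapping from Anki-style self-ratings to binomial quiz
# arguments for Ebisu v2's updateRecall(model, successes, total, tnow).
# The grade names and the (successes, total) pairs are illustrative.

def grade_to_binomial(grade):
    """Map a self-reported grade to a (successes, total) pair."""
    mapping = {
        "again": (0, 1),  # plain failure: the ordinary binary quiz
        "good": (1, 1),   # plain success: the ordinary binary quiz
        "easy": (2, 2),   # two simulated successes: boosts the halflife more
    }
    return mapping[grade]

# An app would then call something like (assuming the ebisu package):
#   model = ebisu.updateRecall(model, *grade_to_binomial("easy"), tnow)
```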
But do note that there are risks to using large numbers here: I've tested `total=5` and below for a range of other parameters, but you can still encounter numerical instability and assertion failures in extreme edge cases, when `successes` is either close to 0 or close to `total`. I will update this thread with examples, but qualitatively, if a successful quiz increased the halflife by 50% for the `total=1` case, it might increase it by 2x for `total=2`, and 4x for `total=3`: there's some kind of exponential factor that I noticed during testing. This is what I mean when I say we don't have guidelines on how to hack this use case into the framework currently, and your experimentation will be a very useful part of developing them.

One thing I'll add is that so far I've built quiz apps with automatically-graded binary quizzes and really liked that. I personally will continue designing quiz apps like that, and will only allow the use of the true binomial case (with `total>1`) in cases when the student feels very strongly about going back and changing the rating of their quiz, like "I really, really got that flashcard, go back and mark it as super-easy" or "crap, that was really hard to remember, and yes, I know that exercising near-forgotten memories strengthens them, but please mark that as a near-miss".
Does all this make sense? I'm happy to elaborate. I'll update this with some numbers in roughly twelve hours.
the binomial case. This models the quiz as a bunch of boolean results, right?
A hyperfine quibble here: that is only somewhat the case. A binary (Bernoulli) experiment is a coin flip. A binomial quiz is N simultaneous coin flips (or actually, independent coin flips; they don't have to be simultaneous). As long as you keep this in mind when calling them "a bunch of booleans", you are good to go.
I mention this only because, if you actually have a sequence of boolean quizzes (where you show the user the correct answer after each quiz), then that's not a binomial quiz, that's an entirely normal sequence of binary quizzes. Let me know if I should explain more.
One thing I'll add is that so far I've built quiz apps with automatically-graded binary quizzes and really liked that.
I personally have enjoyed the slight interactivity that comes with grading yourself on the strength of the memory (like Anki, or my favorite, CleverDeck), but admittedly I don't know if it's actually beneficial to my learning. As a user it feels a little bit motivating to be able to declare to the app, "I almost got that perfectly!", and the added interactivity may contribute to using the app more often, which would mean more learning in the long run. But I don't know if this is actually useful information for scheduling reviews, especially since the currently-available apps leave it up to the user to decide how to grade themselves. Doing this with a proper Bayesian statistical model would require the app to learn this on a per-user basis.
when the student feels very strongly about going back and changing the rating of their quiz
So this is like if the user accidentally clicks the wrong button or something, right? Is this feature really the proper use for that? Seems like the app author should provide some kind of rewind/undo functionality for this that in the end just recalculates using `total=1`. Wouldn't `total=2, successes=1` lead to different model parameters?
if you actually have a sequence of boolean quizzes (where you show the user the correct answer after each quiz) then that's not a binomial quiz, that's an entirely normal sequence of binary quizzes
The second one is actually the Duolingo case, then; although the data they provide has floating point numbers, in actuality their app presents several quizzes in a row, and what they report in the dataset is the final score over these individual quizzes. Now that I write that (out loud? XD), that actually seems like an important problem with their dataset.
But if that's the Duolingo case, what is the actual case that's being modeled by the new feature? Or is the new feature even designed for the modeling of a specific case? The user must be quizzed several times with no dependencies between the answers on each quiz. I know it's common to assume independence for modeling purposes, but this really seems unlikely in a quiz setting like this; apps give feedback to correct a user between quizzes, or the user consistently gets a fact right or wrong.
I personally have enjoyed the slight interactivity that comes with grading yourself on the strength of the memory (like Anki, or my favorite, CleverDeck)
Excellent, this is good to know; it prevents author tunnel vision! For me, the extra cognitive burden of interrupting my reviews to make the meta-decision is draining, so thank you for the counterexample.
I don't know if this is actually useful information for scheduling reviews,
Intuitively, if something was easy, then its updated halflife should be scaled more aggressively than if it was very difficult to remember, right?
But I wonder how accurate that intuition is, and if the reality is more complicated than that. Occasionally I have the experience where I see a review and think to myself, "Wuh, when did the app teach me this?" I know the answer but can't recall learning it.
Other times I contrast the experience of (1) having to review something for which I didn't create a strong enough mnemonic, so I'm floundering for the correct answer, versus (2) a similar situation (a relatively young flashcard with a hastily-made mnemonic) but where I remember the answer easily. Does this mean case 1 was harder than case 2? Or just that brain performance, like physical performance, is related to various factors like time of day, food, sleep, stress, etc.?
Thinking about these things tempers my rush to convert all my apps from binary quizzes to faking ease with binomial quizzes, except when I feel strongly about it, which brings us to…
So this is like if the user accidentally clicks the wrong button or something, right?
No, more like, "I'm annoyed at how often I've been asked this, seriously, increase its halflife aggressively" (set `successes=total=2`) or "O m g, I really struggled to remember this, please ask me to review this sooner than otherwise" (set `successes=1, total=2`). In case of a typo, yes, the app should provide a simpler mechanism to correct it.
The second one is actually the Duolingo case, then; although the data they provide has floating point numbers, in actuality their app presents several quizzes in a row, and what they report in the dataset is the final score over these individual quizzes. Now that I write that (out loud? XD), that actually seems like an important problem with their dataset.
I forget if the Duolingo app gives you feedback for each sentence? I hate it so much I don't want to reinstall it to find out, forgive me. When I first wrote this, I thought it maybe didn't give detailed feedback on what you missed until the end, but I am likely misremembering.
But if that's the Duolingo case, what is the actual case that's being modeled by the new feature? Or is the new feature even designed for the modeling of a specific case?
The quiz style that would closest match the binomial quizzes Ebisu now handles would be where you're asked to recall the same flashcard multiple times in a single review session in a context where you don't particularly focus on that flashcard itself, and are instead occupied with the broader context of the review task at hand.
For example: if Duolingo asked you to translate a few sentences into French, and then at the end of the review session it reviewed what you'd missed, showing you that you'd misconjugated écrire once but got it right once, then that'd be a great match for binomial quizzes.
As you say, if you give feedback on a per-flashcard level after each quiz, then you probably have a series of binary quizzes. Such an app would also be very boring, since you'd be over-reviewing things too frequently (assuming you got them right, Ebisu would barely change the model after the first quiz, because so little time elapsed since the previous one).
So supporting that was one goal for allowing binomial quizzes. The other was to see if it turned out to be a good way to hack user-reported ease.
Hi, I'm also interested in this topic. If it all works out, I'll send my Dart port of Ebisu your way in due time :) [Update 2020-09-05: see #36.]
I forget if the Duolingo app gives you feedback for each sentence? I hate it so much I don't want to reinstall it to find out, forgive me. When I first wrote this, I thought it maybe didn't give detailed feedback on what you missed until the end, but I am likely misremembering.
Duolingo gives you immediate feedback on your answers; it doesn't wait until the end of the session. So @garfieldnate is right: it's just like a regular repeat, albeit with a very short interval.
Is that also a good way to treat it in Ebisu though? The second repeat happens only minutes after the first, so the recall probability will be close to 1. If the right answer is given the second time, the algorithm won't infer much from this; if the wrong answer is given, half-life is severely penalized. OK, I guess this is what we want actually :)
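To see why the pass carries so little information, here is a toy illustration using a plain exponential-forgetting curve. This is a deliberate simplification for intuition only, not Ebisu's full Bayesian model.

```python
def recall_probability(elapsed_hours, halflife_hours):
    """Toy exponential-forgetting curve: probability of recall after a delay.

    At the halflife, recall probability is exactly 0.5; shortly after a
    review, it is very close to 1.
    """
    return 2.0 ** (-elapsed_hours / halflife_hours)

# Five minutes after reviewing a card with a 24-hour halflife, predicted
# recall is above 99%, so observing a success is nearly certain and carries
# almost no information; a failure, being very surprising, moves the model
# a lot.
p = recall_probability(5 / 60, 24)
```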
I'll send my Dart port of Ebisu your way in due time :)
Wow, most kind thank you @ttencate!
Is that also a good way to treat it in Ebisu though? The second repeat happens only minutes after the first, so the recall probability will be close to 1. If the right answer is given the second time, the algorithm won't infer much from this; if the wrong answer is given, half-life is severely penalized. OK, I guess this is what we want actually :)
I agree with your analysis. As the quiz app author, you'll have to decide how to structure your quizzes, if you want to have closely-spaced repeats, and how to handle failures soon after successes. You should feel free to cheat Ebisu for the greater good.
I might add here that I'm working on a solution to the numerical instability problem for low `successes` and high `total` cases after very little time has elapsed. It's a pathological edge case that shouldn't happen often in real life, but I'm hoping to find a proper workaround; maybe it will help experimentation with binomial quizzes.
Cool, thanks for confirming.
Back on the topic of non-binary results: how hard would it be to implement this? Would the model need to be changed drastically, or could we come up with a way to interpret a fractional result as "partially forgotten"?
Faking it with the `successes` and `total` arguments might achieve the desired effect to some extent, but is it theoretically correct to just substitute x% by a `successes/total` ratio that approximates x%? If that is indeed correct, then a real-valued correctness score would be strictly more general, and thus better from an API point of view.
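To make the substitution concrete, here is a sketch of the approximation being asked about. This is my illustration, not something from Ebisu itself; it caps the denominator at a small value since large `total` values risk the numerical issues mentioned earlier in the thread.

```python
from fractions import Fraction

def score_to_binomial(score, max_total=5):
    """Approximate a real-valued score in [0, 1] by an integer
    (successes, total) pair with total <= max_total."""
    frac = Fraction(score).limit_denominator(max_total)
    return frac.numerator, frac.denominator

# score_to_binomial(0.8) -> (4, 5)
# score_to_binomial(0.5) -> (1, 2)
# score_to_binomial(1.0) -> (1, 1)
```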
Hmm, good questions. A couple of things.
The binary-to-binomial extension was straightforward mathematically (even if a chore to get the derivation and implementation right!), since a binary quiz is just a binomial quiz with integer `successes` and `total`.
But I will certainly think about how to do updates with a "soft-binary" result, i.e., a percent. Likely we can keep the same form of the model. Maybe we can make quizzes be noisy-binary… will think!
Edit: I think this is very relevant: https://stats.stackexchange.com/questions/419197/bayesian-inference-for-beta-distribution-after-an-uncertain-outcome though there's no guarantee we can include it into our GB1 prior (Beta prior after exponential forgetting) cleanly.
I found a tidy way to allow quiz results to be a float between 0 and 1 (inclusive). Code at this branch: https://github.com/fasiha/ebisu/commit/abecdbb813a12083ccb9c8dce24417659591f81a#diff-35ff1833326ddbe951882636c4cbc678R131
It uses the noisy-binary model linked earlier, and I describe it in detail below to invite comments on the API to design around it, because the mathematical model is more flexible than I want to expose programmatically.
Mathematically, a noisy-binary quiz consists of a normal binary quiz result (drawn from the prior probability distribution on recall probability; in the README I call this `x`) that goes through a scrambling process, giving you the observed noisy-binary quiz result, which we can call `z`. You don't get to see the original `x`, only `z`. Two parameters govern this scrambling process:
- `q_1 = Probability(z = 1 | x = 1)`, so `1 - q_1 = Probability(z = 0 | x = 1)`,
- `q_0 = Probability(z = 1 | x = 0)`, so `1 - q_0 = Probability(z = 0 | x = 0)`.
So a noisy-binary quiz result consists of:
- the observed result (`z`), and
- `q_1` between 0.5 and 1, and `q_0` between 0 and 0.5.

In Ebisu's terms, I'm thinking of the original quiz result (original as in, before scrambling) as the result you would have gotten if your recall was purely a function of the strength of the memory in your mind (which we have a probability distribution over). Meanwhile, the scrambled noisy-binary quiz result is whether you actually passed the quiz or not, and is related to any number of other factors beyond the strength of your memory (sleep, hunger, motivation, focus, etc.).
I admit it's tricky to carefully map probability into the real world, and this explanation might be forced, feel free to weigh in.
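For intuition, the marginal probability of observing a pass under this scrambling model is just a two-term mixture. This is a small sketch with the parameter names from above, not library code:

```python
def prob_observed_pass(p, q1, q0):
    """P(z = 1) when P(x = 1) = p, under the noisy-binary scrambling model.

    A pass is observed either because recall truly succeeded and was not
    flipped (probability q1 * p), or because a true failure was flipped
    into a pass (probability q0 * (1 - p)).
    """
    return q1 * p + q0 * (1 - p)

# With no noise (q1 = 1, q0 = 0), the observation is just x itself.
# With maximal noise (q1 = q0 = 0.5), z is a fair coin regardless of p,
# which is why the highest-noise rows of the table below leave the model
# unchanged.
```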
As mentioned above, the math gives us two knobs to independently turn, `q_0` and `q_1`, which, again, are the probabilities that you actually passed the quiz (`z=1`) given that the true unscrambled quiz result is `x=0` or `x=1`, respectively. If you don't want to think too much about this, I think it's reasonable to link both so `q_0 = 1 - q_1`. The table below shows the halflife after a noisy-binary quiz where the "Noise" column = `q_0 = 1 - q_1`. Noise of 0 means the no-noise binary case that we've known this whole time. The initial model is `(3.3, 3.3, 1)`, i.e., an initial halflife of 1. The table also shows different quiz times (`tnow`) to help gauge the algorithm's behavior:
Noise | Observed result | Quiz time | New halflife |
---|---|---|---|
0 | True | 0.25 | 1.061 |
0 | True | 1.0 | 1.241 |
0 | True | 3.0 | 1.718 |
0 | False | 0.25 | 0.762 |
0 | False | 1.0 | 0.816 |
0 | False | 3.0 | 0.902 |
0.1 | True | 0.25 | 1.052 |
0.1 | True | 1.0 | 1.188 |
0.1 | True | 3.0 | 1.362 |
0.1 | False | 0.25 | 0.852 |
0.1 | False | 1.0 | 0.849 |
0.1 | False | 3.0 | 0.914 |
0.25 | True | 0.25 | 1.037 |
0.25 | True | 1.0 | 1.113 |
0.25 | True | 3.0 | 1.143 |
0.25 | False | 0.25 | 0.931 |
0.25 | False | 1.0 | 0.901 |
0.25 | False | 3.0 | 0.937 |
0.5 | True | 0.25 | 1.000 |
0.5 | True | 1.0 | 1.000 |
0.5 | True | 3.0 | 1.000 |
0.5 | False | 0.25 | 1.000 |
0.5 | False | 1.0 | 1.000 |
0.5 | False | 3.0 | 1.000 |
The noise dial serves to dampen the impact of the review. When the noise level is 0, you get the normal Ebisu behavior. At the highest noise level, 0.5 (meaning `z` is a pure coin flip, without any dependence on `x`), the quiz is completely uninformative and gives you a totally unchanged updated model. In between, you get an updated model whose halflife is between these two.
I am considering updating `updateRecall`'s API to take a single float between 0 and 1 for noisy-binary results, and parsing it as follows:
if noisyResult > 0.5:
    result = True
    q_1 = noisyResult
    q_0 = 1 - noisyResult
else:
    result = False
    q_1 = 1 - noisyResult
    q_0 = noisyResult
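As a self-contained helper, the proposed parsing might look like this. This is my paraphrase of the snippet above, not the library's actual code:

```python
def parse_noisy_result(noisyResult):
    """Convert a float in [0, 1] to (result, q_1, q_0) per the rule above."""
    assert 0.0 <= noisyResult <= 1.0
    if noisyResult > 0.5:
        return True, noisyResult, 1 - noisyResult
    else:
        return False, 1 - noisyResult, noisyResult

# parse_noisy_result(1.0) gives (True, 1.0, 0.0): the ordinary binary pass.
# parse_noisy_result(0.9) gives a pass with noise 1 - 0.9.
# parse_noisy_result(0.0) gives (False, 1.0, 0.0): the ordinary binary failure.
```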
Might I propose the following matches to Anki's levels:
- `noisyResult = 1` (pass),
- `noisyResult = 0.9` (see table above: pass, noise=0.1),
- `noisyResult = 0` (fail).
I will add another function to the API to allow you to do the equivalent of Anki's "easy" and its inverse, "epic fail": it will take a model and a number to scale the halflife by, and return a new model with the same spread as the original but calibrated to a new halflife. This new function totally side-steps the Bayesian update process, and is intended to be used sparingly, for flashcards you really want to delay reviewing (e.g., scale the halflife by 2x) or that you want to review more frequently (scale the halflife by 0.5x).
Comments welcome. Related #19.
Thanks for doing all this! This is such great work using a pretty unique skill. Plenty of us can hack, but I don't know many that can apply this kind of mathematical skill while doing it.
In the code block you provided, I didn't understand the meaning of setting `result` to `True` or `False`. Shouldn't the value be determined by input to the API?
The API you present makes sense. Result plus a noise parameter. I do have some comments on the suggested usage, though.
Specifically for the Anki case, I think a better strategy would be 0/0.1/0.9/1 for the user judgement inputs. Since Anki presents the "Again" and "Easy" buttons like they are regular judgements and not special cases, I don't think it would make sense to step outside of the model for those inputs. Perhaps if the user were presented with three choices ("got it", "almost got it", "don't got it"), then the 0/0.9/1 input would make sense. Then, as a separate feature, we could allow the user to manually change the review times when they think to themselves, "I'm sick of this one, please stop showing it!" or "I don't remember ever seeing this before, better show me that again soon", using UI that makes it clear that this is an exceptional case.
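That four-value strategy could be sketched like so. The button names are Anki's, but which value attaches to which button is my assumed reading of the 0/0.1/0.9/1 suggestion, not something stated explicitly here:

```python
# Hypothetical mapping of Anki's four answer buttons to noisyResult
# values, following the 0/0.1/0.9/1 suggestion above. The assignment of
# values to buttons is an assumed reading of that suggestion.
ANKI_BUTTON_TO_NOISY_RESULT = {
    "again": 0.0,  # clear failure
    "hard": 0.1,   # failure-leaning: parsed as a fail with 0.1 noise
    "good": 0.9,   # pass-leaning: parsed as a pass with 0.1 noise
    "easy": 1.0,   # clear pass
}
```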
Also, it may ideally be worth learning the noise parameter (outside of Ebisu, not within it) for these judgements for each user, as they are quite subjective and can be interpreted differently by different users.
In the code block you provided, I didn't understand the meaning of setting `result` to `True` or `False`. Shouldn't the value be determined by input to the API?
Ah, I should clarify: the code snippet in my comment above, starting with `if noisyResult > 0.5`, would be inside `updateRecall`: you would only need to provide a float between 0 and 1. So you are free to construct any mapping between user responses and Ebisu for your app. I have only very vague memories of Anki, so your instinct about matching Anki's approach to Ebisu would doubtless be better than mine.
One thought I did have is, there's no way to use the noisy-binary update to dramatically change the halflife. It might have been nice if you could give 2.0 to the noisy-binary update to indicate "easy", but that won't work: noisy-binary can only dial between the binary 0 and 1 cases.
Side note though: the Ebisu version 2 binomial quiz model, with integer `successes` and `total`, does provide this. Giving `successes` of 0 or 2 when `total=2` serves to exponentially decrease or increase the halflife, since it models multiple independent trials of memory, and can push the halflife beyond what you'd get with `total=1` (the binary quiz case). I'm hesitant to officially recommend using this to achieve this effect though, since it is a modeling error… but at the same time, it's more principled than just rescaling the halflife (more below).
I'm still sorting out my feelings about offering three ways to change models:
1. plain binary quizzes via `updateRecall`,
2. binomial quizzes, by passing `successes` and `total` to `updateRecall` (though having high `total - successes` with `tnow` much lower than the halflife can cause numerical instability; I'm working on it), and
3. directly rescaling the halflife (below).

Ebisu used to offer one method; now it will offer three. I'm somewhat concerned that this makes the API harder to learn to use effectively, and constrains the future evolution of the code. But I think it's reasonable to offer this menu to quiz app authors.
Here's the tentative API and docstring for `rescaleHalflife` (full source):
def rescaleHalflife(prior, scale=1.):
"""Given any model, return a new model with the original's halflife scaled.
Use this function to adjust the halflife of a model.
Perhaps you want to see this flashcard far less, because you *really* know it.
`newModel = rescaleHalflife(model, 5)` to shift its memory model out to five
times the old halflife.
Or if there's a flashcard that suddenly you want to review more frequently,
perhaps because you've recently learned a confuser flashcard that interferes
with your memory of the first, `newModel = rescaleHalflife(model, 0.1)` will
reduce its halflife by a factor of one-tenth.
Useful tip: the returned model will have matching α = β, where `alpha, beta,
newHalflife = newModel`. This happens because we first find the old model's
halflife, then we time-shift its probability density to that halflife. That's
the distribution this function returns, except at the *scaled* halflife.
"""
Comments and questions and thrown tomatoes welcome.
I don't have all of the math expertise you have, but it seems like the three methods of updating are fairly generalizable to new models, if you choose to evolve it. I can understand the hesitation, though, given you have several different language implementations to maintain.
Hi Ahmed, I don't mean to bother you, but I want to express my continued interest in this topic :) I'm working on a quiz app and would love to be able to specify floats for the recall update.
Thanks for pinging @garfieldnate! I will aim to package up the changes we talked about in this thread this week.
We also have #41 with a better way to initialize rebalance and #31 to always rebalance that I'd like to push out but those are behind-the-scenes changes I can work on whenever. I'd like to avoid delaying releasing fuzzy reviews and the rescaling API!
(Part of the reason I delayed releasing these was because of #43, which raised a crucial modeling issue that made me go back to the drawing board for much of Ebisu. That issue, though brand new, was raised on Reddit in late November 2020, so apologies, I still delayed almost six months on this issue!)
No need to apologize! It's volunteer work! I really appreciate what you have been sharing here and also how responsive you are.
@garfieldnate I haven't forgotten about this, please expect this to land within a couple of days, and please feel free to ping if it doesn't and you get tired of waiting. It's the usual thing: prototyping something is often the easy part; productionizing it with unit tests, the insane standard for documentation I'm holding myself to for this repo, etc. means things take 10x to 100x longer.
Hello @fasiha! I was wondering if you were planning to publish a changelog for version 2.1.0? Thank you!
@poolebu yes! It's at https://github.com/fasiha/ebisu/blob/gh-pages/CHANGELOG.md
@poolebu in short, and maybe I should highlight this more in the changelog: no breaking changes, hence 2.0 -> 2.1, just more functionality. But now that I think about it, the underlying behavior of `updateRecall` changed, so calling it now at 2.1 with the same arguments will result in different numbers than calling it before at 2.0. Does that mean it should have been a major version update?
Hello @fasiha! Thank you for all the changes and documentation. I will be trying `rescaleHalflife` in the upcoming months. My app will probably have a beta user group, and we will get user feedback and stats.
I do not think the new `updateRecall` behavior should have required a major version update if the differences are statistically very minor.
Thank you again, congrats on the new updates, and I wish you a great day!
Many quiz systems do not assign a simple pass/fail to study events. Some systems, like Anki, simply ask the user how well they think they know the answer. Others, like Duolingo, assign a score based on performance in a study session with several exercises. It would be great if Ebisu could be extended to handle this case, so `updateRecall(prior: tuple, result: bool, tnow: float)` would be changed to `updateRecall(prior: tuple, result: float, tnow: float)`. This would also enable comparison with the other systems compared in the half-life regression paper from Duolingo, meaning this ticket may be a prerequisite for #22.