@Expertium @user1823, what do you think? I will reproduce the result later.
We need to check the stability of the parameters too, i.e., the parameters shouldn't fluctuate too much when re-optimized after doing some more reviews.
I also find that pretrain could improve the performance significantly when the collection is small.
user1823's point is interesting. We need to design some experiments to check the stability of parameters.
I notice that the test set is only 1/4 the size of the training set. That's rare in practice because users wouldn't optimize their parameters that frequently.
Bad news: I found a severe pitfall in the benchmark code while designing the experiment. I will have to re-benchmark all models this week.
Briefly, the benchmark code has a data leakage problem: the test set is used to pick the best parameters during the training process.
How and why would pretrain work? FSRS already has default parameters built into it.
Or do you mean optimizing the first 4 parameters? Is that what you mean by "pretrain"? I was thinking of the kind of pretrain that you did for neural networks.
Regarding the stability of parameters, we could introduce L1 or L2 regularization if n(reviews)<1000 (or some other threshold). It's easy to do with pytorch.
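A minimal sketch of what that could look like in PyTorch, assuming the parameters live in a tensor and penalizing deviation from the default parameters (the penalty strengths and threshold below are placeholders, not anything FSRS actually uses):

```python
import torch

def regularized_loss(base_loss: torch.Tensor,
                     params: torch.Tensor,
                     default_params: torch.Tensor,
                     n_reviews: int,
                     l1_lambda: float = 0.1,
                     l2_lambda: float = 0.1,
                     threshold: int = 1000) -> torch.Tensor:
    # Only apply the penalty for small collections.
    if n_reviews >= threshold:
        return base_loss
    # Penalize deviation from the defaults so that, with little data,
    # the optimized parameters stay close to the default ones.
    diff = params - default_params
    l1 = diff.abs().sum()
    l2 = diff.pow(2).sum()
    return base_loss + l1_lambda * l1 + l2_lambda * l2
```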
@L-M-Sherlock @user1823 I think there are 2 options: 1) Allow optimization of all parameters even when n(reviews) is very low, say, 100; and introduce L1/L2 regularization when n<1000. 2) If option 1 is difficult to implement, then only do pretrain aka only change the first 4 parameters when 100<=n<=1000. This should be very easy, since the current implementation of pretrain already has a penalty based on the number of reviews.
Btw, I think 16 reviews is too little, I would recommend setting 100 as the new required minimum. Also, @RlzHi, according to your graph the confidence intervals (I assume that's what the bands are) stop overlapping only around 100 reviews.
@Expertium @user1823 I agree that 16 might be too little, especially if Anki auto-optimises FSRS at set intervals. I'm thinking that if someone notices FSRS isn't working well with the default parameters for their specific deck, then optimising with as few as 16 reviews could be beneficial. We could do further research by plotting default FSRS loss on one axis and optimised FSRS loss on the other, and checking whether there's a threshold such that, even for a low number of reviews, it's better to optimise FSRS whenever the default FSRS loss exceeds it. We can test this with pretrain as well.
It also might be best to update the confidence interval calculation if anyone knows the best way to do it mathematically. I'm currently taking a random 10% of the data, finding the mean, repeating that 200 times, and plotting the IQR (which shows the confidence interval of the mean; the IQR of the raw values has a different meaning).
Interesting point about stability. Do you think it matters if the parameters change significantly, as long as doing so achieves a better prediction?
The specialized tool for calculating accurate confidence intervals is scipy.stats.bootstrap with method='BCa'.
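For instance, a BCa bootstrap of the mean loss at a given size could look roughly like this (the losses array is just a placeholder for the per-collection values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
losses = rng.random(500)  # placeholder: per-collection losses at one size

res = stats.bootstrap(
    (losses,),               # data is passed as a sequence of samples
    np.mean,                 # the statistic we want an interval for
    confidence_level=0.95,
    n_resamples=2000,
    method="BCa",            # bias-corrected and accelerated bootstrap
    random_state=rng,
)
print(res.confidence_interval.low, res.confidence_interval.high)
```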
Interesting point about stability. Do you think it matters if the parameters change significantly, as long as doing so achieves a better prediction?
The problem is that if the sample size is very small (and 16 is small no matter how you look at it, especially since FSRS has 17 parameters), it may lead to overfitting. The algorithm would perform really well on those 16 reviews, and very poorly in practice.
Gotcha thanks, I'll give that research a try and see what the results are. We can confirm if it overfits significantly with small review sizes regardless of the default FSRS loss, and what the minimum review count we can use is. I think pretrain has potential here with only optimising 4 parameters.
Update:
Optimised FSRS seems to perform better than the default values. The increased variance in the confidence intervals around size=8 comes from there being very few cases where it's possible to train with size ≤ 8; in those cases we use the default parameters. Using pretrain (optimising only the first 4 parameters) would be best for n(reviews)<1000 to ensure the parameters don't change too much and to increase their stability.
Personally, I do not think this needs changing. Lower sample sizes are more likely to be noisy, and users who are new to Anki may take a while to be consistent in their grading. Slightly increasing performance for the first few hundred reviews does not seem worth it.
Pretrain is robust even on small sample sizes. Plus, a lot of users have complained about the limit. @L-M-Sherlock what do you think? According to Rlz, pretrain is always better than using default parameters, and the L1 penalty ensures that parameters won't become too big/too small on small sample sizes. EDIT: according to Rlz's graph, 100 reviews is where 95% confidence intervals of FSRS-4.5 with pretrain and FSRS-4.5 with default parameters stop overlapping. But that was updated 11 hours ago, so assuming Rlz was running the benchmark all night, he can probably update the graph. EDIT 2: Now with 4000 collections analyzed, it seems that the confidence intervals stop overlapping at around 70 reviews.
@dae I think the most significant improvement is improving the worst case scenario. We've found it's not just the average that is improving, but also the cases where the default parameters don't work for a specific deck.
As an example I came across, there was a deck which did not perform well with the default parameters. The default initial stability was far too high, so the user was forgetting a large amount of the cards before the next review, and the parameters not changing caused this issue to continue for every card they made. After switching to pretrain for the first 1000 reviews, this improved significantly. Even after 16 reviews the initial stability was far more accurate and, if the user was learning that deck again today, they would have been far more likely to achieve their target retention. This is especially true for the start of their learning, which in some cases could be a great cause of motivation.
Also, it's important to note that pretrain factors in the amount of data there is. If the user has not learnt many cards, it will account for this and make smaller changes to the parameters. That's the main deciding factor for us recommending pretrain for the first 1000 reviews, and why it performs so well in our analyses. It's very careful when adjusting the parameters with low sample sizes, yet it can bring significant improvements, especially in the worst case scenario where the assumptions of the default parameters do not hold.
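To illustrate the idea only (this is not FSRS's actual pretrain code), a count-based penalty behaves like shrinking each fitted value toward its default, with the weight on the fitted value growing as more reviews back it; k below is an arbitrary placeholder constant:

```python
def shrink_toward_default(fitted: float, default: float, n_reviews: int, k: int = 64) -> float:
    # With few reviews the default dominates; with many reviews the fitted value does.
    weight = n_reviews / (n_reviews + k)
    return weight * fitted + (1 - weight) * default
```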
Thanks for elaborating. The fact that it factors in the sample size is good. I think you've sold me.
@dae I think it's best if you give us some more time to figure out the details. While it's clear that pretrain can be helpful for n(reviews)<1000, it's not clear whether there should still be a hard limit - such that below it, default parameters are used and no optimization happens at all - or whether the hard limit should be removed entirely. So I suggest releasing Anki 24.04 as planned, without changing how the optimizer works compared to the most recent rc. Then, 1 or 2 weeks later, you can release Anki 24.04.1 with the new change once LMSherlock, Rlz, and I figure everything out.
Sorry, I meant I was withdrawing my objection, not suggesting we rush this change in.
@dae disregard what I said above, we found a rule that performs strictly better than using default parameters for any sample size. And your next phrase will be, "I am not fond of the amount of complexity this creates for a relatively small gain". Pseudocode:
import math

old_loss = evaluate(old_parameters, dataset)
default_loss = evaluate(default_parameters, dataset)
loss = math.inf
parameters = None
if len(dataset) >= 32:
    loss, parameters = optimize(dataset)  # returns (math.inf, None) if optimization failed due to insufficient data
if len(dataset) >= 8:
    loss_pretrain, parameters_pretrain = pretrain_only(dataset)  # returns (math.inf, None) if optimization failed due to insufficient data
    if loss_pretrain < loss:
        loss = loss_pretrain
        parameters = parameters_pretrain
if default_loss < loss:
    loss = default_loss
    parameters = default_parameters
if old_loss < loss:
    loss = old_loss
    parameters = old_parameters
This rule is complicated, but it guarantees three things: 1) Users will never get parameters worse than the default parameters 2) Users will never get parameters worse than the previous parameters 3) That the best results out of default, pretrain and full optimization are selected
It's unlikely that this will be changed further, so unless you or @L-M-Sherlock have any objections, I recommend implementing this rule in the current version of Anki. Sorry that it's just barely before the deadline, but Rlz and I don't see a good reason to postpone it to the next release.
This rule is complicated, but it guarantees three things:
- Users will never get parameters worse than the default parameters
- Users will never get parameters worse than the previous parameters
- That the best results out of default, pretrain and full optimization are selected
It's unnecessary to evaluate the default parameters. For all new users, old_parameters is default_parameters. So point 2 already ensures that the parameters are better than the default parameters.
old_loss = evaluate(old_parameters, dataset)
Which loss? RMSE or log loss?
Which loss?
RMSE.
It's unnecessary to evaluate the default parameters. For all new users, old_parameters is default_parameters.
But that is true only for the very first optimization.
we found a rule that performs strictly better than using default parameters for any sample size.
@Expertium, this conclusion is based on the RMSE. But, what about overfitting? Can we be sure that overfitting would not be a significant problem when reviews > 32? Just comparing the RMSE would not help in this case because the RMSE on the training data would be low when overfitting occurs.
A good way to test overfitting would be to split large datasets based on time and then see how the parameters trained from earlier revlogs perform when applied to later revlogs.
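A sketch of that check, assuming the revlogs are sorted by time and reusing the hypothetical optimize()/evaluate() helpers from the pseudocode above (not the actual benchmark API):

```python
def temporal_overfitting_check(revlogs, train_fraction=0.5):
    """Train on the earlier part of a collection, evaluate on the later part.

    optimize() and evaluate() are the same placeholders as in the pseudocode
    above (optimizer + RMSE evaluation), not the actual benchmark API.
    """
    split = int(len(revlogs) * train_fraction)
    earlier, later = revlogs[:split], revlogs[split:]
    _, params = optimize(earlier)
    train_rmse = evaluate(params, earlier)
    test_rmse = evaluate(params, later)
    # A test RMSE far above the train RMSE indicates overfitting.
    return train_rmse, test_rmse
```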
https://github.com/open-spaced-repetition/srs-benchmark/blob/main/notebook/minimum_limit.ipynb
Optimized on x reviews, tested on the next 10.
Optimized on x reviews, tested on the next 100.
Optimized on x reviews, tested on the next 1000.
Conclusion: if the user optimizes parameters on 32 reviews, they will work well for the next 10 reviews, and even for the next 100 reviews. But not for the next 1000 reviews. In order to get a good test error on the next 1000 reviews, the dataset must be 100-200 reviews.
@user1823
This conclusion is based on the RMSE. But, what about overfitting? Can we be sure that overfitting would not be a significant problem when reviews > 32?
If optimisation is done on 32 reviews and it has overfit, then we would see a test loss much higher than default. Have a look at this histogram. Since we don't have as many large losses as default, it shows it isn't overfitting. Or at least, if it is overfitting, it is by a small amount, since it performs better than default and there are fewer worst-case scenarios.
An interesting point is that optimising on 8 reviews and leaving the deck unoptimised for the next 1000 is a bad idea. Have a look at this graph. With lookahead 10 (i.e. we will train on Size reviews, then see how well it predicts the next 10), it is best to optimise on all sizes.
Now take a look at this graph. If the user optimises on 8 or 16 reviews, then leaves the deck unoptimised for the next 100 reviews, it doesn't do as well.
I think the takeaway from this is that it doesn't overfit in the short-term, but it does overfit in the long-term if left unoptimised. The solution to this seems to be: If the user has optimised on n reviews (e.g. 16), then left the deck unoptimised for 5n reviews (e.g. 100), they should be prompted to re-optimise the deck.
Optimized on x reviews, tested on the next 100.
I think that this is the graph we are most interested in. In other words, we can assume that the user would wait for at least 100 reviews before optimizing again. So, we should ensure that the parameters still fit nicely after 100 more reviews.
The confidence bands of dry-run and optimized parameters stop overlapping at about 10^1.8 (≈ 63). So, I think that we should not allow optimization unless there are at least 63 reviews.
Those are not confidence intervals, but rather 25% and 75% percentiles.
EDIT: nevermind, I confused some of the code
Those are not confidence intervals, but rather 25% and 75% percentiles.
Whatever. Stats is not my field of expertise. But, I hope you got the idea.
I think the takeaway from this is that it doesn't overfit in the short-term, but it does overfit in the long-term if left unoptimised. The solution to this seems to be: If the user has optimised on n reviews (e.g. 16), then left the deck unoptimised for 5n reviews (e.g. 100), they should be prompted to re-optimise the deck.
I came to an almost similar conclusion. However, I think that the number should be less than 5n. Maybe 3n or even 2n?
I came to an almost similar conclusion. However, I think that the number should be less than 5n. Maybe 3n or even 2n?
That sounds good. Optimise every time your number of reviews has doubled/tripled, perhaps
Ok, nevermind, those are, indeed, confidence intervals. My bad. So it seems that we need around 50-60 reviews for the parameters to work well on the next 100 reviews, and we need around 200 reviews for the parameters to work well on the next 1000 reviews. So perhaps we should increase the limit for full optimization from 32 to 64 or 128? @RlzHi As for how often to reoptimize, I think 2n is good. But that's up to Dae to decide when to implement a notification.
Sounds great, we can use pretrain up to 64 reviews and fully optimised thereafter. When the number of reviews a user has doubles, we can prompt them to optimise their deck.
A nice feature of optimising every 2x reviews is that as the deck becomes larger, we optimise less frequently.
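A minimal sketch of that trigger, assuming Anki keeps track of the review count at the time of the last optimisation (a hypothetical field, not an existing one):

```python
def should_prompt_reoptimize(reviews_now: int, reviews_at_last_optimize: int) -> bool:
    # Prompt once the review count has doubled since the last optimisation.
    # Doublings get further apart as the deck grows, so prompts become rarer over time.
    return reviews_now >= 2 * reviews_at_last_optimize
```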
we can use pretrain up to 64 reviews
Looking at the graph (lookahead = 100), even pretrain should not be used before 64 reviews
With 32 reviews, pretrain should perform reasonably up to 132 reviews (100 more). If we optimise every time your number of reviews has doubled, the next optimisation will be at 64 reviews (32 more).
I don't have a graph for lookahead 32, but I do have one for 25:
It looks like we can say with 95% certainty that using pretrain is better in this case.
With 32 reviews, pretrain should perform reasonably up to 132 reviews (100 more).
In the graph with lookahead = 100, the curves and the bands for pretrain and full optimization appear to be almost overlapping till about 64 reviews. So, I don't think that there should be any difference in the performance of the two.
If we optimise every time your number of reviews has doubled, the next optimisation will be at 64 reviews (32 more). It looks like we can say with 95% certainty that using pretrain is better in this case.
I agree. But, then this is valid for full optimization also.
Edit: But then, with full optimization, we have the problem of stability of the parameters too. So, yes, the minimum limit to allow full optimization should be higher.
The problem with full optimization is that, in theory, it's less stable than pretrain due to a lack of regularization, i.e. a penalty for making the parameters too large/too small. Though it doesn't actually seem to show up in the data.
Sounds good. In which case, do we agree that it's best to use pretrain up to 64 reviews, full optimisation from 64 onwards, and to prompt re-optimisation whenever the review count doubles?
A nice feature of optimising every 2x reviews is that as the deck becomes larger, we optimise less frequently.
It sounds good. I would add that we shouldn't even allow pretrain before 32 reviews. The graph with lookahead = 25 shows that about 32 reviews are required to say with 95% certainty that the pretrain parameters would be better than the default ones for the next 25 reviews. This means that below 32 reviews, the user would need to optimize even more frequently than every 2x reviews.
One question: Does the dataset used to prepare these graphs contain same-day reviews, manual revlogs, etc? If not, Anki would have to use the filtered revlog count for deciding whether to allow optimization or not.
So 32 reviews for pretrain and 64 reviews for the full optimization?
The graph below is for 8 reviews, predicting on the next 10. I like the idea of allowing pretrain for very few reviews because it improves the worst-case scenarios. Imagine you have a deck that FSRS is doing poorly with, you'll be stuck doing 32 reviews with low accuracy. In some cases, that could be several days of low-quality estimates. I would rather give the user the choice of optimising in this situation.
Good question, I believe LMSherlock will be the best person to answer that.
I would rather give the user the choice of optimising in this situation.
Ok. We can remove the limits for pre-train and see how it goes. Because we are talking about a very small number of reviews, people will actually be able to share their experience with us during the beta cycle.
I would rather give the user the choice of optimising in this situation.
I'd rather not. FSRS is already confusing for most people. We should decrease the cognitive burden for the user, not increase it.
Pretrain outperforms default for a low number of reviews. I imagine most people won't ever optimise FSRS with few reviews unless default is performing badly, and those who take the time to do it will be willing to spend the time to press the optimise button a few more times.
Maybe we are misunderstanding each other. I meant that there shouldn't be a special window, like "Are you sure you want to run the optimizer when the number of reviews is so low?"
Ah yes, I completely agree with you on that.
@user1823 here's the answer to your earlier question: message (1).txt, or https://github.com/open-spaced-repetition/srs-benchmark/blob/a2c7a20996c4ac73492337e7e6919ac5b3aa4c7a/notebook/metric_over_size_comparison_3.txt
Size: 8
Amt where train loss: fullopt < default < pretrain: 28 / 7586 ~ Test loss: 0.210, 0.193, 0.201
Amt where train loss: fullopt < pretrain < default: 815 / 7586 ~ Test loss: 0.250, 0.251, 0.255
Amt where train loss: default < pretrain < fullopt: 4 / 7586 ~ Test loss: 0.281, 0.220, 0.221
Amt where train loss: default < fullopt < pretrain: 24 / 7586 ~ Test loss: 0.319, 0.317, 0.315
Amt where train loss: pretrain < fullopt < default: 87 / 7586 ~ Test loss: 0.280, 0.281, 0.277
Amt where train loss: pretrain < default < fullopt: 1 / 7586 ~ Test loss: 0.240, 0.228, 0.240
They don't add up to 7586 because Rlz used <, not <=.
Three conclusions from looking at the file: 1) The "this is how things are supposed to be" permutation, which is fullopt < pretrain < default, is the most commonly occurring one for any n. That's reassuring. 2) The second most common permutation is pretrain < fullopt < default, for any n. This means that we could benefit from adding a comparison of pretrain RMSE and fullopt RMSE. However, the improvement from switching is very small (0.01 in most cases), so it's not worth it. 3) The default < pretrain < fullopt case is usually accompanied by very high loss, in other words, the most intuitively wrong permutation usually has the highest loss.
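For reference, a rough sketch of how such a tally could be produced from per-collection train/test losses (the dict layout and names are assumptions, not the actual benchmark code):

```python
from itertools import permutations
import numpy as np

def tally_orderings(train_losses: dict, test_losses: dict,
                    methods=("fullopt", "pretrain", "default")) -> dict:
    """train_losses / test_losses map method name -> array of shape (n_collections,)."""
    n = len(train_losses[methods[0]])
    out = {}
    for order in permutations(methods):
        # Strict "<" comparisons, which is why the counts don't sum to the total.
        mask = np.ones(n, dtype=bool)
        for a, b in zip(order, order[1:]):
            mask &= train_losses[a] < train_losses[b]
        out[" < ".join(order)] = {
            "count": int(mask.sum()),
            "mean_test_loss": {m: float(test_losses[m][mask].mean()) if mask.any() else float("nan")
                               for m in methods},
        }
    return out
```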
@RlzHi, would you be interested in doing another analysis?
Currently, we know that it is good to re-optimize the parameters after 2n reviews. But, is more frequent optimization worth it? For example, if I optimize the parameters on 100k reviews, should I think about re-optimizing them only after I have 200k reviews?
Here is how you can go about it:
- Take some values of n, such as 16, 32, 64, 100, 1000, 10000, 50000, 100000, 200000.
- For each n, select those collections that have at least 2n reviews.
- For each n, optimize the parameters on the first n reviews and evaluate them up to the first 2n reviews.
- For each n, plot a graph between RMSE and the number of reviews (up to 2n).

If the RMSE goes up significantly between n and 2n, we would know that it would be better to re-optimize the parameters in between.
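A sketch of that procedure for a single collection and a single n, again using the hypothetical optimize()/evaluate() helpers from the pseudocode earlier in the thread:

```python
import numpy as np

def rmse_growth_curve(revlogs, n, step=None):
    """Optimize on the first n reviews, then track RMSE as reviews accrue up to 2n.

    revlogs must be sorted by time and contain at least 2*n reviews.
    Returns (sizes, rmses) for plotting RMSE against the number of reviews.
    """
    assert len(revlogs) >= 2 * n
    _, params = optimize(revlogs[:n])
    step = step or max(1, n // 10)
    sizes, rmses = [], []
    for m in range(n, 2 * n + 1, step):
        sizes.append(m)
        rmses.append(evaluate(params, revlogs[:m]))
    return np.array(sizes), np.array(rmses)
```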
@user1823 Sure, it may take some time to create this one as higher values of n take longer to process. Will tag you when it's finished.
Update: It's taking a very long time to complete; it seems to be because the optimisations I've added don't work with creating this graph (especially for large n). From what I'm seeing so far, re-optimising should happen at least every 3x, which seems in line with the previous graphs.
Every 2x seems ideal (i.e. if the last optimisation was at 100 cards, it doesn't need to be optimised for the next 200), which has the nice effect of less training being needed as the number of reviews grows.
Let me know if automatic optimisation is being considered and I'll see what I can do to generate these graphs.
Hey everyone,
We currently have a minimum of 1000 reviews for FSRS to be optimised, which I believe is now changing to 400. After some research and discussions with @L-M-Sherlock it seems we can lower this as far as 16, or remove it completely.
I've forked and updated srs-benchmark which contains the benchmarking code for this: https://github.com/RlzHi/srs-benchmark/blob/main/notebook/metric_over_size.ipynb
The dry run is FSRS 4.5 without optimisation, and after approximately 16 reviews, optimised FSRS performs better than the default parameters.