Previously, you said, "Before we can implement this, we’ll need to know how we should handle RMSE/log-loss moving in different directions." I think the right thing is "Only look at RMSE." The problem with log-loss is that it depends on retention. RMSE (bins), on the other hand, does not depend on retention.
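To make that concrete, here is a small toy simulation of my own (the `rmse_bins` below buckets reviews by predicted probability only, which is a simplification of how FSRS actually bins): a perfectly calibrated predictor evaluated at different retention levels gets a very different log loss but an essentially zero RMSE (bins) every time.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_loss(p, y):
    # Average negative log-likelihood of the observed pass/fail outcomes.
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def rmse_bins(p, y, n_bins=20):
    # Simplified binned RMSE: bucket reviews by predicted probability and
    # compare the mean prediction to the observed retention in each bucket.
    # (The real FSRS metric bins differently; this is only for illustration.)
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    errs, counts = [], []
    for b in np.unique(idx):
        m = idx == b
        errs.append((p[m].mean() - y[m].mean()) ** 2)
        counts.append(m.sum())
    return np.sqrt(np.average(errs, weights=counts))

for target_retention in (0.95, 0.85, 0.70):
    # Perfectly calibrated predictions scattered around the target retention:
    # each outcome is drawn with exactly the predicted probability.
    p = np.clip(rng.normal(target_retention, 0.05, 200_000), 0.01, 0.99)
    y = (rng.random(p.size) < p).astype(float)
    print(f"retention={y.mean():.2f}  log_loss={log_loss(p, y):.3f}  "
          f"RMSE(bins)={rmse_bins(p, y):.4f}")
```

Log loss tracks the retention level itself (for a calibrated predictor it is essentially the entropy of the outcomes), while the binned metric only measures calibration.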
Possible approach we could use:
- evaluate the new parameters against the old ones, and return an error if they're worse
That would just confuse the user. I think instead it's better to just automatically use "Evaluate" with both old and new parameters, and then keep whichever set is better. All of that should happen without notifying the user. Even if we notified the user, it wouldn't really help them make any kind of good decision; it would just make them wonder "Huh? Is the optimizer broken? Is there a problem with my settings?". @L-M-Sherlock what do you think?
Sorry, I should have been more explicit. I was expecting the frontend to catch the returned error, and skip updating the existing weights in that case.
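For what it's worth, here is a rough sketch of the selection logic being discussed; `optimize`, `evaluate`, and `Metrics` are placeholders rather than the actual Anki/fsrs-rs API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metrics:
    rmse_bins: float
    log_loss: float

def choose_params(
    revlog: list,
    current_params: list[float],
    optimize: Callable[[list, list[float]], list[float]],
    evaluate: Callable[[list, list[float]], Metrics],
) -> tuple[list[float], bool]:
    """Run the optimizer, then keep whichever parameter set evaluates better.

    Both sets are evaluated on exactly the same revlog, and only RMSE (bins)
    is compared, per the suggestion above.
    """
    new_params = optimize(revlog, current_params)
    old_rmse = evaluate(revlog, current_params).rmse_bins
    new_rmse = evaluate(revlog, new_params).rmse_bins

    if new_rmse < old_rmse:
        return new_params, True      # the new parameters are kept
    # Instead of surfacing an error, the UI could show something like
    # "Your parameters are already optimal."
    return current_params, False     # the existing parameters are kept
```

Whether the second branch is reported as an error for the frontend to catch or as a friendly message is purely a presentation question; the comparison itself is the same either way.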
By the way, according to this comment by LMSherlock, it doesn't matter whether the default parameters or the previous ones are used when initializing the optimizer.
The problem with log-loss is that it depends on retention
It doesn't matter in this case, because we evaluate the parameters in the same dataset.
@L-M-Sherlock, do you think that we should only look at the RMSE (bins)? If so, why?
I want to add that I re-optimized my parameters today and got the following result:
In this case, the relative change in RMSE is larger than the relative change in log loss. So, I think that the parameters are worse overall.
But, if the relative change in log loss is larger in magnitude, should we still consider only RMSE to decide whether to update the parameters or not?
According to my previous experiment, the RMSE(bins) can be decreased even when the parameters are changed very slightly: https://github.com/open-spaced-repetition/spaced-repetition-algorithm-metric/blob/main/ICILoss.ipynb
It doesn't matter in this case, because we evaluate the parameters in the same dataset.
It does. If the user's retention consistently goes down, log-loss will keep getting worse after each optimization, but RMSE (bins) won't necessarily keep getting worse, or at least not as much.
It does. If the user's retention consistently goes down, log-loss will keep getting worse after each optimization, but RMSE (bins) won't necessarily keep getting worse, or at least not as much.
The log-loss of the old parameters is also calculated on the current dataset, regardless of whether retention went up or down.
I don't understand what you're trying to say.
No reviews are done between the two evaluations. So, the retention must be the same.
Ah. Yeah, that's true. I was talking about a more complicated scenario, where the parameters are re-optimized many times and each time the new log-loss is greater. But it's possible that the new RMSE wouldn't be greater each time.
Well, I didn't get what you are saying. But, you don't need to explain. (Chances are high that I still won't be able to draw any conclusions from that)
Can there be any case where the log loss decreases and the RMSE increases but the parameters are more optimal? If not, we can simply use RMSE as the only criterion to decide whether the new parameters are better or worse.
Can there be any case where the log loss decreases and the RMSE increases but the parameters are more optimal?
Theoretically, there is a case where the log loss increases and the RMSE decreases and the parameters are worse: https://github.com/open-spaced-repetition/spaced-repetition-algorithm-metric/blob/main/metrics_research.ipynb
Edit: The image at the bottom.
Its RMSE(bins) is zero, but its log loss is worse.
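To get the flavour of that degenerate case without opening the notebook (again a toy simulation of my own, not the linked code): a model that predicts the deck's average retention for every single review puts everything into one bin whose mean prediction matches the observed retention, so its RMSE(bins) comes out as ~0 even though its log loss is worse than that of a model which actually tells easy and hard reviews apart.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_loss(p, y):
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def rmse_bins(p, y, n_bins=20):
    # Same simplified binned RMSE as in the earlier snippet.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    errs, counts = [], []
    for b in np.unique(idx):
        m = idx == b
        errs.append((p[m].mean() - y[m].mean()) ** 2)
        counts.append(m.sum())
    return np.sqrt(np.average(errs, weights=counts))

# True pass probabilities vary from review to review.
true_p = np.clip(rng.normal(0.90, 0.08, 100_000), 0.01, 0.99)
y = (rng.random(true_p.size) < true_p).astype(float)

discriminating = true_p                  # a model that tells easy and hard reviews apart
constant = np.full_like(y, y.mean())     # always predicts the average retention

for name, p in (("discriminating", discriminating), ("constant", constant)):
    print(f"{name:>14}: log_loss={log_loss(p, y):.3f}  RMSE(bins)={rmse_bins(p, y):.4f}")
# The constant model lands every review in a single bin whose mean prediction
# equals the observed retention, so its RMSE(bins) is ~0 even though its
# log loss is worse than the discriminating model's.
```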
By the way, I found a practical one (but I don't know which parameters are better):
Source: https://github.com/open-spaced-repetition/spaced-repetition-algorithm-metric/blob/main/ICILoss.ipynb
The first one's log loss is worse than the second one's, but its RMSE is better.
Can you point out which one? Just edit your previous comment to include the example instead of creating a new comment. I don't want to pollute the comments list.
The case of an algorithm giving the same prediction for all the reviews doesn't seem to be practical. Even if there were such a case where a very large number of reviews were put in the same bin, it seems difficult to detect.
Of course, we can change the criteria later if we have more insights. But for now, I think that only using RMSE(bins) is the way to go.
All of that should happen without notifying the user.
I think that we can show a message similar to "Your parameters are already optimal." This way, the user isn't left wondering why nothing changed.
Well, I didn't get what you are saying. But, you don't need to explain. (Chances are high that I still won't be able to draw any conclusions from that)
I meant a situation like this. As n reviews increases:
1) Retention decreases
2) Log-loss increases
3) RMSE (bins) stays the same
Here's a toy example. The numbers aren't shown because the exact values don't matter; what matters is the slope of each curve. Retention goes down, log-loss goes up, RMSE (bins) stays the same. This seems like a plausible scenario to me, which is why I recommend focusing only on RMSE (bins).
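For what it's worth, the analytic version of that toy example, assuming the predictions stay well calibrated while retention drifts: the expected log loss of a calibrated predictor is simply the binary entropy of the outcomes,

$$\mathbb{E}[\text{log loss}] = -\,r \ln r - (1 - r) \ln (1 - r),$$

which grows as retention $r$ falls from, say, 0.95 toward 0.5, even though the predictions are as accurate as they can be. RMSE (bins) only compares the average prediction with the observed retention inside each bin, so for a calibrated predictor it stays near 0 regardless of $r$.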
As n reviews increases:
But in the current problem, n_reviews remains the same while the parameters change. (The two evaluations run on exactly the same data, just with different parameters.)
So, what you have described doesn't seem to be applicable in the current situation.
I think that we can show a message similar to "Your parameters are already optimal."
I like that idea. So @L-M-Sherlock, user1823 and I agree that only RMSE (and not log-loss) should be used to decide which parameters to keep, so if you don't have any objections, you should probably start working on implementing it.

Also, about the practical example you found: in the latter case, not only RMSE but every other metric is also worse. R-squared is worse, MAE is worse, all ICI-related metrics are worse, and the intercept and slope are further from 0 and 1 (which correspond to a perfect algorithm). So I would say that the latter parameters are worse.
Fine. I don't have any objections.
SuperMemo says that the universal metric (UM) is the best way to compare two spaced-repetition algorithms.
@L-M-Sherlock, can the UM be calculated as quickly as calculating RMSE twice? If so, we can use UM to compare the old and the new weights.
If not, RMSE(bins) would be fine.
can the UM be calculated as quickly as calculating RMSE twice
Of course. I have implemented it in the Python Optimizer. I plan to port it into fsrs-rs.
When implementing this, we'll want to update to fsrs 0.3: https://github.com/open-spaced-repetition/fsrs-rs/pull/154
I'm not sure why you'd want to keep your previous parameters?
Consider the scenario where you've done 1 review and you generate your parameters. The parameters fit your 1 review perfectly, so RMSE of 0. You do 10000 more reviews, and then regenerate the parameters. The new parameters have worse RMSE than the RMSE of the previous parameters when they were generated. It keeps the parameters generated after 1 review.
Or does the optimizer not always generate the parameters with the lowest loss for the dataset? I imagine there's some confusion somewhere.
The new parameters have worse RMSE than the RMSE of the previous parameters when they were generated.
No, both sets of parameters are evaluated on the full dataset. In your example, both sets of parameters would be evaluated on the 10000 reviews dataset. We do not compare RMSE(old parameters, 1 review) to RMSE(new parameters, 10000 reviews). We compare RMSE(old parameters, 10000 reviews) to RMSE(new parameters, 10000 reviews).
The new parameters have worse RMSE than the RMSE of the previous parameters when they were generated.
No, both sets of parameters are evaluated on the full dataset. In your example, both sets of parameters would be evaluated on the 10000 reviews dataset. We do not compare RMSE(old parameters, 1 review) to RMSE(new parameters, 10000 reviews). We compare RMSE(old parameters, 10000 reviews) to RMSE(new parameters, 10000 reviews).
Thank you for the clarification, I was having a conversation with that prolific r/anki redditor and he didn't have a proper answer for me. So this means it's the case where the optimiser does not always give the optimal parameters for the training data. Why is that (is there a training/validation split)?
I was having a conversation with that prolific r/anki redditor
u/ClarityInMadness? That's me. Sorry if I missed your comment.
So this means it's the case where the optimiser does not always give the optimal parameters for the training data. Why is that?
Honestly, ask @L-M-Sherlock. I don't fully understand why myself.
I was having a conversation with that prolific r/anki redditor
u/ClarityInMadness? That's me. Sorry if I missed your comment.
So this means it's the case where the optimiser does not always give the optimal parameters for the training data. Why is that?
Honestly, ask @L-M-Sherlock. I don't fully understand why myself.
oh this was a while ago, I don't think you knew why at the time. I've just been wondering since then why we keep the previous FSRS parameters.
So this means it's the case where the optimiser does not always give the optimal parameters for the training data. Why is that
Gradient descent-based methods are not guaranteed to find the globally optimal parameters.
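A minimal illustration of that point, with a generic one-dimensional function rather than anything FSRS-specific: plain gradient descent converges to whichever basin its starting point lies in, so an unlucky initialization ends in a worse minimum even on identical data.

```python
def f(x):
    # A simple function with a shallow local minimum near x ≈ -0.77
    # and a deeper global minimum near x ≈ 1.4.
    return x**4 - x**3 - 2 * x**2 + 0.5 * x

def grad(x):
    return 4 * x**3 - 3 * x**2 - 4 * x + 0.5

def gradient_descent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

for x0 in (-2.0, 2.0):
    x = gradient_descent(x0)
    print(f"start {x0:+.1f} -> x = {x:+.3f}, f(x) = {f(x):+.3f}")
# Both runs converge, but only the run that starts in the right-hand basin
# finds the lower (global) minimum; the other one gets stuck in the local one.
```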
@L-M-Sherlock, I agree that this hack was needed to deal with the issue that you just mentioned.
However, there is a high probability that the latest version has another issue that makes it more likely to get worse parameters after optimization. For instance, I have never been able to get better parameters than those I got at the end of December (when 23.12.1 was released).
In my opinion, the current issue is not the right place to discuss this further. So, if you want more details and/or want to discuss something, I recommend creating a new issue in the FSRS repo.
So this means it's the case where the optimiser does not always give the optimal parameters for the training data. Why is that
Gradient descent-based methods are not guaranteed to find the globally optimal parameters.
Thanks Jarrett, I wasn't sure what kind of optimisation was being done. That makes sense
https://forums.ankiweb.net/t/measures-to-prevent-new-fsrs-parameters-from-being-worse-than-the-old-ones/38957