Closed machow closed 5 years ago
Hi there, thanks for the report.
There's definitely something fishy going on with the estimator. It seems I was trying to implement a kind of early stopping: theta values are checked from left to right and, once their log-likelihood rises and then starts falling, the remaining candidate theta values don't need to be checked. However, the implementation was wrong.
The new intervals generated in subsequent rounds were also wrong, as you pointed out. All of this made the estimator less precise than it could be.
I decided to remove the early-stopping idea and implemented something a little more straightforward: after each iteration, the new candidate theta values are generated around the current best theta, but in smaller intervals. I added some comments to the code and used more descriptive variable names, since it has been a long time since I last worked on this code.
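That shrinking-interval strategy can be sketched roughly like this (a minimal, self-contained sketch, not catsim's actual code; the function name and parameters are hypothetical):

```python
import math

def hill_climb(ll, lower, upper, points=10, passes=9):
    """Shrinking-interval search: each pass evaluates `points` evenly
    spaced candidates between `lower` and `upper`, then narrows the
    bounds around the best candidate before the next pass."""
    best_theta = (lower + upper) / 2
    for _ in range(passes):
        step = (upper - lower) / (points - 1)
        max_ll = -math.inf  # reset every pass
        for i in range(points):
            theta = lower + i * step
            current_ll = ll(theta)
            if current_ll > max_ll:
                best_theta, max_ll = theta, current_ll
        # new, smaller interval centered on the current best theta,
        # so the best theta found so far is re-checked next pass
        lower, upper = best_theta - step, best_theta + step
    return best_theta
```

Each pass shrinks the interval by a constant factor, so the precision of the estimate grows geometrically with the number of passes.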
I committed a possible fix to the testing branch. Take a look and see if it makes sense to you. The lines you were interested in are now here. Example verbose output from the hill-climbing estimator is shown below. Values with a larger log-likelihood are printed, and we can now see that subsequent passes do generate better thetas.
```
Pass: 1
Bounds: -1.735914394995148 0.9443143368283956
Interval size: 0.2978031924248381
Theta: -1.735914394995148, LL update: inf
Theta: -1.43811120257031, LL update: 0.026153563041037664
Pass: 2
Bounds: -1.735914394995148 -1.140308010145472
Interval size: 0.06617848720551955
Pass: 3
Bounds: -1.5042896897758296 -1.3719327153647904
Interval size: 0.014706330490115382
Theta: -1.4307580373252522, LL update: 3.4473263991863234e-05
Pass: 4
Bounds: -1.4454643678153676 -1.4160517068351368
Interval size: 0.0032680734422478874
Theta: -1.4291240006041284, LL update: 3.219923880415365e-06
Theta: -1.4258559271618805, LL update: 1.5761288620907976e-06
Pass: 5
Bounds: -1.4291240006041284 -1.4225878537196326
Interval size: 0.0007262385427218021
Theta: -1.4269452849759632, LL update: 1.9605989365345522e-07
Pass: 6
Bounds: -1.427671523518685 -1.4262190464332414
Interval size: 0.00016138634282714115
Theta: -1.4268645918045497, LL update: 1.023564344393435e-08
Theta: -1.4267032054617226, LL update: 8.590400391028652e-09
Pass: 7
Bounds: -1.4268645918045497 -1.4265418191188954
Interval size: 3.5863631739463386e-05
Pass: 8
Bounds: -1.426739069093462 -1.4266673418299831
Interval size: 7.969695942078303e-06
Theta: -1.4266992206137514, LL update: 1.1692868895352149e-11
Pass: 9
Bounds: -1.4267071903096935 -1.4266912509178094
Interval size: 1.771043542708739e-06
Theta: -1.4266983350919802, LL update: 1.2860823517257813e-12
```
Thanks for your quick response! I think the way you did it previously, going...
seemed like a good, slightly more efficient strategy, given that the likelihood is concave (at least for the 2PL model; off the cuff I think this is also the case for 3PL and 4PL, though I could be very wrong about this). Once the next candidate has a lower likelihood, the likelihood has started going down, so the remaining candidates will also be lower and don't need to be checked.
As long as the best theta is reset each round, it should be okay (since the search will pass over the previously best theta with increasing granularity).
I see what you mean. If I reset `best_theta` and `max_ll` in every search pass, and if I am sure that the current best theta is included in all future search passes, I am guaranteed to at least find the same `best_theta` every time, but I also have the chance to find a better one.
> if I am sure that the current best theta is included in all future search passes
The thing is, I never assumed this condition. If you have observed it, it was there by coincidence. I also didn't want to reset `best_theta` and `max_ll` on every search pass because, if a better theta had been found in a previous search pass, the function would return that value instead of the best value found in the current pass.
I decided to explicitly keep the log-likelihood of the current theta (`current_ll`) and the log-likelihood of the previous theta (`previous_ll`); if `current_ll < previous_ll`, the function is only going to go downhill from there, so we can stop the search and start another pass.
The implementation of early stopping is here.
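That within-pass early stop could look something like this (a hypothetical sketch assuming a concave log-likelihood, not the actual catsim code):

```python
import math

def scan_pass(ll, candidates):
    """Left-to-right scan with early stopping: once the log-likelihood
    drops relative to the previous candidate, a concave likelihood can
    only keep decreasing, so the rest of the pass is skipped."""
    best_theta, max_ll = None, -math.inf
    previous_ll = -math.inf
    for theta in candidates:
        current_ll = ll(theta)
        if current_ll < previous_ll:
            break  # downhill from here on (assuming concavity)
        if current_ll > max_ll:
            best_theta, max_ll = theta, current_ll
        previous_ll = current_ll
    return best_theta, max_ll
```

Note that the correctness of the `break` depends entirely on the concavity assumption discussed above; for a multimodal likelihood it could stop at a local maximum.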
I can think of ways to do it better, like searching once to the left and once to the right of the current best theta, in order to know which side actually has to be searched. This would cut the search space in half and isn't hard to implement.
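The direction probe could be as simple as this (a hypothetical sketch assuming a unimodal log-likelihood; the function name is made up):

```python
def pick_direction(ll, best_theta, step):
    """Probe one candidate on each side of the current best theta to
    decide which half of the interval is worth searching, assuming
    the log-likelihood is unimodal around the maximum."""
    left_ll = ll(best_theta - step)
    right_ll = ll(best_theta + step)
    return "left" if left_ll > right_ll else "right"
```

Two extra evaluations per pass then let the search skip half of the candidates.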
I'll close this since the issue appears to be fixed. If you find other anomalies, please report them and I'll try to fix them as quickly as possible. Nevertheless, we can continue discussing this bug here.
Thanks for the fix!
That's fair, there could be some small improvements. I guess knowing whether the current best theta lies to the left or right of the best theta among the current round's values carries a lot of information for limiting the search range.
As is, it's super helpful. Thanks again for being so responsive!
Hey, I think that oftentimes the HillClimbingEstimator only does one round of estimation (rather than 10 iteratively more precise rounds). The issue is...
I need to patch this within the next couple of days on DataCamp/catsim, so I can submit a PR to fix it!
Solution
Code here:
https://github.com/douglasrizzo/catsim/blob/master/catsim/estimation.py#L138-L139
It seems like the solution is that `max_ll` needs to be reset each round?
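That would fit the symptom: a `max_ll` carried over from an earlier, already-converged pass could make every comparison in later passes fail, so only the first pass contributes anything. A hypothetical sketch of the failure mode and the proposed reset (not the actual catsim code):

```python
import math

def run_pass(ll, lower, upper, points, max_ll):
    """One search pass with early stopping. If `max_ll` is carried
    over from a previous, nearly converged pass, no candidate may
    improve on it and the pass contributes no best theta at all."""
    best_theta, previous_ll = None, -math.inf
    step = (upper - lower) / (points - 1)
    for i in range(points):
        theta = lower + i * step
        current_ll = ll(theta)
        if current_ll < previous_ll:
            break  # early stop: likelihood started going down
        if current_ll > max_ll:
            best_theta, max_ll = theta, current_ll
        previous_ll = current_ll
    return best_theta, max_ll

ll = lambda t: -(t - 1.6) ** 2
# stale max_ll from a previous pass: no candidate improves on it,
# so the pass returns no best theta
stale = run_pass(ll, 1.0, 2.0, 5, max_ll=-0.005)
# resetting max_ll each round: the pass finds a best theta again
fresh = run_pass(ll, 1.0, 2.0, 5, max_ll=-math.inf)
```

In this sketch `stale` comes back with `best_theta` of `None`, while `fresh` finds the best candidate of the pass, which is why resetting `max_ll` at the start of every round looks like the right fix.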