GidonFrischkorn / Tutorial-MixtureModel-VWM

This repository contains R scripts & raw data to illustrate how to estimate measurement models for visual working memory tasks.
GNU General Public License v3.0

Parameter recovery simulations (R2.3) #16

Open venpopov opened 2 months ago

venpopov commented 2 months ago

Reviewer's comments:

R2.3.2: Based on the simulated example shown in the appendix, the authors may find high reliability in recovering individual model parameters with the Bayesian approach. This is good news; however, the contrast between MLE and Bayesian approaches at a suboptimal level of trial count (e.g., 50 trials) seems a bit unfair. I believe a more useful approach is to set the trial count as a parameter and evaluate how well different methods can reproduce the underlying distribution as well as individual model parameters. This demonstration is frequently seen in the literature, providing insights to researchers regarding the minimal trial count needed to provide a good estimate of model parameters.

R2.3.3: Related to this, the authors seem to misinterpret Grange & Moore (2022). In their paper, they find that "For the two-component model, parameter recovery was good with as few as 50 trials (rκ = .82, rpu = .84); for excellent recovery, 200 trials were required for κ, and 500 trials were required for pu." These results, of course, hinge on the range of k — a classic issue of correlation. In any case, these results by no means indicate that "to obtain robust parameter estimates, maximum likelihood estimation requires at least 200 trials* per subject per condition." Also, the example in the appendix shows a very low reliability of MLE kappa estimates (r = 0.40) at 50 trials per condition, indicating that the current implementation might not provide the best individual fits. In practice, we often see reasonable model fit with >100 trials per condition when Pmem <0.6 (as also seen in Grange & Moore, 2022).

R2.3.4: Considering that individual differences are useful to provide additional information about cognitive models in addition to experimental effects between conditions, I wonder if the authors could better address these discrepancies and enlighten the readers with potential means to best leverage the Bayesian approach to probe individual-level parameters.

venpopov commented 2 months ago

@GidonFrischkorn I started doing the parameter recovery simulations, and wanted to get your input on the reporting.

Turns out our comparison in the appendix is a little unfair - it is from just one simulated dataset where the correlation happens to improve for the bmm fit. I noticed that if we generated multiple datasets, the stats vary quite a lot for correlations. So I am doing the parameter recovery by simulating 200 samples from the same parameter sets, and calculating three measures of recovery:

1) Correlation between true and estimated parameters, from bmm and from mixtur
2) RMSE between subject-level estimates and the true parameter in each sample
3) Error in estimating the population mean parameter
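For concreteness, here is a minimal sketch of the three recovery measures. It is a stand-in in Python (the actual simulations are run in R with bmm and mixtur); the function name and signature are illustrative, not part of either package.

```python
import numpy as np

def recovery_metrics(true_params, est_params):
    """Compute three recovery measures for one simulated sample.

    true_params, est_params: 1-D arrays of subject-level parameter values
    (true generating values vs. values recovered by the fitting method).
    Returns (correlation, RMSE, population-mean error).
    """
    true_params = np.asarray(true_params, dtype=float)
    est_params = np.asarray(est_params, dtype=float)
    # 1) rank-order recovery of individual differences
    r = np.corrcoef(true_params, est_params)[0, 1]
    # 2) accuracy of the individual subject-level estimates
    rmse = np.sqrt(np.mean((est_params - true_params) ** 2))
    # 3) bias in recovering the population mean
    mean_error = est_params.mean() - true_params.mean()
    return r, rmse, mean_error
```

Each simulated sample yields one triplet, so with 200 samples per condition we get a distribution for each measure.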

While the correlation actually doesn't differ much between the bmm and ML approaches (it's a bit better, but not as drastically as we reported in the appendix; I'm a little surprised by this), the other two metrics show a drastic improvement.

I am doing this with 50, 100, 200, and 500 observations per participant. Below is how I'm thinking of visualizing the results (currently just for 50, 100, and 200; the last simulations will take a couple more days to finish).

image

Any thoughts on the visualization?

I think, as we discussed, we can make a vignette on the bmm website for these and not include them in the paper. But we'll need to remove the previous appendix, so maybe we should include them as a new appendix after all?

venpopov commented 2 months ago

Might also be worth it to simulate inference when we have true differences (or no differences) between conditions, and calculate statistics about False Positives and False Negatives with the 1-step vs 2-step approach

venpopov commented 2 months ago

An alternative visualization:

image

GidonFrischkorn commented 2 months ago

I like both versions, but the second one looks more comprehensive and cleaner to me.

I have one suggestion: maybe we can report correlation, RMSE, and bias for both subject-level and population-level parameters. Given that for each sample size there are 200 simulation runs, we can get the correlation and the RMSE for the population-level parameters too, right?

And then I would suggest having two separate plots for each parameter.

Or do you think this is too much detail?

GidonFrischkorn commented 2 months ago

The result that the correlations do not differ is indeed surprising. A first idea I had was that it might be an interaction between shrinkage (reducing between-subject variance) and lower estimation errors for the bmm recovery, whereas the ML recovery has larger between-subject variance but also larger estimation errors. Somehow these could result in similar rank-order stability.

Irrespective of that, this is still a nice result with respect to the individual-differences comment. It shows that despite shrinkage the bmm estimation recovers the rank order between subjects at least as well as the ML estimation.

GidonFrischkorn commented 2 months ago

Another thought about the correlations: how many subjects have you simulated? 30?

If it is a smaller sample size, there could be a lot of instability in estimating the standard deviation of the subject-level parameter distribution, and thus the shrinkage does not provide the biggest benefit.

To show this, we could check whether hierarchical estimation provides a bigger benefit for larger sample sizes, especially with few trials per subject. At least that is what I would expect.

venpopov commented 2 months ago

> I like both versions, but the second one looks more comprehensive and cleaner to me.
>
> I have one suggestion: maybe we can report correlation, RMSE, and bias for both subject-level and population-level parameters. Given that for each sample size there are 200 simulation runs, we can get the correlation and the RMSE for the population-level parameters too, right?
>
> And then I would suggest having two separate plots for each parameter.
>
> Or do you think this is too much detail?

In this case the population parameter is fixed across the 200 simulations - my goal was to see, for the same set of parameters, how the recovery varies across different samples. So it doesn't make sense to report correlations for population-level parameters. What differs across simulations is the subject-level parameters (drawn anew for each simulation from the same population distribution) and the data.

I was wondering about running simulations for different population parameters, but I'm worried it will take a really long time if I do 200 simulations per combination of kappa, pmem, and Nobs.

venpopov commented 2 months ago

> The result that the correlations do not differ is indeed surprising. A first idea I had was that it might be an interaction between shrinkage (reducing between-subject variance) and lower estimation errors for the bmm recovery, whereas the ML recovery has larger between-subject variance but also larger estimation errors. Somehow these could result in similar rank-order stability.
>
> Irrespective of that, this is still a nice result with respect to the individual-differences comment. It shows that despite shrinkage the bmm estimation recovers the rank order between subjects at least as well as the ML estimation.

After I thought about it more, I am actually not surprised. The shrinkage normalizes the estimates, but there is no way to "gain" information about the subject-level parameters. It makes them more accurate overall, but the rank order can't be much improved, because the partial pooling just brings down the magnitude of the estimation error.
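The point that shrinkage improves accuracy but not rank order can be made with a toy simulation. This is an illustrative Python sketch (not the actual R/bmm machinery): we shrink noisy per-subject estimates linearly toward the grand mean, with the shrinkage weight assumed known, and compare the two metrics.

```python
import numpy as np

rng = np.random.default_rng(1)

n_subj = 40
true = rng.normal(0.0, 1.0, n_subj)            # true subject-level parameters
ml_est = true + rng.normal(0.0, 2.0, n_subj)   # noisy per-subject (ML-like) estimates

# Linear shrinkage toward the grand mean, mimicking hierarchical partial pooling.
# The weight is the reliability ratio var_true / (var_true + var_noise); both
# variances are assumed known here purely for illustration.
w = 1.0 / (1.0 + 4.0)
shrunk = ml_est.mean() + w * (ml_est - ml_est.mean())

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# Pearson correlation is invariant to linear transformations, so the
# rank-order recovery is unchanged by shrinkage...
print(corr(true, ml_est), corr(true, shrunk))
# ...but the subject-level RMSE drops, because shrinkage scales down
# the estimation error.
print(rmse(true, ml_est), rmse(true, shrunk))
```

Because shrinkage here is a monotone linear map of the ML estimates, the correlation with the true values is identical by construction; only the error magnitude changes.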

I have done a few more simulations, and what we had in the appendix is actually a different case - one with a factor predictor. In that case, for the ML approach there are only N observations per condition, but 2*N in total per subject. So there the correlation does improve much more, because the intercept is informed by all the data for the subject, essentially doubling the number of observations. I will report those separately, because that is indeed a very important point.

venpopov commented 2 months ago

> Another thought about the correlations: how many subjects have you simulated? 30?
>
> If it is a smaller sample size, there could be a lot of instability in estimating the standard deviation of the subject-level parameter distribution, and thus the shrinkage does not provide the biggest benefit.
>
> To show this, we could check whether hierarchical estimation provides a bigger benefit for larger sample sizes, especially with few trials per subject. At least that is what I would expect.

40 subjects - I matched the previous simulation in the appendix. I'm afraid that with many more than that it becomes quite time-intensive to run 200 simulations with many observations. For the 500-observations-per-participant case it already takes ~950 seconds per simulation, so this simulation alone will take me 52 hours (I'm currently halfway through).

GidonFrischkorn commented 2 months ago

Ah, I see. We should definitely show the difference between recovery from a single condition - which is closer to the case in which you "just" measure individual differences - and recovery of the condition difference, with the gain in rank-order recovery there.

And regarding the number of subjects, we could maybe do this only for the case with 50 observations to avoid overly long runtimes. But I also think this is not too important, so we do not desperately need those simulations.

venpopov commented 2 months ago

cool, thanks for the feedback!

venpopov commented 2 months ago

> R2.3.3: Related to this, the authors seem to misinterpret Grange & Moore (2022). In their paper, they find that "For the two-component model, parameter recovery was good with as few as 50 trials (rκ = .82, rpu = .84); for excellent recovery, 200 trials were required for κ, and 500 trials were required for pu." These results, of course, hinge on the range of k — a classic issue of correlation. In any case, these results by no means indicate that "to obtain robust parameter estimates, maximum likelihood estimation requires at least 200 trials* per subject per condition." Also, the example in the appendix shows a very low reliability of MLE kappa estimates (r = 0.40) at 50 trials per condition, indicating that the current implementation might not provide the best individual fits. In practice, we often see reasonable model fit with >100 trials per condition when Pmem <0.6 (as also seen in Grange & Moore, 2022).

This comment struck me during the original review, because my own simulations from years ago, before Grange & Moore, showed poor recovery with 50 trials. So I repeated Grange & Moore's simulations with one improvement: for each parameter set, I simulate 200 samples instead of just 1. As we saw from the plots above, the correlation can vary quite a lot from sample to sample.

So I did what they did: take a grid of 500 values of kappa (1-16) and pmem (0.6-1), simulate data, and fit the model. But I repeated this 200 times, so we get a distribution for the correlations and the RMSE (these are now population-level parameters, with no subject hierarchy).
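The structure of that repeated simulation can be sketched as follows. This is a Python stand-in: instead of actually fitting the mixture model by ML (which the real R simulations do), recovered kappas are modeled as truth plus sampling noise whose scale is an arbitrary assumption, just to show how the recovery correlation varies from repetition to repetition.

```python
import numpy as np

rng = np.random.default_rng(123)

n_params, n_reps = 500, 200
kappa_true = rng.uniform(1, 16, n_params)   # grid of generating kappa values

# Stand-in for the MLE fits: estimate = truth + sampling noise. The noise SD
# is an illustrative assumption, not the behavior of the actual ML estimator.
noise_sd = 4.0
correlations = np.empty(n_reps)
for i in range(n_reps):
    kappa_est = kappa_true + rng.normal(0, noise_sd, n_params)
    correlations[i] = np.corrcoef(kappa_true, kappa_est)[0, 1]

# With 200 repetitions we get a distribution of recovery correlations,
# rather than the single value a one-shot simulation reports.
print(correlations.mean())
print(np.quantile(correlations, [0.025, 0.975]))
```

The key point is the last two lines: a single simulated sample gives one correlation, which can land anywhere in this distribution, so one-shot recovery studies can easily over- or understate recovery quality.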

Here are the results:

image

image

So for 50 observations, their reported correlation of 0.84 is about the maximum possible. When we run the simulation many times, the mean correlation for kappa is 0.7, with a highest density interval of 0.15-0.84.

image

By their own standard, 41% of the time the recovered correlation is only in the "fair" range.

To do: