Closed BramVanroy closed 11 months ago
Interesting, do you have a paper where the "85" really is the estimated mean and not just the sample statistic? I guess this discussion may be related, where the highest-voted answer says
... The bootstrapped mean value is not a better estimator for your population parameter....
(than the sample statistic "result")
Technically, I think the bootstrap mean could simply be calculated from the `bootstrap_distribution` attribute of the `res` object that you get from the scipy bootstrap, so it should be easy to implement.
Maybe it can look like that:
"result": 81.3,
"ci": [
80.67,
81.89
],
"mean": x
but seeing this I think that it would maybe add more complexity in understanding/reporting the results than it may help?
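A minimal sketch of how the bootstrap mean could be read off the scipy result, assuming the statistic is a plain mean over hypothetical per-item scores (smatchpp's actual statistic and internals may differ, and `bootstrap_distribution` needs a reasonably recent scipy):

```python
import numpy as np
from scipy.stats import bootstrap

# Hypothetical per-item scores; stand-in data, not real smatchpp output.
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.8, scale=0.1, size=200)

# Percentile bootstrap of the mean with a 95% confidence level.
res = bootstrap(
    (scores,),
    np.mean,
    confidence_level=0.95,
    n_resamples=2000,
    method="percentile",
    random_state=rng,
)

report = {
    # Sample statistic computed once on the full data.
    "result": float(np.mean(scores)),
    # Bounds of the bootstrap confidence interval.
    "ci": [float(res.confidence_interval.low), float(res.confidence_interval.high)],
    # Mean of the statistic over all bootstrap resamples.
    "mean": float(np.mean(res.bootstrap_distribution)),
}
print(report)
```

In this toy setup "mean" and "result" are nearly identical, which illustrates why the extra field may add little beyond the sample statistic.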
I agree that it might be confusing, and to be honest I am not 100% sure my request makes sense from a statistical perspective. I was creating the following graph.
These were created with the output of smatchpp, so "result" is at the top and the CI is within brackets. In terms of notation, it would be easier to be able to write `mean +- 2*stdev`. But that is not exactly the case. As an example, take a result of `73.4`: the midpoint between 72.7 and 74.0 would be `73.35` (`73.35+-0.65`). While these are very close, the result is not exactly in the middle, because the CI is calculated through bootstrapping and `result` is just the single calculation on the full corpus, if I understand correctly. So in terms of notation we can't rightly report it as `73.4+-0.65` (because this 73.4 is not the mean of the bootstrap but the independent calculation on the full corpus).
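A quick check of the arithmetic, using the numbers from the example (the helper name `interval_summary` is made up for illustration):

```python
def interval_summary(result, low, high):
    """Compare a point estimate with the midpoint of its confidence interval."""
    return {
        "midpoint": (low + high) / 2,    # centre of the interval
        "half_width": (high - low) / 2,  # what "+-" would have to be
        "below": result - low,           # distance from result to lower bound
        "above": high - result,          # distance from result to upper bound
    }

summary = interval_summary(73.4, 72.7, 74.0)
print(f"midpoint +- half-width: {summary['midpoint']:.2f} +- {summary['half_width']:.2f}")
print(f"result 73.4: -{summary['below']:.1f} / +{summary['above']:.1f}")
```

The interval is centred on 73.35 with a half-width of 0.65, while the result of 73.4 sits 0.7 above the lower bound and 0.6 below the upper bound, so a single symmetric "+-" around the result does not reproduce the interval.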
I am not sure whether it is clear what I am trying to say, sorry!
I think it is clear what you want to say :-) You want to use the +- notation, and then there are two problems:
- bootstrap mean != sample statistic, which would make +- weird
- confidence intervals can also be asymmetric, which would also make +- weird
While I think the standard deviation can also be easily obtained with scipy, the confidence interval is generally considered more informative. So maybe it can help to slightly change the notation?
What I found nice is the notation that my colleague used in a recent paper (I have seen others use it too). It looks like this:
The tiny left number is the lower bound of the confidence interval, the number in the middle is the basic sample statistic, and the tiny right number is the upper bound of the confidence interval.
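One way such a cell could be generated for a LaTeX table is sketched below. The exact markup used in the paper is not known here, so the `\scriptsize` layout and the helper name `format_ci` are assumptions for illustration only:

```python
def format_ci(result, low, high, digits=1):
    """Render 'small lower bound, statistic, small upper bound' as LaTeX."""
    return (
        rf"{{\scriptsize {low:.{digits}f}}}\;"   # tiny lower bound on the left
        rf"{result:.{digits}f}\;"                # sample statistic in the middle
        rf"{{\scriptsize {high:.{digits}f}}}"    # tiny upper bound on the right
    )

print(format_ci(73.4, 72.7, 74.0))
```

Because both bounds are printed explicitly, this layout works for asymmetric intervals and does not require the statistic to sit at the midpoint.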
If I understand the output correctly, when we are bootstrapping we get results like this:
I think that `result` is calculated independently, on the full corpus, and `ci` is the 95% CI min/max. It would be useful to also include the estimated mean based on the bootstrap. As far as I can tell, this is common in research papers too, where you report "85 +- 1.2" where 85 is the estimated mean and 1.2 the CI range with 95% confidence.