FBartos / zcurve

zcurve R package for assessing the reliability and trustworthiness of published literature with the z-curve method
https://fbartos.github.io/zcurve

fit z-curve (mixture model) with all z-values rather than only statistically significant ones #16

Open Yefeng0920 opened 10 months ago

Yefeng0920 commented 10 months ago

@FBartos @gaborcsardi I would be grateful if you could tell me how to fit a collection of z-values without truncation at 1.96. As I understand it, z-curve uses only the statistically significant z-values to fit the mixture model. How can I use all z-values, regardless of statistical significance? I ask because I want to test the following: for a dataset without publication bias (which Registered Reports can guarantee), the EDR from a mixture model fitted to only the statistically significant z-values should be similar to the EDR from a model fitted to all z-values.

Best, Yefeng

FBartos commented 10 months ago

Hi Yefeng,

You can use the control argument of the zcurve() function to specify a, the lower bound of the fitting range. See the following example:

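library(zcurve)
# simulate 100 z-values (illustrative data only)
z <- rnorm(100)
# a = 0 lowers the start of the fitting range from the default significance threshold to 0
fit <- zcurve(z = z, control = list(a = 0))
summary(fit)  # print the estimates
plot(fit)     # plot the fitted z-curve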

See ?control_EM for more details.

Hope this helps! Frantisek
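For the comparison asked about in the original question, a minimal sketch could fit the same bias-free data twice, once with the default fitting range and once with a = 0, and compare the two summaries. The simulated data (sample size, mean of 2, and the seed) are assumptions chosen only for illustration, and bootstrap = FALSE is used here just to skip the confidence intervals and keep the run short.

```r
library(zcurve)

set.seed(2024)
# Hypothetical bias-free literature: every study is "published",
# so z is drawn without any selection (the mean of 2 is arbitrary).
z <- abs(rnorm(1000, mean = 2, sd = 1))

# Default fit: the fitting range starts at qnorm(0.05/2, lower.tail = FALSE).
fit_sig <- zcurve(z = z, bootstrap = FALSE)

# Fit with the lower bound moved to 0, so non-significant z-values are used too.
fit_all <- zcurve(z = z, control = list(a = 0), bootstrap = FALSE)

# Compare the estimates (including the EDR) from the two fits.
summary(fit_sig)
summary(fit_all)
```

If the data really are free of selection, the two sets of estimates would be expected to be close, up to sampling error.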

Yefeng0920 commented 10 months ago

Hi Frantisek @FBartos, this is quite useful. Let me check my understanding of the so-called folded truncated distribution. The raw values are first converted into absolute values (magnitudes), and the data are then restricted to a certain range. By default, the range is qnorm(0.05/2, lower.tail = FALSE) to 5. Finally, a mixture model is fitted to the truncated values with EM estimation. Only the nominally statistically significant z-values are fitted because doing so can account for publication bias, although I do not fully understand why this works. Do I understand the whole process correctly?
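The folding and truncation steps described above can be sketched in base R. This is only an illustration of the idea, not the package's internal code: the variable names and the simulated raw values are hypothetical, and the bounds are the defaults mentioned in this comment.

```r
set.seed(1)
# Hypothetical raw z-statistics with both signs.
z_raw <- rnorm(1000, mean = 1.5, sd = 1)

# Step 1: fold, i.e. keep only the magnitude of each z-value.
z_abs <- abs(z_raw)

# Step 2: truncate to the fitting range [a, b].
a <- qnorm(0.05 / 2, lower.tail = FALSE)  # two-sided 5% threshold, about 1.96
b <- 5
z_fit <- z_abs[z_abs >= a & z_abs <= b]

# Share of the folded values that falls inside the fitting range.
length(z_fit) / length(z_abs)
```

The mixture model would then be fitted to z_fit only, with each component's density renormalised to the same range.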

FBartos commented 10 months ago

Yes, that's correct. In short, under selection for statistical significance, estimating the model with a truncated likelihood from only the statistically significant results gives us estimates that are unaffected by publication bias. We then use the locations of the truncated distributions to extrapolate to the statistically non-significant results (which we do not use for estimation because selection might make them non-representative).
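A toy illustration of why the truncated likelihood helps, using a single folded-normal component instead of the full mixture and base R's optimize() instead of EM. It is a simplified sketch under those assumptions, not the zcurve implementation; mu_true, the sample size, and the function name nll are hypothetical.

```r
set.seed(123)
mu_true <- 1.5                              # assumed true mean of the z-statistics
a <- qnorm(0.05 / 2, lower.tail = FALSE)    # significance threshold

z_all <- abs(rnorm(20000, mean = mu_true, sd = 1))
z_sig <- z_all[z_all > a]                   # only significant results are "published"

# Negative log-likelihood of a folded normal (sd = 1) truncated to (a, Inf):
# the folded density is divided by the probability of being significant.
nll <- function(mu, z, a) {
  dens  <- dnorm(z, mu, 1) + dnorm(-z, mu, 1)
  p_sig <- pnorm(a, mu, 1, lower.tail = FALSE) + pnorm(-a, mu, 1)
  -sum(log(dens) - log(p_sig))
}

mu_hat <- optimize(nll, interval = c(0, 6), z = z_sig, a = a)$minimum

naive_mean <- mean(z_sig)   # biased upward, since every value exceeds a
# Extrapolate: implied probability of significance among all results.
edr_hat  <- pnorm(a, mu_hat, 1, lower.tail = FALSE) + pnorm(-a, mu_hat, 1)
edr_true <- pnorm(a, mu_true, 1, lower.tail = FALSE) + pnorm(-a, mu_true, 1)

c(mu_hat = mu_hat, naive_mean = naive_mean, edr_hat = edr_hat, edr_true = edr_true)
```

The naive mean of the significant values overstates the true mean, while the truncated fit recovers it from the significant values alone and can then be extrapolated to the share of significant results among all studies.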

Yefeng0920 commented 10 months ago

@FBartos It is really a great idea. But I still think that using only the average to summarize the discovery rate or replication rate is not ideal on some occasions. Therefore, it would be good to present the whole distribution.