FBartos / zcurve

zcurve R package for assessing the reliability and trustworthiness of published literature with the z-curve method
https://fbartos.github.io/zcurve

How does z-curve deal with censored p-values? #20

Open LukasWallrich opened 7 months ago

LukasWallrich commented 7 months ago

@FBartos thank you for this package!

I was wondering whether you describe anywhere how z-curve deals with censored p-values? I can't find it in the papers or the package docs, but might have overlooked something? We are in the process of writing a Registered Report using zcurve, and need to explain that there.

My understanding so far is that ps > .05 are just ignored. But what happens with ps < .05? Are they just used in the EM algorithm as is, with the bounds that are passed? I see some transformation steps in the code that I can't quite figure out - are they just about ensuring that the lower bound is above 0?

Finally (and feel free to ignore this part as it is not about zcurve per se), might you be able to sense-check my understanding of how EM deals with censored values? My understanding is that on each iteration, the model essentially predicts the exact value of the censored observations based on the current parameter estimates, then updates the model parameters to maximise the log-likelihood, and then iterates again. Does that sound right?

FBartos commented 6 months ago

Dear Lukas,

In statistics, censored observations are commonly handled with a censored likelihood: an observation that is only known to lie in some interval contributes the probability of that interval rather than the density at an exact value. The most common example is survival analysis, where some survival times are only known to be longer than the end of follow-up. For z-curve, this corresponds to some p-values only being known to be < 0.01, < 0.001, etc. We describe the censored likelihood approach in the following paper: https://doi.org/10.1371/journal.pone.0290084 (you might also check https://doi.org/10.1093/biostatistics/kxt007 for a simpler model and more references).
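To make the idea concrete, here is a minimal sketch of a censored likelihood in Python. It is not the zcurve implementation: it assumes a single Normal(mu, 1) component observed as an absolute z-value, and the helper names (`p_to_z`, `folded_pdf`, `folded_cdf`, `log_lik`) are made up for illustration. Exact values contribute a density, and censored values contribute the probability of their interval.

```python
from math import log
from statistics import NormalDist

STD = NormalDist()  # standard normal, used to convert p-values to z-values


def p_to_z(p):
    """Convert a two-sided p-value to an absolute z-value."""
    return STD.inv_cdf(1 - p / 2)


def folded_pdf(z, mu):
    """Density of |Z| at z >= 0 for Z ~ Normal(mu, 1) (illustrative assumption)."""
    d = NormalDist(mu, 1)
    return d.pdf(z) + d.pdf(-z)


def folded_cdf(z, mu):
    """P(|Z| <= z) for Z ~ Normal(mu, 1)."""
    d = NormalDist(mu, 1)
    return d.cdf(z) - d.cdf(-z)


def log_lik(mu, exact_z, censored):
    """Log-likelihood mixing exact and interval-censored absolute z-values.

    exact_z:  list of exactly known |z| values
    censored: list of (lower, upper) bounds on |z|; upper=None means unbounded
    """
    ll = sum(log(folded_pdf(z, mu)) for z in exact_z)
    for lower, upper in censored:
        upper_prob = 1.0 if upper is None else folded_cdf(upper, mu)
        ll += log(upper_prob - folded_cdf(lower, mu))
    return ll


# "p < .001" only tells us that |z| exceeds the z-value for p = .001.
censored = [(p_to_z(0.001), None)]
exact = [p_to_z(0.03), p_to_z(0.004)]
for mu in (0.0, 2.0, 4.0):
    print(mu, round(log_lik(mu, exact, censored), 3))
```

Under mu = 0, the contribution of a value censored at "p < .01" is simply log(0.01), which is a quick sanity check of the interval-probability logic.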

Regarding the usage of p-values: z-curve uses only statistically significant p-values (i.e., p-values larger than 0.05 are not used for fitting the model unless you adjust the fitting range manually). p-values censored as < 0.05 are not really informative, because they can take any value in the fitting range. All the remaining censored values are used for fitting the model (there are some additional restrictions on the upper fitting range, and cumulative probability is used for p-values lower than that bound, to ensure convergence of the models).
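The point about "p < .05" being uninformative can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the package code: one Normal(mu, 1) component on absolute z-values, a likelihood conditioned on significance, and no upper fitting bound. The function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def tail_prob(lower, mu):
    """P(|Z| > lower) for Z ~ Normal(mu, 1)."""
    d = NormalDist(mu, 1)
    return 1 - (d.cdf(lower) - d.cdf(-lower))


def censored_contribution(p_bound, mu, alpha=0.05):
    """Likelihood of observing 'p < p_bound', conditional on p < alpha."""
    z_bound = STD.inv_cdf(1 - p_bound / 2)  # |z| threshold implied by p_bound
    z_crit = STD.inv_cdf(1 - alpha / 2)     # lower end of the fitting range
    return tail_prob(z_bound, mu) / tail_prob(z_crit, mu)


# 'p < .05' gives the same likelihood for every mu, so it cannot shift the
# estimates; 'p < .001' favours larger mu.
for mu in (0.0, 1.0, 2.0, 3.0):
    print(mu, censored_contribution(0.05, mu), round(censored_contribution(0.001, mu), 4))
```

Because the censoring bound coincides with the lower end of the fitting range, the numerator and denominator are identical and the contribution is constant in mu.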

I'm not an EM algorithm expert, to be honest. Conceptually, I would say that the algorithm optimizes the overall likelihood of all values, with the censored values entering through the censored likelihood (instead of through a precise value).
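As a rough illustration of that description, here is a small EM sketch for a finite mixture in Python. It rests on assumptions that are not taken from the package source: the component means in `MEANS` are hypothetical and held fixed, only the mixture weights are estimated, and each component is conditioned on significance. In this kind of mixture EM, the E-step computes component-membership probabilities for each observation; for a censored observation those probabilities are based on the interval probability, and no exact value is imputed.

```python
from statistics import NormalDist

MEANS = (0.0, 2.0, 4.0)                # hypothetical fixed component means
Z_CRIT = NormalDist().inv_cdf(0.975)   # |z| threshold for p < .05


def tail(lower, mu):
    """P(|Z| > lower) for Z ~ Normal(mu, 1)."""
    d = NormalDist(mu, 1)
    return 1 - (d.cdf(lower) - d.cdf(-lower))


def component_lik(obs, mu):
    """Likelihood of one observation under one component, given |z| > Z_CRIT.

    obs is ("exact", z) or ("censored", lower_bound_on_z).
    """
    kind, value = obs
    d = NormalDist(mu, 1)
    if kind == "exact":
        numerator = d.pdf(value) + d.pdf(-value)  # density of |Z| at value
    else:
        numerator = tail(value, mu)               # probability of the interval
    return numerator / tail(Z_CRIT, mu)


def em_weights(data, n_iter=200):
    """Estimate mixture weights by EM with the component means held fixed."""
    k = len(MEANS)
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        totals = [0.0] * k
        for obs in data:
            # E-step: posterior probability that obs came from each component
            joint = [weights[j] * component_lik(obs, MEANS[j]) for j in range(k)]
            norm = sum(joint)
            for j in range(k):
                totals[j] += joint[j] / norm
        # M-step: new weights are the average membership probabilities
        weights = [t / len(data) for t in totals]
    return weights


data = [("exact", 2.1), ("exact", 2.5), ("exact", 3.2),
        ("censored", 2.58), ("censored", 3.29)]
print([round(w, 3) for w in em_weights(data)])
```

Exact and censored observations go through the same E-step and M-step; they differ only in which likelihood term feeds the membership probabilities.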

Hope this helps!

Frantisek