Yefeng0920 / replication_EcoEvo_git


mathematical formula #3

Closed Yefeng0920 closed 5 months ago

Yefeng0920 commented 10 months ago

This GitHub issue is used for the mathematical formula involved in the paper. This might need input from @ewvanzwet

Let me briefly explain my purpose. There are two ways of computing the parameter of interest. The first is based on simulated data and applies the definition of the parameter of interest directly. For example, a replication is counted as successful when the following condition holds:

$$ z \times z_{\text{repl}} > 0 \ \ \text{and}\ \ |z_{\text{repl}}| > 1.96 $$

The corresponding code is:

    snr=rmix(10^5,p=p,m=m,s=sqrt(sigma^2 - 1))
    z.orig=snr + rnorm(10^5) # original
    z.repl=snr + rnorm(10^5) # replication
    replicate=(z.orig * z.repl > 0) & (abs(z.repl) > 1.96)
    mean(replicate) # unconditional probability of replication
    mean(replicate[abs(z.orig)>1.96]) # conditional probability of replication given |z|>1.96

The second is based on the conditional distribution of the parameter of interest given the real data (rather than simulated data). For example, we can use the following code to estimate the conditional probability of replication:

    # z vs. replication
    z=sort(abs(d$z))
    replicate=rep(NA,length(z))
    s=sqrt(sigma^2-1)
    for (i in 1:length(z)){
      post=posterior(z[i],p,m,s)
      pp=post$p
      pm=post$pm
      ps=post$ps
      replicate[i]=1 - pmix(1.96,p=pp,m=pm,s=sqrt(ps^2 +1))
    }

The formula behind the first estimation method is quite clear to me. I would appreciate it if @ewvanzwet could help derive the formula behind the second estimation method.

ewvanzwet commented 10 months ago

Hi Yefeng,

Since the distribution of the z-values is a mixture of normals, it follows that the distribution of the SNRs is also a mixture of normals (just subtract 1 from the variances of the components). Moreover, the conditional distribution of the SNR given the observed z-value is also a mixture of normals. The formulas are in the appendix of the paper with Schwab and Senn.

The z-value is just the SNR plus standard normal error. So, the conditional distribution of the z-value of an exact replication study (z_repl) given the observed z-value of the original study (z_orig) is equal to the conditional distribution of the SNR given the observed z-value of the original study plus standard normal error. So that's also a mixture of normals.
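The conditional distributions described above can be written out with standard normal-normal updating. This is only a sketch and not the notation of the paper's appendix: $p_k$, $m_k$, $s_k$ are assumed symbols for the weights, means and standard deviations of the mixture components of the SNR, and $\varphi(\cdot\,;\mu,v)$ is the normal density with mean $\mu$ and variance $v$.

$$
\begin{aligned}
q_k(z) &= \frac{p_k\,\varphi(z;\, m_k,\, s_k^2+1)}{\sum_j p_j\,\varphi(z;\, m_j,\, s_j^2+1)}, \qquad
\tilde m_k = \frac{m_k + s_k^2 z}{s_k^2+1}, \qquad
\tilde s_k^2 = \frac{s_k^2}{s_k^2+1}, \\
\text{SNR} \mid z_{\text{orig}} = z &\sim \sum_k q_k(z)\, N(\tilde m_k,\ \tilde s_k^2), \\
z_{\text{repl}} \mid z_{\text{orig}} = z &\sim \sum_k q_k(z)\, N(\tilde m_k,\ \tilde s_k^2 + 1).
\end{aligned}
$$

The extra $+1$ in the last line is the variance of the standard normal error that is added to the SNR to give the replication z-value.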

Now if you want to condition only on the statistical significance of the original study (not z_orig itself), then you have to integrate this distribution over the conditional distribution of the z_orig given |z_orig|>1.96. For the probability of a significant replication, given the statistical significance of the original study you get

$$ P(|z_{\text{repl}}| > 1.96 \mid |z_{\text{orig}}| > 1.96) = \int P(|z_{\text{repl}}| > 1.96 \mid z_{\text{orig}})\, f(z_{\text{orig}} \mid |z_{\text{orig}}| > 1.96)\, dz_{\text{orig}} $$

I do this integration over f(z_orig | |z_orig| > 1.96) by simply averaging over the observed z-values which are larger than 1.96 in absolute value. To write all this out in formulas is "tedious" (as mathematicians would say) and probably not that helpful for our intended audience.

I think it's quite clear that my calculations are correct, because the two methods (Monte Carlo and exact) give almost exactly the same result.

Hope this helps! Erik
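
The agreement between the two methods can be checked on simulated data. The sketch below is not the code used for the paper: the helpers `rmix`, `pmix` and `posterior` are reconstructions from how they are called in this thread (their definitions are not shown here), and the mixture parameters are made up. The component means are set to zero so that working with `abs(z)` is exact by symmetry.

```r
# Hypothetical helpers: reconstructions, not the definitions used in the thread
rmix <- function(n, p, m, s) {
  k <- sample(length(p), n, replace = TRUE, prob = p)  # component labels
  rnorm(n, mean = m[k], sd = s[k])
}
pmix <- function(x, p, m, s) sum(p * pnorm(x, mean = m, sd = s))  # mixture CDF
posterior <- function(z, p, m, s) {
  w <- p * dnorm(z, mean = m, sd = sqrt(s^2 + 1))  # prior weight x marginal density of z
  list(p = w / sum(w),
       pm = (m + s^2 * z) / (s^2 + 1),
       ps = sqrt(s^2 / (s^2 + 1)))
}

# Made-up mixture for the z-values (zero means, so everything is symmetric)
p <- c(0.5, 0.5); m <- c(0, 0); sigma <- c(1.5, 4)
s <- sqrt(sigma^2 - 1)  # standard deviations of the SNR

set.seed(1)
N <- 2e5
snr    <- rmix(N, p, m, s)
z.orig <- snr + rnorm(N)
z.repl <- snr + rnorm(N)
sig    <- abs(z.orig) > 1.96

# Method 1: Monte Carlo, straight from the definition
replicate.mc <- mean((z.orig * z.repl > 0 & abs(z.repl) > 1.96)[sig])

# Method 2: average the exact conditional probability over the significant z-values
repl.prob <- function(z) {
  post <- posterior(z, p, m, s)
  1 - pmix(1.96, p = post$p, m = post$pm, s = sqrt(post$ps^2 + 1))
}
replicate.exact <- mean(sapply(abs(z.orig[sig]), repl.prob))

c(monte_carlo = replicate.mc, exact = replicate.exact)
```

The two numbers should differ only by simulation noise.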

Yefeng0920 commented 10 months ago

Hi Erik @ewvanzwet,

Thanks for your detailed explanation. I agree with you that providing the formula might not be useful. Yes, the two calculation methods agree very well, which gives me high confidence. By the way, could you tell me how to back-calculate the z-value that corresponds to a given replication rate, for example 80%?

Regards, Yefeng

ewvanzwet commented 10 months ago

Why would you want to calculate that?

Usually, people calculate the SNR that is needed for 80% power; that is the objective of sample size calculations. You'd have to invert the function that maps the SNR to the power. In R, you could do that as follows:

    f = function(snr) {
      power=pnorm(-1.96,snr,1) + 1 - pnorm(1.96,snr,1)
      power - 0.8
    }
    uniroot(f,lower=0,upper=10)

You find that an SNR of 2.8 (or -2.8) gives you 80% power.
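
As a quick cross-check of the value 2.8, the root can be compared with the familiar closed-form approximation `qnorm(0.975) + qnorm(0.8)`, which ignores the negligible probability of a significant result in the wrong direction. This is a small sketch; the names `power`, `root` and `approx` are introduced here for illustration only.

```r
# Power of a two-sided z-test at the 5% level, as a function of the SNR
power <- function(snr) pnorm(-1.96, snr, 1) + 1 - pnorm(1.96, snr, 1)

# Numerical inversion, as in the uniroot call above
root <- uniroot(function(snr) power(snr) - 0.8, lower = 0, upper = 10)$root

# Closed-form approximation that drops the far tail
approx <- qnorm(0.975) + qnorm(0.8)

c(root = root, approx = approx)
```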

Now suppose (p,m,sigma) are the parameters of the mixture distribution of the z-values. To find the z-value that gives 80% replication probability, you could do

    s=sqrt(sigma^2-1) # standard deviations of the SNR
    f = function(z){
      post=posterior(z,p,m,s)
      pp=post$p
      pm=post$pm
      ps=post$ps
      replicate=1 - pmix(1.96,p=pp,m=pm,s=sqrt(ps^2 +1))
      replicate - 0.8
    }
    uniroot(f,lower=0,upper=10)
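
A self-contained version of this root search might look as follows. It is only a sketch: `pmix` and `posterior` are reconstructed from their usage in this thread (their definitions are not shown here), and the mixture parameters are invented, so the resulting z-value says nothing about the eco-evo data.

```r
# Hypothetical helpers: reconstructions, not the definitions used in the thread
pmix <- function(x, p, m, s) sum(p * pnorm(x, mean = m, sd = s))
posterior <- function(z, p, m, s) {
  w <- p * dnorm(z, mean = m, sd = sqrt(s^2 + 1))
  list(p = w / sum(w),
       pm = (m + s^2 * z) / (s^2 + 1),
       ps = sqrt(s^2 / (s^2 + 1)))
}

# Made-up mixture parameters for the z-values
p <- c(0.5, 0.5); m <- c(0, 0); sigma <- c(1.5, 4)
s <- sqrt(sigma^2 - 1)  # standard deviations of the SNR

# Probability of a significant replication in the same direction, given z > 0
repl.prob <- function(z) {
  post <- posterior(z, p, m, s)
  1 - pmix(1.96, p = post$p, m = post$pm, s = sqrt(post$ps^2 + 1))
}

# z-value at which the replication probability reaches 80%
z80 <- uniroot(function(z) repl.prob(z) - 0.8, lower = 0, upper = 10)$root
c(z80 = z80, check = repl.prob(z80))
```

`uniroot` only needs the function to change sign on the interval, which holds here because the replication probability is small at z = 0 and close to 1 at z = 10.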

Yefeng0920 commented 10 months ago

Awesome, @ewvanzwet! The reason I want this is that it has the potential to revolutionize the way researchers determine their sample size when designing experiments, and it could be a very good implication of our study. As you are aware, researchers have traditionally relied on power analysis to determine sample sizes. However, a sample size calculation based on the replication rate offers a more straightforward and intuitive indicator for researchers to work with. Of course, this is not the main part of our paper, but it could be a nice addition. What do you think?

Regards, Yefeng

ewvanzwet commented 10 months ago

I understand that this is useful when you're planning a replication study. In my paper with Steve Goodman, we do similar calculations. For example, we calculate how much larger (or smaller) the replication study needs to be to have, say, 80% power. See:

How large should the next study be? Predictive power and sample size requirements for replication studies: https://onlinelibrary.wiley.com/doi/full/10.1002/sim.9406

I don't think this is useful for planning an original study. First of all, we can only (try to) control the SNR - that's what sample size calculations are for. The z-value is only available after the study has been completed. Secondly, the calculations I'm doing all assume that a particular study is "typical" of the field. In other words, I'm viewing a particular study as if it has been randomly drawn from the population of all studies in the field. You can't design a new study to be typical!

Yefeng0920 commented 10 months ago

@ewvanzwet Thanks for your explanation. I agree that this is useful when planning a replication study. I will read the paper you mentioned to learn more about it.

Regards, Yefeng

Yefeng0920 commented 10 months ago

Hi Erik (@ewvanzwet), I have some questions, which may be silly. I have read your paper with Steve Goodman (How large should the next study be? Predictive power and sample size requirements for replication studies). It introduces two interesting concepts: actual power and predictive power. The actual power defined in your paper is basically the replication rate: the probability of reaching p < 0.05 (two-tailed) with the correct direction when the original study is replicated exactly.

I have two questions. First, my understanding is that our current EcoEvo z-value project focuses on the actual power (replication rate), which assesses the current state of the replication rate, that is, the replication rate of a collection of existing studies. If we calculate predictive power for the EcoEvo z-value project, what we get is the replication rate of future studies, that is, of replication studies with the same configuration as the original study. In short, actual power captures the retrospective replication rate (of existing studies), while predictive power captures the prospective replication rate (of future replication studies). If we increase the sample size (and thus the SNR), we can increase the replication rate of the replication studies. Do I understand it correctly?

Second, is using the predictive power to plan a future sample size conceptually similar to the idea of back-transforming an 80% replication rate into a sample size (the uniroot trick)?

Regards, Yefeng

ewvanzwet commented 10 months ago

> Do I understand it correctly?

Yes, that's all correct. You do have to realize that "an exact replication" is largely a theoretical concept, because actual replications show a lot of unexplained variation (heterogeneity).

> Second, is using the predictive power to plan a future sample size conceptually similar to the idea of back-transforming an 80% replication rate into a sample size (the uniroot trick)?

That's also correct. The paper with Goodman has an online supplement (https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fsim.9406&file=SIM_9406_supplement-2022-03-01.pdf) with all the code. On p. 12 you can see me calling the uniroot function.

Erik

Yefeng0920 commented 10 months ago

Hi Erik (@ewvanzwet),

Awesome! Thanks for your swift reply, further clarification of "exact replication", and online materials.

Cheers, Yefeng

Yefeng0920 commented 10 months ago

Hi Erik @ewvanzwet, I have more questions and would be grateful if you could help confirm my approach. My purpose is to make a figure showing the relationship between replication rate and sample size (for a balanced two-sample t-test, namely n1 = n2). My solution is to (1) use uniroot() to get the z-value associated with a pre-defined replication rate (the trick you told me), and (2) assume certain values of Cohen's d (say small 0.2, medium 0.5, and large 0.7), use Cohen's d and the calculated z-value to compute the degrees of freedom (df), and then subtract 2 to get n. Is this correct? Also, do you have any better solutions?

Regards, Yefeng

ewvanzwet commented 10 months ago

Hi Yefeng,

> My purpose is to make a figure to show the relationship between replication rate and sample size

Unfortunately, I don't think that's possible (or at least it's not easy) because there is no direct relation between the sample size and the replication probability.

There is a 1-1 relation between the replication probability and the SNR. We can estimate the joint distribution of the SNR and the z-value, and so we also have the conditional distribution of the SNR given the z-value. Therefore, we also have the conditional probability of a successful replication given the z-value of the original study. We even have the conditional probability of a successful replication if we make the sample size of the replication larger by some percentage. That's because making the sample size larger by, say, a factor of 2 increases the SNR of the replication by a factor sqrt(2).

Best, Erik

ewvanzwet commented 10 months ago

Hi Yefeng,

I was thinking it might help if I go over the relation between Cohen's d, the SNR, the z-value, the power and the sample size. I apologize if this is all too familiar!

Cohen's d is the true difference in means (=the true effect I call beta) divided by the standard deviation in the population which I'll call sigma. So, d=beta/sigma.

Now suppose we have two equal groups each of size n, and we estimate the difference in means by the difference in averages (=the estimated effect which I call b). The standard error of b is sigma*sqrt(2/n).

The SNR is the true effect divided by the standard error of the estimate, so SNR = beta/( sigma*sqrt(2/n)). And since d=beta/sigma, we finally have

SNR=d/sqrt(2/n).

To get 80% power you need SNR=2.8. That's because pnorm(-1.96,2.8,1) + (1 - pnorm(1.96,2.8,1))=0.8. Here I'm using the fact that the z-value is the SNR plus standard normal error.

So, to get 80% power, we put SNR=2.8 into the formula above, and solve for n. We find that we need n=2*(2.8/d)^2 to get 80% power.
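
The chain of relations above can be collected in one line. The symbols are the ones used in this comment; the numerical example with $d = 0.5$ is an added illustration, not a value from the thread.

$$
d = \frac{\beta}{\sigma}, \qquad
\text{se}(b) = \sigma\sqrt{\frac{2}{n}}, \qquad
\text{SNR} = \frac{\beta}{\text{se}(b)} = d\,\sqrt{\frac{n}{2}}
\quad\Longrightarrow\quad
n = 2\left(\frac{\text{SNR}}{d}\right)^{2}.
$$

For example, with $\text{SNR} = 2.8$ and $d = 0.5$ this gives $n = 2 \times 5.6^2 = 62.72$, so about 63 per group under the z-test approximation.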

Note that I'm ignoring the difference between the t-test and the z-test (degrees of freedom and all that stuff), but the approximation works pretty well. I can compare my simple formula to the "proper" sample size calculation based on the t-test:

    sample_size = function(d){power.t.test(delta=d,sd=1,power=0.8,sig.level=0.05)$n}
    d=seq(0.2,0.8,0.01)
    plot(sapply(d,sample_size),2*(2.8/d)^2)

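[Plot not preserved: sample size from power.t.test (x-axis) against the approximation 2*(2.8/d)^2 (y-axis).]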

I hope this helps! Erik

Yefeng0920 commented 10 months ago

Hi Erik @ewvanzwet, that's awesome! The assumptions (equal n, equivalence between the z- and t-test) are reasonable under certain circumstances, and we do not want to cover every situation in this paper; one scenario is enough. Thanks for providing the solution and example R code. That helped a lot. Would it be possible to adapt it to the replication rate? I mean a formula or empirical relationship between n and the replication rate (I want to show this as a figure in the paper). My first reaction is to use a desired replication rate to get z (or the SNR), then to get the power, and finally to use the power to solve for n (based on your formula). Is there a more straightforward solution for this specific question?

Regards, Yefeng

ewvanzwet commented 10 months ago

> All assumptions are reasonable under certain circumstances (equal n, equivalence between z- and t-test), because we do not want to cover every situation in this paper. Only one scenario is enough.

Equal n is not really needed because we're working directly with the SNR and the z-value.

> I mean the formula or empirical relationship between n and replication rate (I want to show this figure in the paper).

As I said, there is no direct relation between the sample size and the replication rate. There is only a direct relation between the SNR and the replication rate. Since we have the conditional distribution of the SNR given the observed z-value, we also have the conditional probability of successful replication given the observed z-value.

Best regards, Erik

Yefeng0920 commented 10 months ago

Hi Erik @ewvanzwet, Thanks for your patient explanation. I am thinking about it and might come back to this question later - still hope to get an empirical relationship between the two in the form of a curve (I know there is no direct link between the two).

Regards, Yefeng

ewvanzwet commented 10 months ago

There is a 1-1 correspondence between the power (probability of a significant result) and the SNR. In the case of a difference of means with equal groups, I showed that SNR = d/sqrt(2/n).

You can see from the formula that without d, the sample size doesn't tell you much about the SNR. And so it also doesn't tell you much about the power.

In other words, it wouldn't make much sense to say something like: "given that the sample size is 2x50=100, we expect that the probability of successful replication is 40%." We would be ignoring the effect size!

What we can say is: "Given that the observed z-value is 2.1, the replication probability is 40%. If the replication study had a sample size twice as large, the replication probability would be 60%." (I'm making up these numbers.)
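
The point that the sample size alone says little about the power can be illustrated by holding n fixed and varying d. This is a small sketch using the z-test approximation from earlier in the thread; the effect sizes are illustrative choices, not estimates.

```r
# Power of a two-sided z-test at the 5% level, as a function of the SNR
power <- function(snr) pnorm(-1.96, snr, 1) + 1 - pnorm(1.96, snr, 1)

n   <- 50                   # per-group sample size, held fixed
d   <- c(0.2, 0.5, 0.8)     # illustrative effect sizes
snr <- d / sqrt(2 / n)      # SNR = d/sqrt(2/n)

data.frame(d = d, snr = snr, power = round(power(snr), 3))
```

With the same 2x50 design, the power ranges from under 20% to nearly 100% depending on d.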

itchyshin commented 10 months ago

@Yefeng0920 and @ewvanzwet - just catching up on these correspondences (I was away last week) - very illuminating. Great to see the square-root law popping up. @ewvanzwet - @Yefeng0920 is working hard to do the first draft (it needs to be short). We will be going through it soon. Yefeng and I will be away for a conference next week so plenty to catch up on this paper and get this ready for you to see soonish. Thanks again for all your help @ewvanzwet

ewvanzwet commented 9 months ago

Sounds good. Enjoy the conference!

Erik

Yefeng0920 commented 9 months ago

Hi Erik @ewvanzwet, just came back from the conference with Shinichi @itchyshin.

Now I come back to the discussion of predictive power and replication design. I have two ideas to share, based on our earlier discussion. The message is a bit long, but it should be easy for you.

  1. The first is based on the idea in your predictive power paper. We took the most common z-value (2.4) in our dataset and estimated the replication rate (48%) conditional on this z-value. We can then make a figure showing the relationship between the sample size multiplier for this z-value and the replication rate. This gives biologists an interesting insight: given the most common z-value (and thus study design), increasing the sample size x (say, 2) times can lead to a y% (say, 10%) increase in the replication rate. It also shows what the replication rate would be if the sample size were decreased x times. This is easy to figure out, because x times larger sample size leads to $\sqrt{x}$ times larger standard error and thus $\sqrt{x}$ times larger z-value.

  2. Although there is no 1:1 relationship between sample size and replication rate (or power, as you mentioned in your earlier email), we can assume some interesting effect size magnitudes, such as Cohen's d benchmarks. If we assume d = 0.2, then using the z-value we can get the associated standard error, whose square is the sampling (error) variance. In the field of meta-analysis, there are Delta-method-based formulas. Using these formulas, we can back-calculate the sample size from the known d and sampling variance. Finally, we get the relationship between the sample size and the replication rate.

To recap, we can approximate the relationship between the replication rate and sample size, which can be expressed in relative terms (a multiplier) and in absolute terms. What do you think?

Regards, Yefeng

ewvanzwet commented 9 months ago

Hi Yefeng,

Yes, I think item 1 will work. You say:

> This is easy to figure out, because x times larger sample size leads to $\sqrt{x}$ times larger standard error and thus $\sqrt{x}$ times larger z-value

But that's not quite right. If the sample size of the replication is x times the original sample size (where x can be less than 1), then the replication standard error will be 1/sqrt(x) times the original standard error, and the replication SNR will be sqrt(x) times the original SNR. From there, you can compute the probability of a significant replication.
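
The effect of a sample size multiplier x on the replication probability can be sketched as follows. Scaling the replication SNR by sqrt(x) turns each posterior component N(pm, ps^2) into N(sqrt(x)*pm, x*ps^2), to which the standard normal error is added. The helpers `pmix` and `posterior` are reconstructions from their usage in this thread, and the mixture parameters are made up, so the output will not reproduce the 48% mentioned for the eco-evo data.

```r
# Hypothetical helpers: reconstructions, not the definitions used in the thread
pmix <- function(x, p, m, s) sum(p * pnorm(x, mean = m, sd = s))
posterior <- function(z, p, m, s) {
  w <- p * dnorm(z, mean = m, sd = sqrt(s^2 + 1))
  list(p = w / sum(w),
       pm = (m + s^2 * z) / (s^2 + 1),
       ps = sqrt(s^2 / (s^2 + 1)))
}

# Made-up mixture parameters for the z-values
p <- c(0.5, 0.5); m <- c(0, 0); sigma <- c(1.5, 4)
s <- sqrt(sigma^2 - 1)

# Replication probability given the original z-value and a sample size multiplier x
repl.prob <- function(z, x = 1) {
  post <- posterior(z, p, m, s)
  1 - pmix(1.96, p = post$p,
           m = sqrt(x) * post$pm,           # replication SNR is sqrt(x) times the original
           s = sqrt(x * post$ps^2 + 1))     # scaled posterior variance plus unit error
}

x <- c(0.5, 1, 2, 4)
probs <- sapply(x, function(xi) repl.prob(2.4, xi))
data.frame(multiplier = x, replication_probability = round(probs, 3))
```

Plotting `probs` against a finer grid of `x` would give the multiplier curve described in item 1.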

I still don't understand item 2. If I know d=0.2, then I don't need to know the z-value of the original study and I also don't need to know anything about the other eco-evo trials. If d=0.2 and I want 80% probability of a significant result, I should just take n=2x394.

If d=0.2, I can also compute the probability of a significant result at any sample size.
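
The screenshots that accompanied this comment are not preserved. A sketch of what such a calculation could look like, using `power.t.test` from base R together with the z-test approximation from earlier in the thread, is given below; the grid of sample sizes is an arbitrary choice.

```r
d <- 0.2

# Per-group sample size for 80% power with a two-sample t-test
n80 <- power.t.test(delta = d, sd = 1, power = 0.8, sig.level = 0.05)$n
ceiling(n80)

# Probability of a significant result at several per-group sample sizes
n <- c(50, 100, 200, 394)
power.exact <- sapply(n, function(ni)
  power.t.test(n = ni, delta = d, sd = 1, sig.level = 0.05)$power)

# Same quantity from the z-test approximation with SNR = d/sqrt(2/n)
snr <- d / sqrt(2 / n)
power.approx <- pnorm(-1.96, snr, 1) + 1 - pnorm(1.96, snr, 1)

round(cbind(n, power.exact, power.approx), 3)
```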

@.***

@.***
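
As a rough illustration of both points, here is a minimal sketch in base R. It assumes a two-sample comparison with equal group sizes, unit standard deviation, and a two-sided 5% test; the helper `power_at_n` is a hypothetical name and uses a normal approximation rather than the exact t-distribution.

```r
d <- 0.2

# Sample size per group for 80% power (two-sample t-test, two-sided alpha = 0.05)
n80 <- power.t.test(delta = d, sd = 1, power = 0.80, sig.level = 0.05)$n
ceiling(n80)  # 394 per group

# Normal approximation: with n per group, z ~ N(d * sqrt(n / 2), 1),
# so the probability of |z| > 1.96 is the sum of the two tail areas.
power_at_n <- function(n, d = 0.2) {
  snr <- d * sqrt(n / 2)
  pnorm(-1.96 - snr) + 1 - pnorm(1.96 - snr)
}

power_at_n(c(50, 100, 394, 800))
```

The second function shows why nothing else is needed once d is fixed: the probability of a significant result depends only on d and n.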

I understand you want to

approximate the relationship between the replication rate and sample size

Do you have the sample size (n) for all the eco-evo trials? If so, then you could perhaps try something like

    fit=gam((abs(z) > 1.96) ~ s(n,5), family=binomial)
    pred=predict(fit,newdata=data.frame(n=10:500),type="response")

This curve (with confidence band) will be an estimate of the conditional probability of a significant result given the sample size. Since we are not taking anything else into account besides the sample size, it is also an estimate of the conditional probability of a significant replication given the sample size.
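
A self-contained sketch of such a curve with a pointwise band is given below. It is not the `gam` call above: to stay within base R it swaps in a logistic regression on a natural cubic spline (`glm` with `splines::ns`). The data are simulated with made-up settings, purely to make the example runnable.

```r
library(splines)  # ships with base R

set.seed(1)
N <- 2000
n <- sample(10:500, N, replace = TRUE)  # simulated per-group sample sizes
d <- rnorm(N, mean = 0, sd = 0.3)       # simulated true effects (made up)
z <- d * sqrt(n / 2) + rnorm(N)         # z-value = SNR + standard normal noise
sig <- as.integer(abs(z) > 1.96)        # 1 if significant, 0 otherwise

# Logistic regression on a natural cubic spline of n (5 degrees of freedom)
fit <- glm(sig ~ ns(n, df = 5), family = binomial)

grid <- data.frame(n = 10:500)
pr <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)

# Pointwise 95% band, built on the logit scale and transformed back
est   <- plogis(pr$fit)
lower <- plogis(pr$fit - 1.96 * pr$se.fit)
upper <- plogis(pr$fit + 1.96 * pr$se.fit)

if (interactive()) {
  plot(grid$n, est, type = "l", ylim = c(0, 1),
       xlab = "sample size (n)", ylab = "P(|z| > 1.96)")
  lines(grid$n, lower, lty = 2)
  lines(grid$n, upper, lty = 2)
}
```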

Erik

Yefeng0920 commented 9 months ago

Hi Erik @ewvanzwet, thanks for your swift reply and detailed explanation. Now I understand your point regarding item 2. I decided to drop it because we do not have sample size information in this dataset. Focusing on item 1 (sample size multiplier) is still very good. I would be grateful if you could help with the implementation of item 1 in R:

    multiplier=seq(0.1,20,by=0.1) # setting a series of multipliers
    z=sort(z_mode/multiplier) # z_mode is the mode of z-value
    replicate=rep(NA,length(z))
    s=sqrt(sigma^2-1)
    for (i in 1:length(z)){
      post=posterior(z[i],p,m,s)
      pp=post$p
      pm=post$pm
      ps=post$ps
      replicate[i]=1 - pmix(1.96,p=pp,m=pm,s=sqrt(ps^2 +1))
    }

Then we visualize the relationship between multiplier and replicate. Is this correct? I know one proper way to do it is to (1) re-calculate the z-values based on the adapted standard error (original error times 1/sqrt(multiplier)), (2) re-fit a 4-component mixture distribution, (3) get the conditional and joint distributions, and (4) estimate the replication rate. But what if I only want to apply the multiplier to the mode of the z-values? In other words, does the above code serve my purpose?

Regards, Yefeng

ewvanzwet commented 9 months ago

I'm actually still not quite sure what you are trying to accomplish. Earlier you wrote:

This provides an interesting insight to biologists: given the most common z-value (and thus study design), increasing sample size x (say, 2) times can lead to y% (say, 10) increase in the replication rate.

If you fix the z-value at 2.4, then you can calculate how large the (predictive) power of the replication is when you multiply the sample size of the original study by some factor. So, you could make a graph of the sample size multiplier versus the predictive power. The graph will show: If the observed z-value is 2.4, then increasing the sample size by a factor x will give you predictive power y. When the multiplier is 1, you simply get the predictive power of an "exact" replication.

You could also divide the predictive power on the y-axis by the predictive power of an "exact" replication. Then you'll get a graph of the sample size multiplier versus the predictive power multiplier. That is, if the observed z-value is 2.4, then increasing the sample size by a factor x will increase the predictive power by a factor y.

I don't think your code is correct, because I don't think you can simply multiply the observed z-value by some factor. Have a look at pp. 11-12 of the online supplement of my paper with Goodman:

    predpow=function(z,mult){
      # compute predictive power when original experiment
      # produced z, and we multiply the sample size for
      # the replication
      z=abs(z)
      pr=dmix(z,p,m,sigma) / (dmix(z,p,m,sigma) + dmix(-z,p,m,sigma)) # pr(z >0 | |z|)
      pr=drop(pr)
      post=posterior( z,p,m,tau) # p(SNR|z= |z|)
      pm=sqrt(mult)*post$m
      ps=sqrt(mult)*post$s
      powpos=1 - pmix(1.96,p=post$p,m=pm,s=sqrt(ps^2 + 1)) # signif given z=|z|
      post=posterior(-z,p,m,tau) # p(SNR|z=-|z|)
      pm=sqrt(mult)*post$m
      ps=sqrt(mult)*post$s
      powneg= pmix(-1.96,p=post$p,m=pm,s=sqrt(ps^2 + 1)) # signif given z=-|z|
      pr*powpos + (1-pr)*powneg # signif given |z|
    }

This function is a little more complicated than necessary because it also works when the distribution of the SNR is not symmetric. Anyway, you can now compute the predictive power when z=2.4, and your replication is (say) twice as large:

predpow(z=2.4,mult=2)
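
To draw the multiplier graphs described earlier, something along the following lines should work. This is a self-contained sketch, not the code from the supplement. The mixture parameters `p`, `m` and `tau` are hypothetical placeholders rather than the fitted eco-evo values. The helpers `pmix` and `posterior` are minimal stand-ins written for this example, and `predpow_sym` is a simplified version that assumes a symmetric (zero-mean) SNR distribution, so the sign of z does not matter.

```r
# Hypothetical zero-mean normal mixture for the SNR (placeholders, NOT fitted values)
p   <- c(0.5, 0.3, 0.2)  # mixture weights
m   <- c(0, 0, 0)        # component means
tau <- c(0.8, 2, 5)      # component standard deviations of the SNR

# CDF of a normal mixture at a single point x
pmix <- function(x, p, m, s) sum(p * pnorm(x, mean = m, sd = s))

# Posterior of the SNR given z, assuming z | SNR ~ N(SNR, 1):
# again a normal mixture, with updated weights, means and SDs
posterior <- function(z, p, m, s) {
  w <- p * dnorm(z, mean = m, sd = sqrt(s^2 + 1))
  list(p = w / sum(w),
       m = (z * s^2 + m) / (s^2 + 1),
       s = sqrt(s^2 / (s^2 + 1)))
}

# Predictive power when the replication sample size is mult times the original
predpow_sym <- function(z, mult) {
  post <- posterior(abs(z), p, m, tau)
  pm <- sqrt(mult) * post$m  # replication SNR is sqrt(mult) times the original SNR
  ps <- sqrt(mult) * post$s
  1 - pmix(1.96, p = post$p, m = pm, s = sqrt(ps^2 + 1))
}

mult  <- seq(0.1, 20, by = 0.1)
power <- sapply(mult, function(k) predpow_sym(2.4, k))
rel   <- power / predpow_sym(2.4, 1)  # relative to an "exact" replication

if (interactive()) {
  plot(mult, power, type = "l", ylim = c(0, 1),
       xlab = "sample size multiplier", ylab = "predictive power (z = 2.4)")
}
```

Plotting `rel` instead of `power` gives the second graph, the sample size multiplier versus the predictive power multiplier.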

Erik

Yefeng0920 commented 9 months ago

Hi Erik @ewvanzwet, thanks for correcting me. I initially misunderstood and thought that the z-value was larger by a factor of sqrt(multiplier). You had indeed mentioned this, probably twice, but I overlooked it. Only when I looked at your code did I realize my misunderstanding. I also looked at the supplement of your predictive power paper, where the explanation was quite clear, along with the R function. Awesome, it helped me a lot.

Regards, Yefeng