biona001 / ghostknockoff-gwas-reproducibility


directly pass X into GK-lasso? Shouldn't X be inaccessible? #1

Open szcf-weiya opened 3 weeks ago

szcf-weiya commented 3 weeks ago

Hi, I came across your paper

Chen, Z., He, Z., Chu, B. B., Gu, J., Morrison, T., Sabatti, C., & Candès, E. (2024). Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression (arXiv:2402.12724). arXiv. https://doi.org/10.48550/arXiv.2402.12724

it is an exciting work.

When I check the details of the code, I am slightly confused.

It seems that the simulation code pretends that X is known, since X and its knockoff X_tilde are passed directly into the knockoff_sqrt_lasso function.

https://github.com/biona001/ghostknockoff-gwas-reproducibility/blob/79e8e989e43bcc58f3ba5b794001fbdca08bb82f/chen_et_al/auto_code.R#L170-L185

However, X itself should be inaccessible according to the problem setup. Specifically, for GK-sqrtlasso, the code seems to skip Step 2 mentioned in the paper.

(screenshot of Step 2 from the paper)

Am I missing something? Looking forward to your explanation.

zhaomeng1998 commented 3 weeks ago

Hi, thanks for your interest! In the paper, we showed that the GhostKnockoff-based methods are statistically identical to their corresponding knockoff-based methods with individual data. As a result, for instance, the power and FDR of GK-sqrtlasso are equal to those of knockoffs with the square-root lasso feature importance statistic based on individual data. For simplicity, we use the latter to evaluate power and FDR in our simulations. In practice, you should first follow Algorithm 3 and the Appendix to generate \widecheck{X} and \widecheck{Y}, and then use the code provided. Thanks again for the question!
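For readers following along, the core idea behind generating \widecheck{X} and \widecheck{Y} can be illustrated with a minimal sketch (this is not the authors' Algorithm 3, just the underlying principle): any Cholesky-type factor of the Gram matrix of [X, y] yields surrogate data whose Gram matrix matches that of the original individual-level data exactly. All variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5

# Individual-level data (inaccessible in the real summary-statistic setting).
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Summary statistics: the (p+1) x (p+1) Gram matrix of [X, y].
A = np.column_stack([X, y])
G = A.T @ A

# Surrogate data from a Cholesky factor of G: R is (p+1) x (p+1) and
# satisfies R.T @ R == G, so [X_check, y_check] = R reproduces the Gram matrix.
R = np.linalg.cholesky(G).T          # upper-triangular factor
X_check, y_check = R[:, :p], R[:, p]

assert np.allclose(R.T @ R, G)
assert np.allclose(X_check.T @ X_check, X.T @ X)
assert np.allclose(X_check.T @ y_check, X.T @ y)
```

Note that the surrogate has only p+1 "rows" rather than n, which is exactly why it can be built from summary statistics alone.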

szcf-weiya commented 3 weeks ago

Hi Zhaomeng, thanks for your explanation! I noticed the equivalence, but I am not sure whether the comparison to GK-marginal (which only uses the Z-scores, without individual data) is fair when GK-lasso is run with exact knowledge of X. In other words, is there some information loss (or not?) in constructing $\mathbf{\check X}$ and $\mathbf{\check Y}$? Taking that into account, would GK-lasso still always be better than GK-marginal?

zhaomeng1998 commented 3 weeks ago

Hi! Using \widecheck{X} and \widecheck{Y} does not incur any information loss, as the final rejection set is statistically identical to that of the corresponding original knockoff procedure. In other words, the power calculated using GK-sqrtlasso with \widecheck{X} and \widecheck{Y} is equal to the power calculated using individual-level data with the square-root lasso feature importance statistic. This follows (essentially) from Corollary 1, together with the fact that the Gram matrix of \widecheck{X} and \widecheck{Y} exactly matches that of X and Y (they are constructed by applying an algorithm to the Gram matrix of X and Y). The same conclusion holds for GK-marginal and the other GK-based methods mentioned in the paper. Thanks for asking!
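To see concretely why no information is lost for penalized-regression statistics: a lasso-type objective depends on the data only through X^T X, X^T y, and y^T y, so fitting on surrogate data with the same Gram matrix recovers the same coefficients. The sketch below demonstrates this with a plain lasso (sklearn's, for simplicity, rather than the square-root lasso used in the paper); the surrogate is built from a Cholesky factor of the Gram matrix, and the penalty is rescaled because sklearn divides the squared loss by the sample count, which differs between the two fits. All names here are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 300, 8

# Individual-level data with a sparse signal.
X = rng.standard_normal((n, p))
y = X @ np.concatenate([np.ones(3), np.zeros(p - 3)]) + rng.standard_normal(n)

# Surrogate data with the identical Gram matrix, via Cholesky of [X, y]^T [X, y].
A = np.column_stack([X, y])
R = np.linalg.cholesky(A.T @ A).T
X_check, y_check = R[:, :p], R[:, p]
m = X_check.shape[0]                  # = p + 1 surrogate "samples"

# sklearn minimizes (1/(2*n_samples))||y - Xb||^2 + alpha*||b||_1, so the
# surrogate fit needs alpha rescaled by n/m to target the same minimizer.
alpha = 0.1
fit = Lasso(alpha=alpha, fit_intercept=False,
            tol=1e-12, max_iter=100000).fit(X, y)
fit_check = Lasso(alpha=alpha * n / m, fit_intercept=False,
                  tol=1e-12, max_iter=100000).fit(X_check, y_check)

# Identical coefficients (up to solver tolerance) from n rows vs. p+1 rows.
assert np.allclose(fit.coef_, fit_check.coef_, atol=1e-6)
```

Since the knockoff filter's rejection set is a deterministic function of these coefficients, matching Gram matrices imply matching rejections, which is the sense in which the two procedures are statistically identical.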