cumc / bioworkflows

In-house computational biology workflows at Columbia Neurology
Sample size for LD estimation (EUR) #106

Open Shicheng-Guo opened 2 years ago

Shicheng-Guo commented 2 years ago

I notice you selected a random subset of unrelated samples. Two questions:

1) For the EUR population, which dataset did you use: 1000G_CEU, hapmap_CEU_r23a_filtered, UK10K, or the HRC reference panel?

2) For the EUR population, did you estimate the minimum sample size needed to obtain stable LD estimates for lead SNP identification?

Thanks.

Shicheng

Shicheng-Guo commented 2 years ago

BTW: Are there any pre-calculated clumped SNPs for GTEx V8 that can be downloaded directly?

gaow commented 2 years ago

@Shicheng-Guo which workflow are you referring to? In our applications we mostly have the matching genotypes so we don't really use reference panels as far as I can recall, for most workflows in this repo.

Shicheng-Guo commented 2 years ago

Thanks Gao for your response. I mean the workflow below:

https://github.com/cumc/bioworkflows/blob/master/GWAS/LD_Clumping.ipynb

Thanks

Shicheng

Shicheng-Guo commented 2 years ago

I notice lots of papers use 1000 Genomes EUR as the reference; however, I prefer to use UKB WGS individual-level data as the reference. My question is: what is the best sample size to use? The 150K WGS data will make the process very time-consuming, while a small sample size may cause biased LD clumping.

gaow commented 2 years ago

@Shicheng-Guo our LD clumping application was for association analysis with UK Biobank data. That is why we selected subsets of UKB genotypes and used them as the reference panel. We used 2000 samples, I believe.

I don't think LD clumping is as picky about the LD panel as, e.g., fine-mapping applications. Since our application was on UKB data itself, we believe 2000 samples is a good enough approximation. We don't have a reference for GTEx V8 data. I have not formally assessed this, but if you are concerned, perhaps you can take a few regions of UKB data, compute LD panels from sample sizes of 500 to 10K, and see how robust your estimates are?
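A robustness check along the lines suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the LD_Clumping workflow: simulated toy genotypes stand in for a real UKB region, the LD computed from all simulated samples stands in for the "true" panel, and every function name (`simulate_genotypes`, `ld_r2`, `ld_error_by_sample_size`) is hypothetical. With real data you would replace the simulated matrix with a samples-by-SNPs dosage matrix for one region.

```python
import numpy as np


def simulate_genotypes(n_samples, n_snps, rho=0.9, seed=0):
    """Toy diploid genotypes (0/1/2) whose LD decays with distance (assumed model)."""
    rng = np.random.default_rng(seed)

    def haplotypes():
        # Latent AR(1) Gaussian process along the region, thresholded into alleles.
        z = np.empty((n_samples, n_snps))
        z[:, 0] = rng.standard_normal(n_samples)
        for j in range(1, n_snps):
            noise = rng.standard_normal(n_samples)
            z[:, j] = rho * z[:, j - 1] + np.sqrt(1 - rho**2) * noise
        return (z > 0.5).astype(np.int8)  # alternate allele frequency of about 0.31

    # Sum two independent haplotype draws to get diploid dosages.
    return haplotypes() + haplotypes()


def ld_r2(genotypes):
    """Pairwise r^2, taken as the squared Pearson correlation of dosages."""
    with np.errstate(invalid="ignore", divide="ignore"):
        r = np.corrcoef(genotypes.astype(float), rowvar=False)
    return r**2


def ld_error_by_sample_size(genotypes, sizes, n_repeats=5, seed=1):
    """Mean absolute r^2 difference between random subsamples and the full panel."""
    rng = np.random.default_rng(seed)
    reference = ld_r2(genotypes)
    upper = np.triu_indices(genotypes.shape[1], k=1)  # each SNP pair once
    errors = {}
    for size in sizes:
        diffs = []
        for _ in range(n_repeats):
            rows = rng.choice(genotypes.shape[0], size=size, replace=False)
            sub = ld_r2(genotypes[rows])
            # nanmean skips pairs where a SNP is monomorphic in the subsample.
            diffs.append(np.nanmean(np.abs(sub[upper] - reference[upper])))
        errors[size] = float(np.mean(diffs))
    return errors


if __name__ == "__main__":
    panel = simulate_genotypes(n_samples=10_000, n_snps=50)
    results = ld_error_by_sample_size(panel, [500, 1000, 2000, 5000])
    for size, err in results.items():
        print(f"n={size:>5}  mean |r2 difference| = {err:.4f}")
```

If the error flattens out well before the largest size, a smaller panel is probably adequate for clumping in that region. The flattening point will depend on the allele frequencies and the r2 threshold used, so it is worth repeating the check on a few regions.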