CNSGenomics / Heritability_WGS

This repository contains the code used for estimating heritability from WGS data
10 stars 1 forks source link

Memory Issue Encountered in GREML-LDMS #4

Open yjge opened 1 month ago

yjge commented 1 month ago

Hi, I am currently calculating heritability in WGS data using the GREML-LDMS method. However, I am facing an issue at the "segment-based LD score" step while running "gcta64 --bfile test --ld-score-region 200 --out test". My process in this step is consistently being terminated.

I am working with a WGS sample consisting of ~30,000 individuals, and the number of variants that passed QC was ~14,000,000 in a chromosome. My computational memory capacity is ~1TB.

Error log: Calculating LD score between SNPs (block size of 10000Kb with an overlap of 5000Kb between blocks); LD rsq threshold = 0)... Calculating allele frequencies ... terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad alloc ./step1.sh: line 7: 328168 Aborted

Additionally, sometimes I encounter an error message stating 'xxx killed'.

Queries: 1.Considering the sample size of ~30,000 and a computational memory limit of ~1TB, what number of variants should I retain? 2.Should I consider pruning the variants? I noticed that the GREML-LDMS papers (Wainschtein et al. 2022; Selvaraj et al. 2022) conducted LD-pruning, but I am unsure whether I need to prune the variants for LD before computing the segment-based LD score. 3.Is it reasonable to select only a subset of individuals to reduce memory usage?

Thank you for your support.

WPierrick commented 4 weeks ago

Hi, The LD-pruning in Wainschtein et al. 2022 was performed to compute PCs, not for assigning variants in LD-bins for GREML-LDMS. You seem to have a lot of variants on your chromosome, I guess you don't filter on allele frequency? I would suggest you perform LDMS on the variants with a AF threshold (you can then pool all the ultra-rare variants in one GRM if you really want to keep them). You can also select a r^2 threshold (--ld-rsq-cutoff, like 0.02 for example) or a smaller LD block window (typically 1Mb should be enough). If you want, you can also compute all pairwise correlations with plink2 and --parallel and manually re-compute LDscore.