carlaml / LDpred-funct

Here we develop the software for the method LDpredfunct described in https://www.biorxiv.org/content/early/2018/07/24/375337
GNU General Public License v3.0
17 stars 7 forks source link

Questions about LD files to use #2

Open alesss78 opened 5 years ago

alesss78 commented 5 years ago

I am trying to use LDpred-funct in order to generate a polygenic risk score from a summary statistic. I am following the tutorial reported here but I have some questions:

At first, I used S-LDSC in order to calculate regression coefficients. The github tutorial page of S-LDSC suggests to use as baseline model LD scores the ones in the file: 1000G_Phase1_baseline_ldscores.tgz downloaded from https://data.broadinstitute.org/alkesgroup/LDSCORE/

I managed to run S-LDSC and to obtain the result files. However, at the point of calculating the per -SNP heritability, present tutorial says to use baselineLD.*.annot.gz files, in order to extract the so called X matrices.

1) From which compressed tgz file, baselineLD.*.annot.gz files have to be taken? I see different possibilities in the download page: 1000G_Phase3_baselineLD_ldscores.tgz 1000G_Phase3_baselineLD_v1.1_ldscores.tgz, 1000G_Phase3_baselineLD_v2.0_ldscores.tgz, 1000G_Phase3_baselineLD_v2.1_ldscores.tgz

2) In the S-LDSC step when calculating the regression coefficients, I used a different baseline model LD file (baselineLD.*.annot.gz) than those reported above. Is this correct or possibly I should have used one of the baselineLD file above? (in case, which one?)

3) The tutorial says to divide regression coefficients by the parameter h2g, the latter being taken from the .log file generated by S-LDSC. In the .log file I don't see any h2g parameter, but rather a h2 parameter. Is that the correct one to be taken?

Thank you in advance for the answers to these questions

carlaml commented 5 years ago

Hi, 1. 1000G_Phase3_baselineLD_v1.1_ldscores.tgz was the one used in my paper. The differences between each of this files is described in the README file in https://data.broadinstitute.org/alkesgroup/LDSCORE/readme_baseline_versions

  1. Which model did you use? It is expected to see a moderate difference in prediction R2 between the models you use depending on the set of annotations that you are using (see Table S8 from my manuscript).

  2. Yes, you can use h2 from the log file.

alesss78 commented 5 years ago

Hi Carla Thank you for you response. I am experiencing a memory error while trying to run LDpred-funct. I am using imputed data (about 90 milion variant, unfiltered) of about 400000 individuals and I run Ldpred-funct on a Google machine with 16 cores and 512 gigas or Ram.

Importantly: At the moment of the memory error, Ldpred-funct was using only 271 gigas out of the 512 gigas available. Additionally, the memory error occurs regardless of the training sample parameter (Ldpred-funct Input parameter N.) value used: I tried both N=400000 and N=10

The other inputs of LDpred-funct are:

Ldpred-funct generates a memory error problem while calculating the PRS for Chromosome 1.

We have installed numpy version 1.14.5 and scipy version 1.2.1**

I would kindly ask for your help to overcome this problem. The logs of Ldpred-funct are as following:

{'ld_radius': None, 'verbose': False, 'skip_coordination': False, 'K': None, 'posterior_means': '/opt/LDpred-funct/output/ldpredfunct_posterior_means', 'H2': 0.0686, 'coord': '/ldpred-morespace/coordinate/Coord_Final', 'N': 10, 'gmdir': None, 'gf': '/opt/LDpred-funct/ldpred-links/Final.chrom[1:22]', 'chisq': False, 'pf': '/opt/LDpred-funct/pheno', 'skip_ambiguous': False, 'maf': 0.01, 'ssf': '/opt/LDpred-funct/statfile', 'FUNCT_FILE': '/opt/LDpred-funct/funcfile', 'check_mafs': False, 'out': '/opt/LDpred-funct/output/ldpredfunct_prs'} Step 1: Coordinate summary statistics, genotype and functional enrichments files.

Parsing SNP list Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom1.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom2.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom3.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom4.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom5.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom6.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom7.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom8.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom9.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom10.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom11.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom12.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom13.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom14.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom15.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom16.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom17.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom18.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom19.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom20.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom21.bim Parsing bim file: /opt/LDpred-funct/ldpred-links/Final.chrom22.bim Total 92646145 SNPs in all genotypes files. Parsing the SNP-Heritability file: /opt/LDpred-funct/funcfile rs575272151 3.81811453166102e-06

Get intersection between SNPs in genotypes file and functional enrichments file. 9946704 found in both files Parsing the summary statistics file: /opt/LDpred-funct/statfile 0 SNPs with p-value rounded to zero in summary statistics file SS file loaded, now sorting and storing in HDF5 file. 505424 SNPs on chromosome 1 Still 505424 SNPs on chromosome 1 546373 SNPs on chromosome 2 Still 546373 SNPs on chromosome 2 467098 SNPs on chromosome 3 Still 467098 SNPs on chromosome 3 476461 SNPs on chromosome 4 Still 476461 SNPs on chromosome 4 416892 SNPs on chromosome 5 Still 416892 SNPs on chromosome 5 415593 SNPs on chromosome 6 Still 415593 SNPs on chromosome 6 381258 SNPs on chromosome 7 Still 381258 SNPs on chromosome 7 357813 SNPs on chromosome 8 Still 357813 SNPs on chromosome 8 276716 SNPs on chromosome 9 Still 276716 SNPs on chromosome 9 336863 SNPs on chromosome 10 Still 336863 SNPs on chromosome 10 325501 SNPs on chromosome 11 Still 325501 SNPs on chromosome 11 310625 SNPs on chromosome 12 Still 310625 SNPs on chromosome 12 245054 SNPs on chromosome 13 Still 245054 SNPs on chromosome 13 214610 SNPs on chromosome 14 Still 214610 SNPs on chromosome 14 184209 SNPs on chromosome 15 Still 184209 SNPs on chromosome 15 192858 SNPs on chromosome 16 Still 192858 SNPs on chromosome 16 169921 SNPs on chromosome 17 Still 169921 SNPs on chromosome 17 190234 SNPs on chromosome 18 Still 190234 SNPs on chromosome 18 146027 SNPs on chromosome 19 Still 146027 SNPs on chromosome 19 148753 SNPs on chromosome 20 Still 148753 SNPs on chromosome 20 91425 SNPs on chromosome 21 Still 91425 SNPs on chromosome 21 89999 SNPs on chromosome 22 Still 89999 SNPs on chromosome 22 6489707 SNPs parsed from summary statistics file. Coordinating summary statistics and genotypes Unable to find phenotype values. Working on chromsome: chrom_1 Bim file information loaded Found 505424 SNPs present in both genotype and summary statistics datasets 0 SNPs were excluded due to ambiguous nucleotides. 1 SNPs were excluded due to non-matching nucleotides. Iterating over file to genotypes 13171 SNPs with MAF < 0.010 were filtered 492252 SNPs were retained on chromosome 1. Traceback (most recent call last): File "/opt/LDpred-funct/ldpredfunct.py", line 163, in main() File "/opt/LDpred-funct/ldpredfunct.py", line 128, in main method="STANDARD_FUNCT", skip_ambiguous=p_dict['skip_ambiguous']) File "/opt/LDpred-funct/coord_genotypes_ldpredfunct_v1_2.py", line 635, in coordinate_genot_ss rb_prs = sp.dot(sp.transpose(raw_snps), log_odds) MemoryError

Thank you in advance for your time!