Closed lottensmann closed 5 years ago
Hi Linda,
Thanks for letting me know -- I've uploaded the tissue expression data to the FTP site. FYI, we didn't use that output for anything Benchmarker-related, which is probably why I forgot; when we used tissue expression data as input for gene prioritization, we put it in the gene set enrichment field in the .cfg file, which DEPICT uses as the basis for prioritization. Whatever is in the tissue expression file will be used in generating the .tissuenenrichment.txt file. That said, you might of course still find it interesting!
With regard to MAGMA, I haven't used the newer version with Benchmarker much, although I have used it for some other types of analyses recently. I believe I did at one point compare 1.06b and 1.07 gene-level and gene-set-level results for one trait, and I think they were fairly correlated. My strong suspicion is that the overall results will be reasonably concordant, but you can always confirm that by just running one or two similar analyses to what we did and checking that the results are similar. I recently added a file to the FTP site containing the genes prioritized by several of our main analyses, so that may be helpful to you. (And of course Benchmarker theoretically could tell you which version was performing better, at least when used in conjunction with our gene prioritization algorithm!)
Hope that helps! Rebecca
Hi Linda,
Let me address your first question first. How different are your BMI DEPICT prioritization results from the ones on the Github? DEPICT will give slightly different results each time you run it (due to the permutations), so if they're very highly correlated then it may just be that. Also, if you're looking at the FDRs, make sure you fix the missing tab issue with FDRs >=0.20 (it's described in the README of this repository -- unfortunately I can't fix the problem myself, since DEPICT isn't my software).
With respect to the second question, just to clarify: in our manuscript, we compared DEPICT and MAGMA in a couple of different ways. In the direct head-to-head comparison, we did use a version of DEPICT where we prioritized based on the gene set enrichment results, just like we did with MAGMA. In that case, as you said, you should be able to use the same script for prioritization of DEPICT gene set enrichment results as MAGMA gene set enrichment results, though you'll have to change the arguments to account for the different formatting of the gene set enrichment files (I think it's the src/prioritize_genes_from_enrichment_results.py script). I believe you should be able to fill in the arguments for that script such that it works for DEPICT _genesetenrichment.txt files as well, but let me know if that's not the case (I'd quickly spot-check the results and make sure they look right). I just made a couple of small tweaks to that script to make sure it should work for DEPICT files (fixed an issue with specifying tab as the delimiter, and now it should force you to give it 22 input gene set enrichment files).
We also compared DEPICT and MAGMA where we ran DEPICT normally (i.e. using the gene prioritization results); specifically, we did that in the model where we directly compared DEPICT, MAGMA, and NetWAS. So just to be clear, you can still compare DEPICT and MAGMA when running DEPICT normally, we just did it the other way for sort of the academic purpose of being as strict as possible about comparing algorithms rather than data types.
Best, Rebecca
Hi Rebecca,
thanks so much for your reply! You were right about the DEPICT results, the DEPICT p-values were a bit different but the final results of the benchmarker analysis with depict based on the gene prioritization results and the results of the analysis with depict based on the gene set enrichment results were very similar to those reported in the paper and also the final results of the benchmarker run with magma for all 6 binarizations were very close to the results in the paper. The results I mentioned above are for the bmi summary stats that are provided in the github.
However, when I repeated the analysis with the schizophrenia summary stats I got results that are different to those reported in the paper for magma and depict based on gene set enrichment results (I have not tried depict based on gene prioritization for this dataset).
Just to make sure that I did the analysis for depict based on gene set enrichment results correctly: The head of the resulting file of my 4_prioritize_genes_from enrichment_results_depict.sh script looks like this:
Trait Label Chromosome Ensembl_Gene HUGO_Gene Set_Name scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000152092 ASTN1 MP:0001454 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000143786 CNIH3 MP:0001454 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000117152 RGS4 MP:0001454 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000117600 ENSG00000117600 MP:0001469 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000121905 HPCA MP:0001469 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000198216 CACNA1E MP:0001469 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000121753 BAI2 MP:0001469 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000033122 LRRC7 MP:0001469 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000116254 CHD5 GO:0019717 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000187730 GABRD GO:0030425 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000116983 HPCAL4 GO:0030425 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000162728 KCNJ9 GO:0030425 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000116544 DLGAP3 GO:0030425 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000008118 CAMK1G MP:0001473 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000203685 C1orf95 MP:0001473 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000117598 ENSG00000117598 GO:0016358 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000198929 NOS1AP GO:0016358 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000225206 MIR2682 GO:0016358 scz_2014_EUR Top50GenesPerGeneSet 1 ENSG00000162374 ELAVL4 GO:0030424
Do you think I used the script correctly?
I’m now thinking that since the results are only different to those in the paper for the schizophrenia dataset that maybe I am not using the summary stats correctly. I downloaded the file daner_PGC_SCZ49.sh2_mds10_1000G-frq_2 and used the arguments --N-cas 33640 --N-con 43456 when running the munge_sumstats script for LDSC (I also tried adding the argument –daner but that didn’t lead to different results). I had to exclude SNPs with OR > 1e300 because otherwise the script didn’t run because those values couldn’t be read by python. I only had to exclude 38 SNPs so I think it shouldn’t make a big difference but just in case I wanted to ask if you also excluded those?
Best, Linda
Hi Linda,
I'm not sure, but let me ask a few follow-up questions. 1) The results you copied and pasted are from the top-50-genes binarized DEPICT run (not MAGMA), right? If so, those look at least reasonably similar to my results, but again, they're not directly comparable because DEPICT will give slightly different answers every time. 2) I looked back at how I processed the schizophrenia summary statistics, and I used a flag of --N 77096, which should be equivalent to your --N-cas/--N-con. I did apply a MAF > 1% filter first because I noticed in that particular GWAS, there were some fairly wild and impossibly large ORs which were obviously due to very low MAF (e.g. rs148760206, rs59574201). I don't know how much that would affect the ultimate results, especially since LDSC already removes SNPs absent from 1000 Genomes, but it might be worth trying that to see if it helps. 3) How different were your results? And which results specifically were they -- was your calculated coefficient very different from the one I reported?
Best, Rebecca
Hi Rebecca,
here are some results of my analyses: Schizophrenia Top 200 Genes per gene set
Individual DEPICT: prop SNPs,prop h2, coefficient, coefficient se, coefficient z-score, coefficient normalised, coefficient normalised se my old results: 0.137,0.2273,3.37E-08,7.59E-09,4.4449,0.5041,0.1134 my new results: 0.137,0.2248,3.18E-08,7.35E-09,4.3203,0.4758,0.1101 results in the paper: 0.142,0.2514,4.75E-08,7.50E-09,6.3269,0.709,0.1121
Individual MAGMA: my old results: 0.104,0.1906,4.27E-08,7.69E-09,5.555,0.6436,0.1159 my new results: 0.105,0.1918,4.38E-08,7.74E-09,5.5994,0.6548,0.1169 results in the paper: 0.1314,0.2317,4.31E-08,7.35E-09,5.8674,0.6482,0.1105
joint model DEPICT: old results: 0.137,0.2276,2.3E-08,8.35E-09,2.7592,0.34535,0.12516 new results: 0.137,0.2246,2.0E-08,7.87E-09,2.5899,0.3062,0.11823 paper: 0.142,0.2517,3.5E-08,9.58E-09,3.6499,0.52179,0.14296
joint model MAGMA: old results: 0.105,0.1896,3.3E-08,8.51E-09,3.9188,0.4997,0.1275 new results: 0.105,0.1908,3.5E-08,8.45E-09,4.1475,0.5264,0.1269 paper: 0.1314,0.2304,2.28E-08,9.47E-09,2.4132,0.34099,0.1413
jackknife values DEPICT v. MAGMA: se,z score,p value old results: 0.21435,-0.72027,0.47135 new results: 0.20512,-1.07348,0.28306 paper: 0.25672,0.70427,0.48127
So the maf filtering did not change the coefficients a lot and for the Top200GenesPerGeneSet definition file only the coefficient for individual MAGMA analysis matches the results in the paper. For the other definition files I will only list the normalised coefficient values and se for the individual DEPICT and MAGMA analyses here, so the message does not get too long:
coefficient normalised, coefficient normalised se
Top 100:
individual DEPICT: old results: 0.5707,0.1104 new results: 0.4429,0.1098 paper: 0.6855,0.1123
individual MAGMA: old results: 0.7564,0.1150 new results: 0.7516,0.1160 paper: 0.6809,0.1112
Top 50:
individual DEPICT: old results: 0.3742,0.0984 new results: 0.4197,0.0966 paper: 0.5875,0.1105
individual MAGMA: old results: 0.5605,0.1193 new results: 0.5683,0.1234 paper: 0.6337,0.1125
Top Z 1.96:
Individual DEPICT: old results: 0.6465,0.1099 new results: 0.6709,0.1063 paper: 0.6421,0.1053
Individual MAGMA: old results: 0.5561,0.1093 new results: 0.5635,0.1093 paper: 0.7132,0.1031
Top Z 2.58:
Individual DEPICT: old results: 0.5911,0.1141 new results: 0.5824,0.1172 paper: 0.7449,0.1145
Individual MAGMA: old results: 0.6910,0.1100 new results: 0.6964,0.1104 paper: 0.848,0.109
Top Z 3.29:
Individual DEPICT: old results: 0.6387,0.1075 new results: 0.5488,0.1068 paper: 0.7649,0.1095
Individual MAGMA: old results: 0.7077,0.1165 new results: 0.7117,0.117 paper: 0.675,0.1116
So to sum up the DEPICT coefficient are always outside the +/- se range except for Top Z 1.96, and the MAGMA coefficients are mostly inside the range except for Top Z 1.96 and Top Z 2.58.
I actually also ran the analysis for the BMI gwas and compared the results with those of the paper, for those the DEPICT and MAGMA coefficients were always inside the +/- se range except for the DEPICT norm. coefficient for Top Z 1.96(I got 0.268,0.0884 and the results in the paper were 0.11985,0.0922).
Thank you for your help and I am sorry for the long message!
Best, Linda
Hi Linda,
I'm thinking about possible explanations for this. Here's one thought -- what GWAS p-value cutoff did you use for DEPICT (in the .cfg file for DEPICT, association_pvalue_cutoff)? In the paper I used 1e-5; is it possible you're using 5e-8? Seems like the issue is more substantially with DEPICT than MAGMA overall.
Also, here is my munge_sumstats log output from processing the schizophrenia GWAS. Does anything here look substantially different? (The input file name is different because I do a pre-processing step of matching rsID names with chrom:pos:alphabeticalA1:alphabeticalA2).
Call: ./munge_sumstats.py \ --out scz_ripke_forLDSC \ --merge-alleles ../../ldsc/data/phase_1/w_hm3.snplist \ --N 77096.0 \ --snp rsID_1kgP1 \ --sumstats scz_ripke_formattedGwas_withrsIDs.txt.gz
Interpreting column names as follows: info: INFO score (imputation quality; higher --> better imputation) rsID_1kgP1: Variant ID (e.g., rs number) A1: Allele 1, interpreted as ref allele for signed sumstat. Pval: p-Value A2: Allele 2, interpreted as non-ref allele for signed sumstat. OR: Odds ratio (1 --> no effect; above 1 --> A1 is risk increasing)
Reading list of SNPs for allele merge from ../../ldsc/data/phase_1/w_hm3.snplist Read 1217311 SNPs for allele merge. Reading sumstats from scz_ripke_formattedGwas_withrsIDs.txt.gz into memory 5000000 SNPs at a time. .. done Read 9409799 SNPs from --sumstats file. Removed 7260847 SNPs not in --merge-alleles. Removed 961924 SNPs with missing values. Removed 63100 SNPs with INFO <= 0.9. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with out-of-bounds p-values. Removed 0 variants that were not SNPs or were strand-ambiguous. 1123928 SNPs remain. Removed 0 SNPs with duplicated rs numbers (1123928 SNPs remain). Using N = 77096.0 Median value of OR was 1.0, which seems sensible. Removed 0 SNPs whose alleles did not match --merge-alleles (1123928 SNPs remain). Writing summary statistics for 1217311 SNPs (1123928 with nonmissing beta) to scz_ripke_forLDSC.sumstats.gz.
Metadata: Mean chi^2 = 1.798 Lambda GC = 1.592 Max chi^2 = 120.758 1725 Genome-wide significant SNPs (some may have been removed by filtering).
Conversion finished at Thu Oct 4 10:38:26 2018 Total time elapsed: 44.03s
Best, Rebecca
Also, it's probably worth thinking about if the inconsistencies are coming from the actual original DEPICT/MAGMA results or from the gene prioritization step (i.e. prioritizing based on the binarized gene sets). Do you also have issues when using the *_geneprioritization.txt output files from DEPICT for prioritization?
Hi Rebecca,
Thank you this is really helpful! I actually used 5e-8 instead of 1e-5 as the threshold in the .cfg file so this probably explains some of the differences. I am now running DEPICT again with the correct threshold.
My LDSC munge sumstats log looks like this:
Call: ./munge_sumstats.py \ --out scz_2014_EUR_forLDSC \ --merge-alleles /fs/projects/gwas_summary/SMR/lottensm/w_hm3.snplist \ --N 77096.0 \ --sumstats scz_2014_eur_maf_filtered.txt
Interpreting column names as follows: INFO: INFO score (imputation quality; higher --> better imputation) A1: Allele 1, interpreted as ref allele for signed sumstat. P: p-Value A2: Allele 2, interpreted as non-ref allele for signed sumstat. SNP: Variant ID (e.g., rs number) OR: Odds ratio (1 --> no effect; above 1 --> A1 is risk increasing)
Reading list of SNPs for allele merge from /fs/projects/gwas_summary/SMR/lottensm/w_hm3.snplist Read 1217311 SNPs for allele merge. Reading sumstats from scz_2014_eur_maf_filtered.txt into memory 5000000 SNPs at a time. ... done Read 11834939 SNPs from --sumstats file. Removed 10636118 SNPs not in --merge-alleles. Removed 0 SNPs with missing values. Removed 67891 SNPs with INFO <= 0.9. Removed 0 SNPs with MAF <= 0.01. Removed 0 SNPs with out-of-bounds p-values. Removed 0 variants that were not SNPs or were strand-ambiguous. 1130930 SNPs remain. Removed 0 SNPs with duplicated rs numbers (1130930 SNPs remain). Using N = 77096.0 Median value of OR was 1.0, which seems sensible. Removed 0 SNPs whose alleles did not match --merge-alleles (1130930 SNPs remain). Writing summary statistics for 1217311 SNPs (1130930 with nonmissing beta) to scz_2014_EUR_forLDSC.sumstats.gz.
Metadata: Mean chi^2 = 1.793 Lambda GC = 1.587 Max chi^2 = 120.758 1725 Genome-wide significant SNPs (some may have been removed by filtering).
So there are some differences to your log file, it seems that I have about 7,000 more SNPs with non missing beta. Maybe the reason for the differences are that I didn't do the pre-processing step of matching rsID names with chrom:pos:alphabeticalA1:alphabeticalA2. I will try to do that as well and check if that changes the results. Did you use a specific program for matching the rsIDs with the positions?
Yes that is a good point, I will check if the inconsistencies are caused by the gene prioritization step. I have only tried running non-binarized DEPICT for BMI and there the results were quite close to those in the paper (I got 0.3686 and the result in the paper was 0.3332), but I will also try that for schizophrenia and then use the correct DEPICT threshold.
Best, Linda
Hi Rebecca,
my individual DEPICT results for scz with the threshold 1e-5 are as follows:
coefficient normalised, coefficient normalised se
Top 50: my results: 0.5787,0.1045 paper: 0.5875,0.1105
Top 100: my results: 0.6533,0.1102 paper: 0.6855,0.1123
Top 200: my results: 0.6559,0.1106 paper: 0.709,0.1121
Top Z 1.96: my results: 0.6373,0.1007 paper: 0.6421,0.1053
Top Z 2.58: my results: 0.7649,0.1089 paper: 0.7449,0.1145
Top Z 3.29: my results: 0.7353,0.1104 paper: 0.7649,0.1095
So after using the correct threshold the results are now quite close to those in the paper. However, I still have the issue that 2 of the 6 MAGMA values for SCZ (Top Z 2.58 and Top Z 1.96) are outside the +/- SE range. I am now thinking that this could be caused by the pre-processing step of matching rsID names with chrom:pos:alphabeticalA1:alphabeticalA2 that I did not perform. Did you also do that for the BMI summary statistics? Maybe that could then also explain why one value was different for the Benchmarker run with BMI. Did you you match rsID names with chrom:pos:alphabeticalA1:alphabeticalA2 for all SNPs or only for those where no rsID was given in the summary stats?
Best, Linda
I have finished the benchmarker analysis using the geneprioritization.txt output files from DEPICT for prioritization:
BMI:
individual DEPICT: (coefficient norm, coefficient norm se) my results: 0.3686,0.1015 paper: 0.3332,0.0902
individual MAGMA (Top Z 2.58): my results: 0.4279,0.079 paper: 0.3947,0.0752
joint analysis DEPICT v MAGMA Top Z 2.58:
DEPICT: my results: 0.2340,0.1061 paper: 0.1872,0.094
MAGMA: my results: 0.3586,0.0824 paper: 0.29548,0.077655
jackknife: (se, z score, p value) my results: 0.1520,-0.8196,0.4124 paper: 0.1333,-0.81254,0.41648
So for BMI all values are inside the +/- se range!
SCZ:
individual DEPICT: (coefficient norm, coefficient norm se) my results: 0.7977,0.1135 paper: 0.8091,0.114
individual MAGMA (Top Z 2.58): my results: 0.76,0.11 paper: 0.848,0.109
joint model of DEPICT v MAGMA:
DEPICT: my results: 0.549,0.14 paper: 0.5152,0.1292
MAGMA: my results: 0.499,0.14 paper: 0.6242,0.127
jackknife: (se, z score, p value) my results: 0.245,0.204,0.838 paper: 0.2204,-0.4947,0.6208
So also for SCZ the values are inside the range.
However, in the joint analysis for binarized DEPICT v. MAGMA for BMI and SCZ there are still some values outside the range so this could maybe be caused by the filtering of the SCZ GWAS for matching rsID names with chrom:pos:alphabeticalA1:alphabeticalA2 which I did not do yet (if you also did the matching for the BMI GWAS). I tried the matching for the SCZ GWAS with European 1000G phase 3 and phase 1 v 3 data and also filtering for maf > 1%. I am getting 10,337,940 SNPs for phase 3 and 9,890,746 SNPs for phase 1 but in your log file I saw that your GWAS contains 9,409,799 SNPs (961,924 with missing values) so I am not doing the filtering in the same way. Could you please let me know how you did the filtering if you have time and you think this could be the reason for the different results? Then I will rerun the analysis with the filtered GWAS.
Another reason could of course then be the DEPICT binarization as you suggested since for unbinarized DEPICT the values are inside the range. Here is the my 4_prioritize_genes_from_enrichment_results_depict.sh script, do you see some possible error causes in the script?
trait=scz_2014_EUR /homes/lottensm/anaconda2/bin/python ../benchmarker/src/prioritize_genes_from_enrichment_results.py \ --output_directory . \ --trait ${trait} \ --results_file ${trait}_depict_noChr@_genesetenrichment_corrected.txt \ --gene_boundary_file ../benchmarker/data/GPL570ProbeENSGInfo+HGNC_reformatted_noMHC_depictAndGtexGeneIntersection_RF.txt \ --set_definitions_file /fs/projects/gwas_summary/SMR/lottensm/$1_DEPICTGenes_Ensembl_depictAndGtexGeneIntersection.txt \ --set_definitions_label $1 \ --output_label depict \ --percentage_cutoff 10 \ --results_file_set_col Original_gene_set_ID \ --results_file_col_to_sort_on Nominal_P_value \ --sort_direction ascending \ --results_file_separator 'tab'
Best, Linda
Hi Linda,
For my filtering step, I used 1000 Genomes phase 1 data and made unique identifiers with chrom:pos_alphabeticalA1_alphabeticalA2. I made the same unique identifiers in the schizophrenia GWAS. Then, I matched on those unique identifiers and took the rsIDs from 1000kg phase 1, which I then subsequently used as input for munge_sumstats.py. Also, I think for MAGMA I did the same thing but using 1000kg phase 3 for rsIDs, since their genotype data is phase 3. For MAGMA I also removed SNPs with info <= 0.3 before running it.
I’m not sure if these things will make a difference, but it’s worth a shot. I'll try to think more about what else could be causing the differences. To be clear, the only things that seem to still be an issue are the joint binarized DEPICT/MAGMA results from BMI and scz? Is that with all the binarizations or just one or two of them?
Also, I think this is unlikely the source of a problem, but just to be sure -- for binarized MAGMA, the input files for the gene set enrichment should be a gene-gene set matrix containing z-scores, while the prioritization step files should be lists of binarized gene sets (where each gene set has listed gene set members). Just double-checking that the first step involves non-binarized gene sets and the second binarized.
Best, Rebecca
Oh, are you sure you're looking at the right tables for the joint models? I think some of the values you mentioned come from Table S11, which is the one where MAGMA and DEPICT and NetWAS were all modeled jointly (which will affect all the values), but Table S7 has the values from just MAGMA and DEPICT together.
Hi Rebecca,
thank you for the information! Just to be sure, did you also use the filtered GWAS, which you used as the input for munge_sumstats.py, for running DEPICT?
Yes that's correct, the differences only occur for a few binarizations: for SCZ the binarizations Top200 and TopZ_2.58 show some differences and for BMI the binarization TopZ_1.96.
That's a good point! I think I only took the values from Table S11 for comparing unbinarized DEPICT and MAGMA TopZ_2.58, because I did not find another table for those values. Can those values be used for this purpose? In the table there were the jackknife values for the DEPICT v MAGMA analysis listed but I guess since the results are for the joint model of MAGMA, DEPICT and NetWAS the normalised tau values for DEPICT and MAGMA that are listed there are not comparable with those of a joint model of DEPICT and MAGMA?
Best, Linda
I'm going to write you an email and close this issue because it's easier to communicate that way and because the initial issue with the tissue expression file is solved!
Hi Rebecca, thank you sharing your program! I am going through the tutorial and I can't find the tissue_expression_file: data/tissue_expression/GPL570EnsemblGeneExpressionPerTissue_DEPICT20130820_z_withmeshheader_GTExGenesOnly.txt that is needed in the DEPICT cfg file. Could you let me know where the file is located or upload it?
on another note: in the tutorial you are using magma version 1.06b but on the MAGMA website currently only version 1.07b is available. I had to change the --gene-covar part of the magma command in 3_gene_set_enrichment_analysis.sh from: --gene-covar ../data/GPL570-GPL96-GPL1261-GPL1355TermGeneZScores-MGI_MF_CC_RT_IW_BP_KEGG_z_z_GTExGenesOnly.txt onesided=greater to: --gene-covar /fs/projects/gwas_summary/SMR/lottensm/benchmarker/data/GPL570-GPL96-GPL1261-GPL1355TermGeneZScores-MGI_MF_CC_RT_IW_BP_KEGG_z_z_GTExGenesOnly.txt
because onesided=greater is the default in the new version. Do you have experience with using Benchmarker with this new version of MAGMA and know if using v1.07b instead of v1.06b creates quite similar or different results?
Best regards, Linda