Closed Shicheng-Guo closed 1 year ago
Hi Shicheng,
The main reason for the discrepancy is that the data in the Associations tables in the GWAS Catalog UI are manually curated by our curation team from the published journal article (text, tables, supplementary materials), while the summary statistics are submitted directly to us by the authors of the study. There are a few reasons why the results can differ between the two places.
In this particular example, the rs7608892 vs. HDL association in the Associations table was curated from Supplementary Table 5 of the paper, which has a p-value of 2.05E-12.
In the FTP folder for this study (http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90239001-GCST90240000/GCST90239649/) there are actually two versions of the sumstats file. The originally-submitted raw file (with_BF_meta-analysis_AFR_EAS_EUR_HIS_SAS_HDL_INV_ALL_with_N_1 and its associated README.txt) includes three different p-value columns: the p-value from MR-MEGA, the GC-corrected p-value from MR-MEGA, and the p-value from METAL. The GC-corrected p-value does in fact match the one extracted from the paper. However, it seems that when we created a formatted version of the file to match our standard format at that time, it was the p-value from METAL (which is 1.7e-15) that was kept as the main p_value column in the new formatted file (GCST90239649_buildGRCh37.tsv).
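For readers unfamiliar with the term, the sketch below shows the textbook genomic-control (GC) adjustment, which is one reason a corrected p-value can be less extreme than an uncorrected one. It is an illustration only: the thread does not say exactly how MR-MEGA computes its GC-corrected column, and the function names and the choice to floor the inflation factor at 1 are assumptions made for this example.

```python
import math
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364231195724


def p_to_chi2(p):
    """Convert a two-sided p-value to a 1-df chi-square statistic."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile, so z is negative
    return z * z


def chi2_to_p(chi2):
    """Convert a 1-df chi-square statistic back to a p-value."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def genomic_control(p_values):
    """Return (lambda, adjusted p-values) using the standard GC recipe.

    lambda is the observed median chi-square divided by its expected value.
    Flooring lambda at 1.0 is a common convention, assumed here.
    """
    chi2 = [p_to_chi2(p) for p in p_values]
    lam = max(median(chi2) / CHI2_1DF_MEDIAN, 1.0)
    return lam, [chi2_to_p(c / lam) for c in chi2]
```

In real use the inflation factor would be estimated from all variants in the study, not from a handful of values.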
In this case you may wish to use the originally-submitted raw file to get a more complete picture of the analysis, or contact the authors for more details.
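As a rough illustration of that comparison, the sketch below measures how far each p-value column sits from the curated value, in orders of magnitude. The two p-values are the ones quoted in this thread; the column names `p_mrmega_gc` and `p_metal` are hypothetical placeholders, so check the README.txt for the real headers in the raw file.

```python
import math


def log10_gap(p_a, p_b):
    """Return log10(p_a / p_b), i.e. how many orders of magnitude p_a is above p_b."""
    return math.log10(p_a) - math.log10(p_b)


def compare_to_curated(row, curated_p, p_columns):
    """Compare a curated p-value against several p-value columns of one row."""
    return {col: log10_gap(curated_p, float(row[col])) for col in p_columns}


# One row for rs7608892, using the values quoted above.
# Column names are hypothetical, not the raw file's actual headers.
row = {"p_mrmega_gc": "2.05e-12", "p_metal": "1.7e-15"}

gaps = compare_to_curated(row, 2.05e-12, ["p_mrmega_gc", "p_metal"])
for col, gap in gaps.items():
    print(f"{col}: curated p-value is {gap:.2f} orders of magnitude larger")
```

A gap near zero suggests the column is the one the curated value was taken from, while a gap of several units points to a different analysis method.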
Note that these kinds of discrepancies should be less of a problem for more recent studies (i.e. submitted 2023 onwards) as we now have a more up-to-date format and require all submitted summary statistics to conform to it at the point of submission.
I hope that helps!
Best wishes,
Elliot Sollis
GWAS Catalog Curator
Dear Team,
I have a question regarding the GWAS Catalog. I noticed that some of the P-values for the peak variants in the Association table are inconsistent with the P-values in the corresponding full summary statistics (they can differ by several orders of magnitude). Also, some more significant variants near the peak variants are not included in the association table. I'm wondering if there are explanations for these observations?
I've looked at this issue in 3 different studies that I'm interested in, so it doesn't seem to be an isolated case. I'm curious what kinds of detailed processing might be happening between these two data sources.
To give an example, in this study (https://www.ebi.ac.uk/gwas/studies/GCST90239649), rs7608892 vs. HDL is shown as 2e-12 in the table, but it's 1.7e-15 in the downloaded summary statistics. There are also several nearby variants with p-values on the order of 1e-16 and LD with rs7608892 below 0.3.
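For anyone wanting to reproduce this kind of check, here is a minimal sketch that looks up a lead variant in a tab-separated summary statistics file and lists nearby variants with a smaller p-value. Only the `p_value` column name is confirmed in this thread; `variant_id`, `chromosome` and `base_pair_location` are assumed headers, and the function name and the 500 kb default window are arbitrary choices for illustration.

```python
import csv


def nearby_stronger_hits(handle, lead_id, window=500_000,
                         id_col="variant_id", chrom_col="chromosome",
                         pos_col="base_pair_location", p_col="p_value"):
    """Return (lead p-value, nearby rows with a smaller p-value).

    `handle` is an open text file in tab-separated format. All column
    names except p_value are assumptions; adjust them to the file's header.
    """
    # Reads the whole file into memory, which is fine for a sketch but
    # may be too heavy for a full genome-wide file.
    rows = list(csv.DictReader(handle, delimiter="\t"))
    lead = next(r for r in rows if r[id_col] == lead_id)
    lead_pos = int(lead[pos_col])
    lead_p = float(lead[p_col])

    hits = [
        r for r in rows
        if r[chrom_col] == lead[chrom_col]
        and abs(int(r[pos_col]) - lead_pos) <= window
        and float(r[p_col]) < lead_p
    ]
    # Most significant first.
    return lead_p, sorted(hits, key=lambda r: float(r[p_col]))
```

It could be called as `nearby_stronger_hits(open("GCST90239649_buildGRCh37.tsv", newline=""), "rs7608892")`. LD between the lead variant and each hit is not computed here and would need a separate reference panel.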
I look forward to hearing your valuable thoughts on this!
Best,
Shicheng