Closed mykmal closed 1 year ago
Hey! Thanks for the detailed report. Some things to note:
Moreover, I think that get_genome_build() should fail if no matches are found, rather than defaulting to GRCh38 (which is the current behavior). I agree, I have added this fix - if no match is found in either genome build MSS will now fail.
incorrectly concludes the reference genome for a GWAS summary statistics file that has chromosome numbers coded with the "chr" prefix I have added a check for this in the check genome build function so it should work as expected now but let me know if it doesn't?
Instead, I suggest rearranging the QC steps in format_sumstats() to put all QC steps that don't rely on the GWAS assembly before the call to get_genome_build() I had a look through all our 60+ checks and the vast majority require a genome reference. Moreover, these checks are interdependent so even if a check itself doesn't require a genome reference, a check it relies on might. This gets quite complicated so I have avoided moving any checks for SNP, CHR and BP to above the inference of genome build for now. However, do let me know if you come across any other examples other than the 'CHR' prefix which causes an issue with inferring the genome build and I'll be happy to update.
These changes have been added to MSS v1.9.9. You can install this from Github now or wait a few days for it to go up on the devel version of Bioconductor. Let me know if your issue persists (feel free to reopen this issue if so).
Cheers, Alan.
Thank you for fixing this bug so quickly! The updated version now works correctly on my example GWAS file.
As for rearranging the QC checks, I can see why doing so would be more complicated than I initially imagined. I'll let you know if I run into any other GWAS formats that break get_genome_build()
.
1. Bug description
The
get_genome_build()
function assumes that the CHR, BP, and SNP columns are already properly formatted when attempting to match them with the GRCh37 and GRCh38 reference genomes. However,get_genome_build()
gets called byformat_sumstats()
before those columns are standardized, causing assembly inference to fail for GWAS summary statistics files that don't already match the MungeSumstats format.As an illustration of this issue, the example below shows that
format_sumstats()
incorrectly concludes the reference genome for a GWAS summary statistics file that has chromosome numbers coded with the "chr" prefix. In this example,get_genome_build()
actually fails to find any matching SNPs for either one of the reference genomes.Suggested solution
Adding additional QC checks to
get_genome_build()
is one possible solution, but that would duplicate the steps informat_sumstats()
. Instead, I suggest rearranging the QC steps informat_sumstats()
to put all QC steps that don't rely on the GWAS assembly before the call toget_genome_build()
. Moreover, I think thatget_genome_build()
should fail if no matches are found, rather than defaulting to GRCh38 (which is the current behavior).Console output
Expected behaviour
I manually confirmed that my GWAS file adheres to the GRCh37 genome assembly. MSS should be able to identify this file as GRCh37, and then perform liftover to convert it to GRCh38.
2. Reproducible example
Code
Data
Head of the original GWAS summary statistics file:
Additional troubleshooting
To better understand the issue, I injected
browser()
right after the following lines: https://github.com/neurogenomics/MungeSumstats/blob/329185602f294413c02b06452e4c0f72171dd5e5/R/get_genome_build.R#L140-L150Surprisingly, zero matches were found on either reference genome:
But after removing the "chr" prefix from each entry in the CHR column of
sumstats
, we get the expected behavior:3. Session info