Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

`format_sumstats` doesn't detect lowercase alleles (with fix) #145

Closed HaglundA closed 1 year ago

HaglundA commented 1 year ago

Let me preface by saying I love the package! It's been incredibly arduous to harmonize gwas summary stats, I'm so glad I've come across this package. I was unaware it existed and so have created many functions for my own use (liftover, detect build, harmonize alleles, convert chrpos to rsid, etc...) but they are only a fraction of what this package offers.

format_sumstats doesn't seem to detect lowercase alleles. Fixed with base R toupper. I've attached two sample datasets (before and after toupper of the Allele1 and Allele2 columns) - in my own harmonization scripts, I've added the following lines to make sure they are uppercase;

    if(stringr::str_detect(gwas$A1,"[[:upper:]]")[1]==FALSE & stringr::str_detect(gwas$A2,"[[:upper:]]")[1]==FALSE){
      gwas$A1<-toupper(gwas$A1)
      gwas$A2<-toupper(gwas$A2)
    }

Hope this helps.

sample_gwas.txt sample_gwas_toupper.txt

Console output, raw GWAS

MungeSumstats::format_sumstats(path=paste0(gwas_dir,"sample_gwas.txt"))

******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deletedupon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ),  or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************

Formatted summary statistics will be saved to ==> /var/tmp/pbs.7110028.pbs/RtmphXWtj0/file18d4bf1bbba14d.tsv.gz

Reading header.

Tabular format detected.

Importing tabular file: /rds/general/user/ah3918/projects/single_cell_eqtl/ephemeral/alex/LOUWAI/SCGSuite/inputs/GWAS/sample_gwas.txt

Standardising column headers.

First line of summary statistics file: 

Chr Pos Allele1 Allele2 rsID    Effect  P   Direction   HetISq  N   

Summary statistics report:
   - 1,000 rows
   - 1,000 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes

Checking for multi-GWAS.

Checking for multiple RSIDs on one row.

Inferring genome build.

Loading reference genome data.

Loading reference genome data.

Inferred genome build: GRCH37

Checking SNP RSIDs.

Checking for merged allele column.

Ensuring all SNPs are on the reference genome.

Loading reference genome data.

Checking for correct direction of A1 (reference) and A2 (alternative allele).

There are 1,000 SNPs where neither A1 nor A2 match the reference genome. These will be removed.

Reordering so first three column headers are SNP, CHR and BP in this order.

Reordering so the fourth and fifth columns are A1 and A2.

Checking for missing data.

Checking for duplicate columns.

Warning message in min(sumstats_dt$P, na.rm = TRUE):
“no non-missing arguments to min; returning Inf”
Checking for duplicate SNPs from SNP ID.

Checking for SNPs with duplicated base-pair positions.

Ensuring all SNPs have N<5 std dev above mean.

Removing 'chr' prefix from CHR.

Making X/Y CHR uppercase.

Checking for bi-allelic SNPs.

N already exists within sumstats_dt.

Sorting coordinates.

Writing in tabular format ==> /var/tmp/pbs.7110028.pbs/RtmphXWtj0/file18d4bf1bbba14d.tsv.gz

Summary statistics report:
   - 0 rows (0% of original 1,000 rows)
   - 0 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 0 chromosomes

Successfully finished preparing sumstats file, preview:

Reading header.

   SNP CHR BP A1 A2 BETA P DIRECTION HETISQ N
1: SNP CHR BP A1 A2 BETA P DIRECTION HETISQ N

Returning path to saved data.

Console output, toupper GWAS allele columns

MungeSumstats::format_sumstats(path=paste0(gwas_dir,"sample_gwas_toupper.txt"))

******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deletedupon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ),  or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************

Formatted summary statistics will be saved to ==> /var/tmp/pbs.7110028.pbs/RtmphXWtj0/file18d4bf3275a500.tsv.gz

Reading header.

Tabular format detected.

Importing tabular file: /rds/general/user/ah3918/projects/single_cell_eqtl/ephemeral/alex/LOUWAI/SCGSuite/inputs/GWAS/sample_gwas_toupper.txt

Standardising column headers.

First line of summary statistics file: 

Chr Pos Allele1 Allele2 rsID    Effect  P   Direction   HetISq  N   

Summary statistics report:
   - 1,000 rows
   - 1,000 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes

Checking for multi-GWAS.

Checking for multiple RSIDs on one row.

Inferring genome build.

Loading reference genome data.

Loading reference genome data.

Inferred genome build: GRCH37

Checking SNP RSIDs.

Checking for merged allele column.

Ensuring all SNPs are on the reference genome.

Loading reference genome data.

Checking for correct direction of A1 (reference) and A2 (alternative allele).

There are 610 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.

Reordering so first three column headers are SNP, CHR and BP in this order.

Reordering so the fourth and fifth columns are A1 and A2.

Checking for missing data.

Checking for duplicate columns.

Checking for duplicate SNPs from SNP ID.

Checking for SNPs with duplicated base-pair positions.

Ensuring all SNPs have N<5 std dev above mean.

Removing 'chr' prefix from CHR.

Making X/Y CHR uppercase.

Checking for bi-allelic SNPs.

41 SNPs are non-biallelic. These will be removed.

N already exists within sumstats_dt.

Sorting coordinates.

Writing in tabular format ==> /var/tmp/pbs.7110028.pbs/RtmphXWtj0/file18d4bf3275a500.tsv.gz

Summary statistics report:
   - 959 rows (95.9% of original 1,000 rows)
   - 959 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes

Successfully finished preparing sumstats file, preview:

Reading header.

           SNP CHR     BP A1 A2    BETA      P DIRECTION HETISQ    N
1:  rs12238997   1 693731  A  G -0.1278 0.5031        +-   72.7 1459
2:  rs72631875   1 705882  G  A -0.1496 0.5819        ++    0.0 1459
3:  rs55727773   1 706368  A  G -0.0230 0.8601        -+   41.5 1458
4: rs149887893   1 714596  T  C -0.1132 0.7465        --    0.0 1459

Returning path to saved data.
Al-Murphy commented 1 year ago

Hey! Delighted to hear you find the package useful and that it extends some of the functionality you made yourself!

I believe this issue is just caused by you using an older version of MSS, what version do you have installed? It runs as expected for me:

> packageVersion("MungeSumstats")
[1] ‘1.6.0’

This is the current bioconductor release version.

The run:

> MungeSumstats::format_sumstats(path=paste0("~/Downloads/","sample_gwas.txt"))

******::NOTE::******
 - Formatted results will be saved to `tempdir()` by default.
 - This means all formatted summary stats will be deleted upon ending the R session.
 - To keep formatted summary stats, change `save_path`  ( e.g. `save_path=file.path('./formatted',basename(path))` ),   or make sure to copy files elsewhere after processing  ( e.g. `file.copy(save_path, './formatted/' )`.
 ******************** 

Formatted summary statistics will be saved to ==>  /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpTSv9nO/file165f14cb3e231.tsv.gz
Importing tabular file: ~/Downloads/sample_gwas.txt
Checking for empty columns.
Standardising column headers.
First line of summary statistics file: 
Chr Pos Allele1 Allele2 rsID    Effect  P   Direction   HetISq  N   
Summary statistics report:
   - 1,000 rows
   - 1,000 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1,000 SNPs using BSgenome::snpsById...
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
    evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

BSgenome::snpsById done in 292 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 293 seconds.
Inferred genome build: GRCH37
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 136 seconds.
2 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 118 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 610 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
SE is not present but can be imputed with BETA & P. Set impute_se=TRUE and rerun to do this.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
524 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpTSv9nO/file165f14cb3e231.tsv.gz
Summary statistics report:
   - 476 rows (47.6% of original 1,000 rows)
   - 476 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 1 chromosomes
Done munging in 14.11 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
          SNP CHR     BP A1 A2    BETA      P DIRECTION HETISQT    N
1: rs12238997   1 693731  A  G -0.1278 0.5031        +-    72.7 1459
2: rs72631875   1 705882  G  A -0.1496 0.5819        ++     0.0 1459
3: rs12029736   1 706368  A  G -0.0230 0.8601        -+    41.5 1458
4: rs12184267   1 715265  C  T -0.0002 0.9994        +-     0.0 1459
Returning path to saved data.
[1] "/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpTSv9nO/file165f14cb3e231.tsv.gz"
HaglundA commented 1 year ago

yes indeed, it's a version issue. It seems I had an older bioconductor version which installed an older version of MungeSumstats with the default installation method. So sorry and thanks for getting back to me so quickly!!