Closed HaglundA closed 1 year ago
Hey! Delighted to hear you find the package useful and that it extends some of the functionality you made yourself!
I believe this issue is just caused by you using an older version of MSS, what version do you have installed? It runs as expected for me:
> packageVersion("MungeSumstats")
[1] ‘1.6.0’
This is the current bioconductor release version.
The run:
> MungeSumstats::format_sumstats(path=paste0("~/Downloads/","sample_gwas.txt"))
******::NOTE::******
- Formatted results will be saved to `tempdir()` by default.
- This means all formatted summary stats will be deleted upon ending the R session.
- To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
********************
Formatted summary statistics will be saved to ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpTSv9nO/file165f14cb3e231.tsv.gz
Importing tabular file: ~/Downloads/sample_gwas.txt
Checking for empty columns.
Standardising column headers.
First line of summary statistics file:
Chr Pos Allele1 Allele2 rsID Effect P Direction HetISq N
Summary statistics report:
- 1,000 rows
- 1,000 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Inferring genome build.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1,000 SNPs using BSgenome::snpsById...
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:base’:
expand.grid, I, unname
BSgenome::snpsById done in 292 seconds.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 293 seconds.
Inferred genome build: GRCH37
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 136 seconds.
2 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 1,000 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 118 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 610 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
SE is not present but can be imputed with BETA & P. Set impute_se=TRUE and rerun to do this.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
524 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
Sorting coordinates.
Writing in tabular format ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpTSv9nO/file165f14cb3e231.tsv.gz
Summary statistics report:
- 476 rows (47.6% of original 1,000 rows)
- 476 unique variants
- 0 genome-wide significant variants (P<5e-8)
- 1 chromosomes
Done munging in 14.11 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 BETA P DIRECTION HETISQT N
1: rs12238997 1 693731 A G -0.1278 0.5031 +- 72.7 1459
2: rs72631875 1 705882 G A -0.1496 0.5819 ++ 0.0 1459
3: rs12029736 1 706368 A G -0.0230 0.8601 -+ 41.5 1458
4: rs12184267 1 715265 C T -0.0002 0.9994 +- 0.0 1459
Returning path to saved data.
[1] "/var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//RtmpTSv9nO/file165f14cb3e231.tsv.gz"
yes indeed, it's a version issue. It seems I had an older bioconductor version which installed an older version of MungeSumstats with the default installation method. So sorry and thanks for getting back to me so quickly!!
Let me preface by saying I love the package! It's been incredibly arduous to harmonize gwas summary stats, I'm so glad I've come across this package. I was unaware it existed and so have created many functions for my own use (liftover, detect build, harmonize alleles, convert chrpos to rsid, etc...) but they are only a fraction of what this package offers.
format_sumstats
doesn't seem to detect lowercase alleles. Fixed with base Rtoupper
. I've attached two sample datasets (before and after toupper of the Allele1 and Allele2 columns) - in my own harmonization scripts, I've added the following lines to make sure they are uppercase;Hope this helps.
sample_gwas.txt sample_gwas_toupper.txt
Console output, raw GWAS
MungeSumstats::format_sumstats(path=paste0(gwas_dir,"sample_gwas.txt"))
Console output, toupper GWAS allele columns
MungeSumstats::format_sumstats(path=paste0(gwas_dir,"sample_gwas_toupper.txt"))