SumStatsRehab is a universal GWAS SumStats pre-processing tool. SumStatsRehab takes care of each of the original data points to maximize statistical power of downstream calculations. Currently, the only supported processing which may result in a loss of the original data is liftover, which is a common task, and is optional to the user.
Examples of what the tool does:
diagnose
command),Example of what the tool does not:
SumStatsRehab aims to be a production-grade software, but you may find it to be not a complete data preparation solution for sumstats-ingesting pipelines. Yet, the tool streamlines the development of the data preparation part. This comes out from focusing on solving more complex problems that span many different use-cases, instead of covering only specific use-cases.
requirements.txt
prepare_dbSNPs
command)prepare_dbSNPs
command)clone this repo
git clone https://github.com/Kukuster/SumStatsRehab.git && cd SumStatsRehab
install requirements
pip install -r requirements.txt
install SumStatsRehab as a package
python3 setup.py build
python3 setup.py install
run the command using the following syntax:
SumStatsRehab <command> [keys]
Use diagnose
to check the validity of entries in the GWAS SS file.
Use fix
to restore missing/invalid data in the GWAS SS file.
Use prepare_dbSNPs
to preprocess a given dbSNP dataset into 2 datasets, which are used in the fix
command.
Use sort
to format the input GWAS SS file and sort either by Chr and BP or by rsID.
To use the fix
command to its fullest, a user needs:
prepare_dbSNPs
command.pip install git+https://github.com/Kukuster/SumStatsRehab.git
or, for specific version:
pip install git+https://github.com/Kukuster/SumStatsRehab.git@v1.2.0 --upgrade
This installation method doesn't work with the currently upcoming git protocol security update on github:
Download dbSNP datasets from NCBI, in the target build, in vcf, vcf.gz, bcf, or bcf.gz format. The latest versions are recommended. dbSNP datasets are used to restore the following data: Chr, BP, rsID, OA, EA, EAF. Although only builds 37 and 38 are explicitly supported, build 36 may work as well.
For example, curently latest datasets for build 38 and build 37 can be downloaded here:
https://ftp.ncbi.nih.gov/snp/latest_release/VCF/
A chain file is necessary to perform liftover. If a GWAS SS file is provided in the target build, then a chain file is not used.
see instructions on their websites and/or githubs
recommended bcftools version: 1.11
NOTE: after preprocessing of the necessary dbSNPs is finished, these tools are no longer needed
Run prepare_dbSNPs
using the following syntax:
SumStatsRehab prepare_dbSNPs --dbsnp DBSNP --OUTPUT OUTPUT --gz-sort GZ_SORT --bcftools BCFTOOLS
[--buffer BUFFER]
where:
DBSNP
is the dbSNP dataset in vcf, vcf.gz, bcf, or bcf.gz format referencing build 38 or 37OUTPUT
is the base name for the two output dbSNPs datasetsGZ_SORT
is a path to the gz-sort executableBCFTOOLS
is a path to the bcftools executableBUFFER
is buffer size for sorting (size of presort), supports k/M/G suffix. Defaults to 1G. Recommended: at least 200M, ideally: 4G or moreDepending on the size of the dataset, specified buffer size, and specs of the machine, preprocessing may take somewhere from 30 minutes to 6 hours.
After preprocessing, steps 4 and 5 may be repeated ad-lib.
Config file is used as meta data for GWAS SS file, and contains: 1) columns' indices (indices start from 0) 2) input build slug (such as "GRCh38", "GRCh37", "hg18", "hg19")
This config file has to have the same file name as the GWAS SS file but with an additional .json
extension.
For example, if your GWAS SS file is named WojcikG_PMID_htn.gz
, and the first 5 lines in the unpacked file are:
Chr Position_hg19 SNP Other-allele Effect-allele Effect-allele-frequency Sample-size Effect-allele-frequency-cases Sample-size-cases Beta SE P-val INFO-score rsid
1 10539 1:10539:C:A C A 0.004378065 49141 0.003603676 27123 -0.1041663 0.1686092 0.5367087 0.46 rs537182016
1 10616 rs376342519:10616:CCGCCGTTGCAAAGGCGCGCCG:C CCGCCGTTGCAAAGGCGCGCCG C 0.9916342 49141 0.9901789 27123 -0.1738814 0.109543 0.1124369 0.604 rs376342519
1 10642 1:10642:G:A G A 0.006042409 49141 0.007277901 27123 0.1794246 0.1482529 0.226179 0.441 rs558604819
1 11008 1:11008:C:G C G 0.1054568 49141 0.1042446 27123 -0.007140072 0.03613677 0.84337 0.5 rs575272151
your config file should have the name WojcikG_PMID_htn.gz.json
and the following contents:
{
"Chr": 0,
"BP": 1,
"rsID": 13,
"OA": 3,
"EA": 4,
"EAF": 5,
"beta": 9,
"SE": 10,
"pval": 11,
"INFO": 12,
"N": 6,
"build": "grch37"
}
Notes:
fix
command will attempt to restore them.fix
commandWhen the config file is created, and dbSNP datasets are preprocessed, the chain file is downloaded if necessary, then the fix
command can use all its features.
Although it is normally a part of the execution of the fix
command, a user may choose to manually run diagnose
beforehand.
If diagnose
is ran without additional arguments, it is "read-only", i.e. doesn't write into the file system.
Run diagnose
as follows:
SumStatsRehab diagnose --INPUT INPUT_GWAS_FILE
where INPUT_GWAS_FILE
is the path to the GWAS SS file with the corresponding config file at *.json
as a result, it will generate the main plot: stacked histogram plot, and an additional bar chart plot for each of the bins in the stacked histogram plot.
These plots will pop up in a new matplotlib window.
The stacked histogram maps the number of invalid SNPs against p-value, allowing assessment of the distribution of invalid SNPs by significance. On the histogram, valid SNPs are shown as blue, and SNPs that have issues are shown as red. The height of the red plot over each bin with the red caption represents the proportion of invalid SNPs in the corresponding bin.
A bar chart is generated for each bin of the stacked histogram plot and reports the number of issues that invalidate the SNP entries in a particular bin.
If a Linux system runs without GUI, the report should be saved on the file system. For this, run the command as follows:
SumStatsRehab diagnose --INPUT INPUT_GWAS_FILE --REPORT-DIR REPORT_DIR
where REPORT_DIR
is an existing or not existing directory under which the generated report will be contained. When saved onto a disk, the report also includes a small table with exact numbers of invalid fields and other issues in the GWAS SS file.
Finally, a user may want to decide to run the fix
command.
A user should run the fix
command as follows:
SumStatsRehab fix --INPUT INPUT_GWAS_FILE --OUTPUT OUTPUT_FILE
[--dbsnp-1 DBSNP1_FILE] [--dbsnp-2 DBSNP2_FILE]
[--chain-file CHAIN_FILE]
[--freq-db FREQ_DATABASE_SLUG]
[{--restore,--do-not-restore} {ChrBP,rsID,OA,EA,EAF,beta,SE,pval}+]
where:
INPUT_GWAS_FILE
is the input GWAS SS file with the corresponding .json
config file create at step 4OUTPUT_FILE
is the base name for the fixed file(s) DBSNP1_FILE
is a path to the preprocessed dbSNP #1DBSNP2_FILE
is a path to the preprocessed dbSNP #2CHAIN_FILE
is a path to the chain fileFREQ_DATABASE_SLUG
is a slug of a frequency database contained in the dbSNPexample:
SumStatsRehab fix --INPUT "29559693.tsv" --OUTPUT "SumStatsRehab_fixed/29559693" --dbsnp-1 "dbSNP_155_b38.1.tsv.gz" --dbsnp-2 "dbSNP_155_b38.2.tsv.gz" --chain-file "hg19_to_hg38.chain" --freq-db TOPMED --do-not-restore OA EA
As the normal process of fix
, a report will be generated for the input file, as well as for the file after each step of processing. Depending on the availability of invalid/missing data in the GWAS SS file and the input arguments, a different number of steps may be required for a complete run of the fix
command, with 1 or 2 loops performed on the GWAS SS file. All steps are performed automatically without prompt. The process of fix
ing is represented in logging to the standard output and may take anywhere from 5 minutes to 1.5 hours, depending on the size of the file and the number of steps.
As a result, if 1 loop was required to fix the file, then the resulting file will be available with the suffix .rehabed.tsv
. If 2 loops were required, then the resulting file is available with the suffix .rehabed-twice.tsv
.
The report made with a diagnose
command will be available in a separate directory for:
Please refer to the instructions by running
SumStatsRehab -h
or
SumStatsRehab <command> -h
Config file is a json object which supports the following properties:
"Chr"
"BP"
"rsID"
"EA"
"OA"
"EAF"
"beta"
"SE"
"pval"
"N"
"INFO"
should be evaluated to an integer: the corresponding column index starting from 0.And
"build"
should be evaluated to either one of the following (case insensitive): 'hg38'
, 'GRCh38'
, 'hg19'
, 'GRCh37'
, 'hg18'
, 'ncbi36'
."other"
should be evaluated to an array of integers: indicies of other columns to includeIt is also possible to set EAF
to a weighted average of multiple colums. E.g. if there are separate freq. columns for case and control groups, and average freq. is needed, number of participants in each group will serve as weights for the two columns:
{
...
"EAF": {
"4": 1001,
"5": 2500
},
...
}
During the fix
command, the input sumstats file may undergo sorting. If you want any other columns to be included in the resulting fixed file, add 0-indexed column indices in an array as the "other"
parameter in the config.
E.g.:
{
...
"other": [7, 8, 2],
...
}
With this entry, the resulting file will have 3 additional columns at the end in the order as their indices appear in this config entry.
Commands fix
and sort
always output files in this format. Internally, input files are always converted before undergoing any processing. lib/prepare_GWASSS_columns.py
is responsible for formatting the input.
STANDARD_COLUMN_ORDER
in lib/standard_column_order.py
.
STANDARD_COLUMN_ORDER
The two key functions of SumStatsRehab are validation and restoration, implemented for 9 data categories: chromosome, base pair position, rsID, effect allele, other allele, allele frequency, standard error, beta, p-value. Each column is validated independently of the others, and is regarded to have two possible states: valid and invalid. With the exception of the case where only one of either the chromosome or base pair position is a valid entry, valid entries are always kept and invalid entries are subject to restoration.
While most of the columns are assessed independently, the restoration process often involves multiple columns.
Warning: Restored betas with an unknown sign should not be utilized in any downstream application.
fix
command to turn on and off the restoration of the ambiguous effect alleles (in cases when the dbSNPs present multiple options for ALT)SumStatsRehab.py
loop_fix.py
: make a separate function loopDB1 and loopDB2 that will loop through enough entries in a DB before every resolver and rewrite a "global" object with properties to be fields from the DB: rsID, Chr, BP, alleles, EAF. So resolvers for rsID and ChrBP will be similar to ones for alleles and EAF. Resolvers for these fields then should operate on fields
and that object with fields from a DB. This way a really strong optimization, flexibility, and modularity of resolvers will be achieved. run_all
doesn't have to have resolvers and resolvers_args object to be passed, it can just use the global ones.bash
, cut
, paste
, sort
, awk
, gzip
, gunzip
, head
, tail
, rm
, wc
. Need to do more tests of SumStatsRehab with a different versions of bash
, awk
(including gawk
, nawk
, mawk
. E.g. even though gawk
is default for GNU/Linux, Ubuntu has mawk
by default).pip
diagnose
command script should also generate a bar chart for the bin with missing p-valueamputate
command. Runs diagnosis
and removes all the invalid rows.lib/loop_fix.py
: it has to catch speicifc Exceptions everywherefix
command --verbose
key was not set, then remove the intermediate files.gz-sort
or bcftools
executable.fix
command interface: if --verbose
is set, then forbid the --OUTPUT
argument and allow --OUTPUT-PREFIX
.fix
command was not set to --verbose
, then if the --OUTPUT
path ends with .gz
or .zip
, then compress the resulting file on the output.