Table of Contents
rareGWAMA is a flexible software package for imputation based GWAS meta-analysis.
It is developed and maintained by Dajiang Liu's Group.
Liu DJ†, Peloso GM, Zhan X, Holmen O, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, Peters U, Farrall M, Orho-Melander M, Kooperberg C, McPherson R, Watkins H, Willer CJ, Hveem, K, Melander O, Kathiresan S, Abecasis GR†
Meta-analysis of gene-level tests of rare variant association, Nature Genetics, 46, 200–204 (2014)
doi: 10.1038/ng.2852.
The package is hosted on github, which allows installation and update to be very easy. First, make sure you have the mvtnorm
and data.table
packages installed:
install.packages("devtools")
And also, you need the latest version of seqminer:
library(devtools)
devtools::install_github("zhanxw/seqminer")
Then you could use:
install_github("dajiangliu/rareGWAMA")
With library(rareGWAMA)
, your are ready to go!
1.The very basic test is using:
res <- rareGWAMA.single(score.stat.file=study, imp.qual.file =imp.qual, tabix.range="1:11000-58000", alternative="two.sided", col.impqual=5, impQual.lb=0, impQualWeight=FALSE, weight="Npq+impQ",gc=FALSE, rmMultiAllelicSite=TRUE);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- tabix.range: The tabix range, which must be in quote and provided as a string like this: "1:11000-58000";
- alternative: The alternative hypothesis. Default is two.sided;
- col.impqual: The column number for the imputation quality score;
- impQual.lb: The lower bound for the imputation quality. Variants with imputaiton quality less than impQual.lb will be labelled as missing;
- impQualWeight: Using imputation quality as weight;
- rmMultiAllelicSite: Default is TRUE. Multi-allelic sites will be removed from the analyses if set TRUE, and a variable posMulti will be output; The variant site with multiple alleles can be analyzed using rareGWAMA.single.multiAllele function;
2.The out put should be as follows:
head(res$res.formatted))
POS REF ALT AF STAT PVALUE BETA SD N DIRECTION EFFECTIVE_N
[1,] "9:100000040" "G" "A" "2.28e-05" "1.01e+00" "3.16e-01" " 7.29e-01" " 0.72624" "41634" "XXXXXXXXXXXXXXXXXXXXXXXXXXXX++XX" "7925"
[2,] "9:100000056" "T" "TA" "7.07e-01" "1.62e+00" "2.04e-01" "-9.09e-03" " 0.00715" "47219" "XXXXXXXXXXXXXXXX-XXXXXXXXXXX--XX" "46951"
[3,] "9:100000102" "G" "C" "6.91e-04" "8.00e-01" "3.71e-01" "-7.78e-02" " 0.08691" "95869" "+++---+-------++XX-++--X++----XX" "36378"
[4,] "9:100000172" "C" "T" "2.26e-05" "2.25e-01" "6.35e-01" "-1.99e-01" " 0.41982" "125656" "XXXXXXXXXXXXXXXX-XXXXXXXXXXX--X+" "20434"
[5,] "9:100000177" "C" "G" "2.16e-04" "8.38e-01" "3.60e-01" "-1.04e-01" " 0.11335" "179891" "-+---X+XX-+--+-+-X-+-+-X++-+--X+" "89179"
[6,] "9:100000187" "A" "T" "1.13e-04" "6.78e-01" "4.10e-01" "-2.35e-01" " 0.28514" "54235" "+-++-XX+--++--+-XX+++++X----XXXX" "21285"
?rareGWAMA.single
.1.The command should be like:
res <- rareGWAMA.cond.single(score.stat.file=study, imp.qual.file=imp.qual, vcf.ref.file="{$your_path}/ALL.chr9.phase3.genotypes.vcf.gz", candidateVar="9:97018619", knownVar="9:100000172", alternative="two.sided", col.impqual=5, impQual.lb=0, impQualWeight=FALSE, weight="Npq+impQ", gc=FALSE, rmMultiAllelicSite=TRUE);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- vcf.ref.file: the file names of the reference panel file, e.g. could be downloaded from 1000 Genomes Project.
- candidateVar: The tabix range;
- knownVar: The known variant;
- alternative: The alternative hypothesis. Default is two.sided;
- col.impqual: The column number for the imputation quality score;
- impQual.lb: The lower bound for the imputation quality. Variants with imputaiton quality less than impQual.lb will be labelled as missing;
- impQualWeight: Using imputation quality as weight;
- rmMultiAllelicSite: Default is TRUE. Multi-allelic sites will be removed from the analyses if set TRUE, and a variable posMulti will be output; The variant site with multiple alleles can be analyzed using rareGWAMA.single.multiAllele function;
2.The out put should be as follows:
head(res$res.formatted))
POS REF ALT AF STAT PVALUE BETA SD N numStability
[1,] "9:97018619" "T" "C" "8.82e-05" "0.093" "0.76" "0.131" "0.428" "54235" "0"
?rareGWAMA.cond.single
.For more details, please see the SHOWCASE
1.The command should be like:
res.gene <- rareGWAMA.gene(score.stat.file, imp.qual.file=imp.qual.file, vcf.ref.file, refFileFormat="vcf.vbi", anno=anno, annoType=c('Nonsynonymous','Stop_Gain',"Essential_Splice_site"), rvtest='VT', ref.ancestry=ref.ancestry, trans.ethnic=TRUE, study.ancestry=study.ancestry, maf.cutoff=0.01);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- vcf.ref.file: The file names of the reference panel file, e.g. could be downloaded from 1000 Genomes Project.
- anno: Annotation file or list;
- annoType: The annotation types you want to use;
- rvtest: The method you want to use (i.e. 'VT', 'BURDEN', 'SKAT');
- ref.ancestry: The ancestry information for each sample;
- study.ancestry: The ancestry information for each study;
- maf.cutoff: The cutoff for the MAF, could be 0.01, 0.05, 0.001, whatever you want;
2.The out put should be as follows:
GENE RANGE STAT P-VALUE MAF_CUTOFF NUM_VAR TOTAL_MAF POS_VAR N POS_SINGLE_MINP BETA_SINGLE_MINP SD_SINGLE_MINP
MATN2 8:97931370-98033638 2.19 0.353 0.009472856879857 3 0.0123 8:97961468_G/A,8:98018087_G/A,8:98021213_G/A 204783 8:98021213_G/A 0.01813449913
STK3 8:98767360-98767360 0.0894 0.765 0.00582326555223991 1 0.00582 8:98767360_C/T 235921 8:98767360_C/T 0.00505283659608632 0.0168969355386225
VPS13B 8:99121478-99853608 0.305 0.752 0.00630135839199519 2 0.0117 8:99778930_G/A,8:99820031_A/G 283684 8:99778930_G/A 0.0114882148633614 0.016
COX6C 8:99892015-99892015 0.349 0.555 0.00244114819976761 1 0.00244 8:99892015_G/A 231499 8:99892015_G/A -0.0159448806371386 0.0269803627402012
:point_right: All the score statistics files and imputation quality files should be tabix indexed.
Say if you have such files: study1.gz(score statistics files), study2.gz, study1.R2.gz(imputation quality files), study2.R2.gz
in your folder, and already added tabix to your $PATH
environment variable, you could use:
for file in study*;do g=`zcat $file | head -200 | grep -n CHROM | cut -f1 -d":"`; tabix -f -S $g -s 1 -b 2 -e 2 $file; done
to tabix each file.
If you use 1 RVTESTS, the output is ready to go:
CHROM POS REF ALT N_INFORMATIVE AF INFORMATIVE_ALT_AC CALL_RATE HWE_PVALUE N_REF N_HET N_ALT U_STAT SQRT_V_STAT ALT_EFFSIZE PVALUE
1 10177 A AC 2352 0.5 2352 1 0 0 2352 0 1.67496 2.51553 0.264695 0.505508
1 10235 T TA 2352 0 0 1 1 2352 0 0 -0.108472 0.207841 -2.51104 0.601742
1 10352 T TA 2352 0.5 2352 1 0 0 2352 0 0.665562 2.61389 0.0974122 0.799013
1 10539 C A 2352 0 0 1 1 2352 0 0 -0.00020902 0.0626305 -0.0532862 0.997337
An example file has the following format:
CHROM POS REF ALT Rsq
1 10177 A AC 0.00581
1 10235 T TA 0.00396
1 10352 T TA 0.00608
1 10539 C A 0.00154
1 10616 CCGCCGTTGCAAAGGCGCGCCG C 0.02085
1 10642 G A 0.00013reference allele
There are five columns, which are Chromosome #
, Position
, Reference allele
, Alternative allele
and R square quality
. The higher of R square quality
(the 5th column) value, the better of the genotype imputation quality.
The file names of the reference panel file.
You could download the files from 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502.
Also, you could subset the ranges you want by using tabix, with command looks like:
tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768
For more details, please see: How do I get a sub-section of a VCF file?;
chrom pos ref alt af anno gene
112207 1 32247432 G A 0.03425920 Nonsynonymous FAM167B
112208 1 32247598 A G 0.07993410 Exon FAM167B
112209 1 32247709 G T 0.00645166 Intron FAM167B
112210 1 32248405 G C 0.00290383 Nonsynonymous FAM167B
140157 1 41479019 C T 0.09314330 Utr3 EDN2
140158 1 41479369 C T 0.06422470 Utr3 EDN2
1.ref.ancestry.
You should have a original file (i.e. ref.ancestry.ori
) as:
head(ref.ancestry.ori)
fid ancestry study
1 samp1 HIM HCHS
2 samp2 HIM HCHS
3 samp3 HIM HCHS
4 samp4 HIM HCHS
5 samp5 HIM HCHS
6 samp6 HIM HCHS
Then, you use:
ref.ancestry <- cbind(ref.ancestry.ori[,1], paste(ref.ancestry.ori[,2], ref.ancestry.ori[,3], sep=","))
So, the final format should be a matrix:
[,1] [,2]
[1,] "samp1" "HIM,HCHS"
[2,] "samp2" "HIM,HCHS"
[3,] "samp3" "HIM,HCHS"
[4,] "samp4" "HIM,HCHS"
[5,] "samp5" "HIM,HCHS"
[6,] "samp6" "HIM,HCHS"
2.study.ancestry.
Just a vector, as
[1] "AACAC" "AMISH" "ARIC" "BAGS" "CFS" "CHS"
Questions and requests can be sent to Github issue page (link) or Dajiang Liu (dajiang.liu@outlook.com) and Fang Chen(fchen1@hmc.psu.edu)
1: Xiaowei Zhan, Youna Hu, Bingshan Li, Goncalo R. Abecasis, and Dajiang J. Liu
RVTESTS: An Efficient and Comprehensive Tool for Rare Variant Association Analysis Using Sequence Data
Bioinformatics 2016 32: 1423-1426. doi:10.1093/bioinformatics/btw079 (PDF)