frankvogt / vcf2gwas

Python API for comprehensive GWAS analysis using GEMMA
GNU General Public License v3.0
90 stars 29 forks source link
gemma-analysis gwas vcf

vcf2gwas

Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

Contents

About The Project

Performing a genome-wide association study (GWAS) on a dataset can be a laborious task, especially when analysing multiple phenotypes. VCF and input files have to be processed and prepared in the right way depending on the way the analysis is performed and afterwards various operations need to be carried out.

vcf2gwas is a Python-built API for GEMMA, PLINK and bcftools performing GWAS directly from a VCF file as well as multiple post-analysis operations.
Some of the benefits of this pipeline include:

If you use vcf2gwas in your research, please cite us:
vcf2gwas - Python API for comprehensive GWAS analysis using GEMMA

Built with

vcf2gwas was built using Python, bcftools, PLINK and GEMMA.

GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for GWAS. It fits either a univariate linear mixed model, a multivariate linear mixed model or a Bayesian sparse linear mixed model.

The exact versions of Python, bcftools, PLINK and GEMMA used to build the pipeline are available in the environment file.

Getting Started

These instructions will provide an easy way to get vcf2gwas running on your local machine.
vcf2gwas works on macOS and Linux systems when run via conda. When running the vcf2gwas docker image, vcf2gwas runs on all operating systems supported by docker.

Prerequisites

The only requirement is an up to date version of either conda or docker installed on your machine.

Installation via conda

It is a good practice to install the package in a clean environment.
So first create a new environment (you can name it as you like), here with the exemplary name 'myenv':

conda create -n myenv

Next, activate the environment by typing:

conda activate myenv

Now, the vcf2gwas package can be installed:

conda install vcf2gwas -c conda-forge -c bioconda -c fvogt257 

Everything is ready for analysis now.
Optionally, to test the installation and copy the example files to your current working directory, run:

vcf2gwas -v test

Once the analysis is completed, the environment can be deactivated:

conda deactivate

Installation via docker

To download the vcf2gwas docker image, run the following command:

docker pull fvogt257/vcf2gwas

Everything is ready for analysis now.
Optionally, to test the image and copy the example files to your current working directory, run:

docker run -v /path/to/current-working-directory/:/vcf2gwas/ fvogt257/vcf2gwas -v test

Usage

The items below will explain the required format of the input files, the basic usage and available options as well as the structure of the output files.

Input Files

There are multiple files that can be provided as input for vcf2gwas, below you can find an overview over these files.
For more information about the example files provided with vcf2gwas, please refer to the manual.

VCF file:

A VCF file containing the SNP data of the individuals to be examined is required to run vcf2gwas. This file does not need to be altered in any way and can be in either .vcf or .vcf.gz format.

Phenotype file(s):

One or multiple phenotype files can be used to provide the phenotype data for GEMMA. These files need to be in the comma separated .csv format.
In the first column one has to put the IDs of the individuals. These IDs must match the individuals' IDs of the VCF file, since mismatched IDs will be removed from analysis. The remaining columns resemble the phenotypes with the phenotype description as the column name. vcf2gwas will recognize either "-9" or "NA" as missing values and the phenotypes can be either continuous or binary.
Example files to run GEMMA can be found in the input folder (VCF file + corresponding phenotype file with one phenotype). Below is an excerpt of the exemplary phenotype file example.csv:

avrRpm
5837 1
6009 1
6898 1
6900 0
6901 0

Covariate file:

Note: A covariate file can only be used to provide covariates for the GEMMA analysis when running the linear model or the linear mixed model.
The covariate file has to be formatted in the same way as the phenotype file, with individual IDs in the first column and the covariates in the remaining columns with their respective names as column names.

Gene file:

vcf2gwas has GFF files for the most common species built-in. To compare the results, use the species abbreviation with the -gf / --genefile option (see File affiliated options). For more information about the available species, their abbreviations and the reference file used, please refer to the manual.
To compare the results of the GWAS analysis with specific genes, a gene file can be provided as input. The gene file has to be either a GFF3 formatted .gff file or a comma separated .csv file.
If in the .csv format, the file needs at least three columns containing information about chromosome, gene start position and gene stop position. These columns have to be named 'chr', 'start' and 'stop'.
Furthermore it is necessary that the chromosome information is in the same format as the chromosome information in the VCF file, otherwise vcf2gwas won't recognize the information correctly.
Optional columns providing additional information have to be called 'ID', 'name' and 'comment'. Below is an excerpt of an exemplary gene file in the .csv format:

chr start stop ID name comment
1 3644420 3647768 AT1G10920.1 LOV1
1 4144935 4147817 AT1G12220.1 RPS5 Disease resistance protein family
5 17462611 17467448 AT5G43470.1 RPP8 Disease resistance protein family

Relatedness matrix:

To perform GWAS, GEMMA needs a relatedness matrix, which vcf2gwas will calculate by default. Nonetheless one can provide a relatedness matrix manually.

Running vcf2gwas

Once the virtual environment is activated, vcf2gwas can be run on the command-line by specifying the input files and the statistical model chosen for GEMMA. Below is an exemplary command for running a linear mixed model analysis on all phenotypes in example.csv using genotype information from example.vcf.gz, both in the input directory.

vcf2gwas -v example.vcf.gz -pf example.csv -ap -lmm

Note: When running vcf2gwas via docker, replace in every command vcf2gwas with docker run -v /path/to/current-working-directory/:/vcf2gwas/ fvogt257/vcf2gwas:

docker run -v /path/to/current-working-directory/:/vcf2gwas/ fvogt257/vcf2gwas -v example.vcf.gz -pf example.csv -ap -lmm

The available options will be elucidated in the next section.
In the manual, detailed instructions on how to run vcf2gwas and its available options can be viewed.

Available Options:

File affiliated options:

GEMMA affiliated options:

Miscellaneous options:

Output

vcf2gwas will create an output folder with a hierarchical structure consisting of multiple folders containing plots, summaries, GEMMA output files, log files and so on, depending on the selected options.
Below are the QQ-plot and manhattan-plot that are produced when running the test command mentioned in Installation:

Manhattan-plotQQ-plot
The exemplary directory and file structure of the output folder after running a linear mixed model analysis on a single phenotype is shown below:

output/
└── 'model'
    ├── 'phenotype'
    │   ├── QQ
    │   │   └── QQ plot figure (.png)
    │   ├── summary file (.txt)
    │   ├── GEMMA output file (.txt)
    │   ├── GEMMA log file (.txt)
    │   ├── best_p-values
    │   │   ├── top 1% variants (.csv)
    │   │   ├── top 0.1% variants (.csv)
    │   │   └── top 0.01 variants (.csv)
    │   ├── manhattan
    │   │   └── manhattan plot figure (.png)
    │   └── significant SNP summary file (.csv)
    ├── files
    │   └── files_'file'
    │       ├── PLINK BED files (.bed, .bim, .fam, .nosex)
    │       ├── PLINK log file (.log)
    │       ├── GEMMA relatedness matrix (.txt)
    │       └── GEMMA log file (.log.txt)
    ├── logs
    │   └── analysis log file (.txt)
    ├── QC
    │   ├── phenotype QC plot (.png)
    │   └── genotype QC plots (.png)
    ├── summary
    │   ├── summarized top SNPs (.csv)
    │   └── top_SNPs
    │       └── phenotype top SNPs (.csv)
    └── vcf2gwas log file (.txt)

The names of the directories in quotes as well as the file names will vary based on the selected options and the file and phenotype names.

License

Distributed under the terms of the GNU General Public License. See LICENSE for more information.

Contact

If you run into any troubles, please raise an issue on the github page.

Acknowledgements

The GEMMA software was developed by:

Xiang Zhou
Dept. of Biostatistics
University of Michigan

Peter Carbonetto, Tim Flutre, Matthew Stephens, Pjotr Prins and others have also contributed to the development of the GEMMA software.