Atkinson-Lab / Tractor

Scripts for implementing the Tractor pipeline
MIT License

Current Version: v1.4.0 (released May 10, 2024)

TRACTOR - Local Ancestry Aware GWAS

Tractor is a specialized tool designed to enhance Genome-Wide Association Studies (GWAS) for diverse cohorts by addressing challenges associated with analyzing admixed populations. Admixed populations are often excluded from genomic studies due to concerns about how to properly account for their complex ancestry.

Tractor facilitates the inclusion of admixed individuals in association studies by leveraging local ancestry, allowing for finer resolution in controlling for ancestry in GWAS, and empowering identification of ancestry-specific loci.

[Figure: Classic GWAS vs. TRACTOR GWAS]

Unlike traditional GWAS methods, Tractor requires local ancestry estimates in its analyses. It employs a multi-step approach involving phasing, local ancestry inference, and regression analysis with ancestral allele dosages. This method aims to improve the accuracy of association analyses in cohorts with diverse ancestries, overcoming issues such as population stratification and variable linkage disequilibrium patterns.
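
The regression step can be illustrated with a small sketch. For a two-way admixed cohort, each variant is modeled with an intercept, the number of anc0 haplotypes at the locus (the local ancestry term, X1), and the alternate-allele dosages carried on anc0 and anc1 haplotypes. The toy data, the helper names `solve_linear_system` and `fit_ols`, and the use of plain least squares without covariates are assumptions made for illustration; this is not the code the Tractor scripts run.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations apply to both sides at once.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(design, y):
    """Ordinary least squares through the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy per-sample values at one variant (hypothetical data):
# (anc0 haplotype count, alt-allele dosage on anc0, alt-allele dosage on anc1)
samples = [
    (0, 0, 0), (0, 0, 1), (0, 0, 2), (1, 0, 0), (1, 1, 0),
    (1, 0, 1), (1, 1, 1), (2, 0, 0), (2, 1, 0), (2, 2, 0),
]
# Noise-free phenotype built from known effects so the fit can be checked.
true_effects = [1.0, 0.5, 2.0, -1.0]  # intercept, LA term, anc0 beta, anc1 beta
design = [[1.0, float(h), float(d0), float(d1)] for h, d0, d1 in samples]
y = [sum(b * x for b, x in zip(true_effects, row)) for row in design]

intercept, la_effect, beta_anc0, beta_anc1 = fit_ols(design, y)
print(la_effect, beta_anc0, beta_anc1)
```

Because the two dosage terms get separate coefficients, an effect that differs between ancestries shows up as different values of `beta_anc0` and `beta_anc1` rather than being averaged into a single estimate.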

Contents

Setup Conda environment

We recommend creating a Conda environment to run Tractor locally. This will install the necessary Python 3 and R dependencies required by the scripts.

conda env create -f conda_py3_tractor.yml
conda activate py3_tractor

Steps for Running Tractor Locally

IMPORTANT: Ensure your genotype data is phased (VCF file) and local ancestry is inferred for the following steps. Refer to our Tractor tutorial for initial setup instructions.

All scripts described in the following steps are available in the scripts directory, and the Hail implementation is in the ipynbs directory.

Step 0 [Optional]: Recovering Haplotypes Disrupted by Statistical Phasing

Statistical phasing can introduce switch errors, as described in Fig. 1 of the Tractor publication. To address this, we provide two scripts, unkink_2way_mspfile.py and unkink_2way_genofile.py. Together, these scripts recover disrupted tracts from the MSP file and the VCF file, rectify the switch errors, and output an unkinked VCF file that can be used in subsequent steps. They are currently implemented for two-way admixed populations only.
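
To make the idea of "unkinking" concrete, here is a toy sketch on per-window ancestry calls for one individual's two strands. It assumes a very simple rule: when both strands flip ancestry at the same window, the flip is treated as a switch error and the strands are swapped from that point on. The function name `unkink_two_way`, the rule, and the data are illustrative assumptions and do not reproduce the logic of unkink_2way_mspfile.py or unkink_2way_genofile.py.

```python
def unkink_two_way(hap_a, hap_b):
    """Return strand-corrected copies of two per-window ancestry tracks."""
    fixed_a, fixed_b = [], []
    swapped = False  # True while later windows must be read from the other strand
    for raw_a, raw_b in zip(hap_a, hap_b):
        a, b = (raw_b, raw_a) if swapped else (raw_a, raw_b)
        if fixed_a:
            prev_a, prev_b = fixed_a[-1], fixed_b[-1]
            # Both strands trade ancestry at this window: assume a switch error.
            if prev_a != prev_b and (a, b) == (prev_b, prev_a):
                swapped = not swapped
                a, b = b, a
        fixed_a.append(a)
        fixed_b.append(b)
    return fixed_a, fixed_b


# Hypothetical ancestry calls (0 or 1) across four windows, with a kink
# between the second and third window.
kinked_a = [0, 0, 1, 1]
kinked_b = [1, 1, 0, 0]
print(unkink_two_way(kinked_a, kinked_b))
```

A genuine ancestry breakpoint on a single strand, such as `[0, 0, 1]` paired with `[0, 1, 1]`, is left unchanged by this rule because the two strands never trade ancestries at the same window.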

Step 1: Extracting Tracts and Ancestry Dosages

This step simultaneously extracts risk allele and local ancestry information, which is a prerequisite for running Tractor GWAS. For the input VCF files, the scripts output risk allele dosages by ancestry and haplotype counts. One dosage file and one haplotype count file are generated for each ancestry component.
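
The relationship between phased genotypes, local ancestry calls, and the two outputs can be sketched for a single variant as follows. The function name `ancestry_dosages`, the in-memory tuple layout, and the toy values are assumptions for illustration; the actual scripts read VCF and MSP files and write per-ancestry text files.

```python
def ancestry_dosages(alleles, ancestries, n_ancestries=2):
    """Split one variant into per-ancestry alt-allele dosages and haplotype counts.

    alleles:    one (hap1, hap2) pair per individual, 0 = REF, 1 = ALT
    ancestries: one (hap1, hap2) pair per individual, ancestry index per strand
    """
    n_samples = len(alleles)
    dosage = [[0] * n_samples for _ in range(n_ancestries)]
    hapcount = [[0] * n_samples for _ in range(n_ancestries)]
    for i, (genotype, local_ancestry) in enumerate(zip(alleles, ancestries)):
        for allele, anc in zip(genotype, local_ancestry):
            hapcount[anc][i] += 1      # this strand belongs to ancestry `anc`
            dosage[anc][i] += allele   # count ALT alleles carried on that strand
    return dosage, hapcount


# Three hypothetical individuals at one variant.
alleles = [(0, 1), (1, 1), (0, 0)]
ancestries = [(0, 1), (0, 0), (1, 1)]
dosage, hapcount = ancestry_dosages(alleles, ancestries)
print(dosage)    # one row per ancestry, one column per individual
print(hapcount)
```

For every individual the haplotype counts across ancestries sum to 2, and the dosage for an ancestry can never exceed that ancestry's haplotype count.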

Step 2: Running Tractor

The Tractor association step runs in R and requires the libraries below. The Conda environment installs them by default; to install them manually, run:

install.packages('optparse')
install.packages('data.table')
install.packages('R.utils')
install.packages('dplyr')
install.packages('doParallel')

Arguments:

--hapdose       [Mandatory] Prefix for hapcount and dosage files.
                    E.g. If you have the following files:
                         filename.anc0.dosage.txt filename.anc0.hapcount.txt
                         filename.anc1.dosage.txt filename.anc1.hapcount.txt
                    use "--hapdose filename".
--phenofile     [Mandatory] Path to the file containing phenotype and covariate data. 
                    Default assumptions: Sample ID column: "IID" or "#IID", Phenotype column: "y".
                    If different column names are used, refer to --sampleidcol and --phenocol arguments.
                    All covariates MUST be included using --covarcollist.
--covarcollist  [Mandatory] Specify column names of covariates in the --phenofile.
                    Only listed columns will be included as covariates.
                    Separate multiple covariates with commas.
                    E.g. --covarcollist age,sex,PC1,PC2.
                    To exclude covariates, specify "--covarcollist none".
--method        [Mandatory] Specify the method to be used: <linear> or <logistic>.
--output        [Mandatory] File name for summary statistics output.
                    E.g. /path/to/file/output_sumstats.txt

--sampleidcol   [Optional] Specify sample ID column name in the --phenofile.
                    Default: "IID" or "#IID"
--phenocol      [Optional] Specify phenotype column name in the --phenofile.
                    Default: "y"
--chunksize     [Optional] Number of rows to read at once from hapcount and dosage files.
                    Use smaller values for lower memory usage.
                    Note: Higher chunksize speeds up streaming but requires more memory.
                    If out-of-memory errors occur, try increasing memory or
                    reducing --chunksize or --nthreads.
                    Default: 10000
--nthreads      [Optional] Specify number of threads to use.
                    Increasing threads can speed up processing but may increase memory usage.
                    Default: 1
--totallines    [Optional] Specify total number of lines in hapcount/dosage files (wc -l *.hapcount.txt).
                    If not provided, it will be calculated internally (recommended).
                    Exercise caution: if --totallines is smaller than the actual lines in the files, 
                    only a subset of data will be analyzed. If larger than the actual lines in the files,
                    an error will occur. Both scenarios are discouraged.
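
The arguments above can be assembled programmatically, for example when launching one run per chromosome. The sketch below only uses the flags documented in this section; the helper name `build_tractor_command`, the script path `scripts/run_tractor.R`, the `Rscript` invocation, and all file names are assumptions, so check the scripts directory for the actual entry point.

```python
MANDATORY = ("hapdose", "phenofile", "covarcollist", "method", "output")
OPTIONAL = ("sampleidcol", "phenocol", "chunksize", "nthreads", "totallines")

# Assumed location of the R script; adjust to match your checkout.
TRACTOR_SCRIPT = "scripts/run_tractor.R"


def build_tractor_command(script=TRACTOR_SCRIPT, **options):
    """Return an argument list suitable for subprocess.run()."""
    missing = [name for name in MANDATORY if name not in options]
    if missing:
        raise ValueError(f"missing mandatory arguments: {', '.join(missing)}")
    unknown = [name for name in options if name not in MANDATORY + OPTIONAL]
    if unknown:
        raise ValueError(f"unknown arguments: {', '.join(unknown)}")
    if options["method"] not in ("linear", "logistic"):
        raise ValueError("--method must be 'linear' or 'logistic'")
    command = ["Rscript", script]
    # Emit flags in a fixed order so the command is reproducible.
    for name in MANDATORY + OPTIONAL:
        if name in options:
            command += [f"--{name}", str(options[name])]
    return command


command = build_tractor_command(
    hapdose="filename",                  # prefix of the dosage/hapcount files
    phenofile="pheno.txt",               # hypothetical phenotype file
    covarcollist="age,sex,PC1,PC2",
    method="linear",
    output="output_sumstats.txt",
    nthreads=4,
)
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the run; leaving `totallines` unset lets the script calculate it internally, which is the recommended behavior.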

Example Run (with Mandatory Arguments)

Example Run (with Optional Arguments)

Output Files (Running Tractor)

Tractor generates ancestry-specific summary statistics. The number of columns in the output file depends on the number of ancestries in the input.

All summary statistic files include:

Example Output File Structure

CHR             Chromosome 
POS             Position 
ID              SNP ID
REF             Reference allele
ALT             Alternate allele
N               Total sample size
AF_anc0         Allele frequency for anc0; sum(dosage)/sum(local ancestry)
LAprop_anc0     Local ancestry proportion for anc0; sum(local ancestry)/(2 * sample size)
beta_anc0       Effect size for alternate alleles inherited from anc0
se_anc0         Standard error for effect size (beta_anc0)
pval_anc0       p-value for alternate alleles inherited from anc0 (NOT -log10(pvalues))
tval_anc0       t-value for anc0
...
LApval_anc0     p-value for the local ancestry term (X1 term in Tractor)
LAeff_anc0      Effect size for the local ancestry term (X1 term in Tractor)
...
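
As a worked example of the two ratio columns, the sketch below computes AF_anc0 and LAprop_anc0 for one variant from per-sample dosage and haplotype count values, following the formulas in the column descriptions. It also converts a p-value to -log10 scale, since the output reports raw p-values. The function names and toy values are assumptions for illustration.

```python
import math


def ancestry_summary(dosage, hapcount):
    """Return (allele frequency, local ancestry proportion) for one ancestry."""
    n_samples = len(hapcount)
    total_haplotypes = sum(hapcount)  # sum(local ancestry)
    # AF: ALT alleles on this ancestry divided by haplotypes of this ancestry.
    af = sum(dosage) / total_haplotypes if total_haplotypes else float("nan")
    # LAprop: haplotypes of this ancestry out of all 2 * N haplotypes.
    la_prop = total_haplotypes / (2 * n_samples)
    return af, la_prop


def neg_log10(pval):
    """Convert a raw p-value, as reported in pval_anc0, to -log10 scale."""
    return -math.log10(pval)


# Four hypothetical samples at one variant.
dosage_anc0 = [0, 2, 0, 1]
hapcount_anc0 = [1, 2, 0, 1]
af_anc0, laprop_anc0 = ancestry_summary(dosage_anc0, hapcount_anc0)
print(af_anc0, laprop_anc0)
```

Here three alternate alleles sit on four anc0 haplotypes, and those four haplotypes make up half of the eight haplotypes in the sample.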

Steps for Running Tractor on Hail

License

The Tractor program is licensed under the MIT License. You may obtain a copy of the License here.

Cite this article

The methodology and utility of Tractor are more fully described in our manuscript. If you use Tractor in your research, please cite the following article:

Atkinson, E.G., Maihofer, A.X., Kanai, M. et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet 53, 195–204 (2021).

For any inquiries, you can contact Elizabeth G. Atkinson at elizabeth.atkinson@bcm.edu.
