SNPGenie is a collection of Perl scripts for estimating πN/πS, dN/dS, and gene diversity from next-generation sequencing (NGS) single-nucleotide polymorphism (SNP) variant data. Each analysis uses its own script:
WITHIN-POOL ANALYSIS. Use snpgenie.pl
, the original SNPGenie. Analyzes within-sample πN/πS from pooled NGS SNP data. SNP reports (CLC, Geneious, or VCF) must each correspond to a single population/pool, with variants called relative to one reference sequence (one sequence in one FASTA file). Example:
snpgenie.pl --vcfformat=4 --snpreport=<variants>.vcf --fastafile=<ref_seq>.fa --gtffile=<CDS_annotations>.gtf
WITHIN-GROUP ANALYSIS. Use snpgenie\_within\_group.pl
. Traditional within-group analysis of dN/dS among aligned sequences in a FASTA file, similar to MEGA, with automation, sliding windows, parallelism, and overlapping gene features. Example:
snpgenie_within_group.pl --fasta_file_name=<aligned_seqs>.fa --gtf_file_name=<CDS_annotations>.gtf --num_bootstraps=10000 --procs_per_node=8
BETWEEN-GROUP ANALYSIS. Traditional between-group analysis comparing two or more groups (FASTA files) of aligned sequences, such as possible using MEGA, with automation, sliding windows, parallelism, and overlapping gene features. Example:
snpgenie_between_group.pl
UPDATE FOR VCF INPUT: given myriad VCF formats, it is NECESSARY TO SPECIFY THE FORMAT OF THE VCF SNP report using the --vcfformat argument. Format 4 is the most common. For details, see the section on VCF.
New applications of next-generation sequencing (NGS) use pooled samples containing DNA from multiple individuals to perform population genetic analyses. SNPGenie is a Perl program which analyzes the results of a single-nucleotide polymorphism (SNP) caller to calculate nucleotide diversity (including its nonsynonymous and synonymous partitions, πN and πS) and gene diversity. Variant calls are typically present in annotation tables and assume that the pooled DNA sample is representative of some population of interest. For example, if one is interested in determining the nucleotide diversity of a virus population within a single host, it might be appropriate to sequence the pooled nucleic acid content of the virus in a blood sample from that host. Comparing πN and πS for a gene product, or comparing gene diversity at polymorphic sites of different types, may help to identify genes or gene regions subject to positive (Darwinian) selection, negative (purifying) selection, or random genetic drift. SNPGenie also includes such features as minimum allele frequency trimming (see Options), and can be combined with upstream applications such as maximum-likelihood SNP calling techniques (e.g., see Lynch et al. 2014). For additional background, see Nelson & Hughes (2015) in the References.
SNPGenie is a command-line application written in Perl, with no additional dependencies beyond the standard Perl package (i.e., simply download and run). Input includes:
All input files must be placed in the same directory. Then simply run the appropriate SNPGenie script from the command line (see Options if you wish for more control). To do this, first download the snpgenie.pl script and place it in your system’s $PATH, or simply in your working directory. Next, place your SNP report(s), FASTA (.fa/.fasta), and GTF (.gtf) files in your working directory. Open the command line prompt (or Terminal) and navigate to the directory containing these files using the "cd" command in your shell. Finally, execute SNPGenie by typing the name of the script and pressing the \<RETURN> (\<ENTER>) key:
snpgenie.pl
Further details on input and options are below.
Only one reference sequence must be provided in a single FASTA (.fa/.fasta) file (i.e., one file containing one sequence). Thus, all SNP positions in the SNP reports are called relative to the single reference sequence. Because of this one-sequence stipulation, a script has been provided to split a multi-sequence FASTA file into its constituent sequences for multiple or parallel analyses; see Additional Scripts below.
The Gene Transfer Format (.gtf) file is tab (\t)-delimited, and must include non-redundant records for all CDS elements (i.e., open reading frames, or ORFs) present in your SNP report(s). Thus, if multiple records exist with the same gene name (e.g., different transcripts), ONLY ONE may be included in your analysis at a time. The GTF should also include any ORFs which do not contain any variants (if they exist). SNPGenie expects every coding element to be labeled as type "CDS", and for its product name to follow a "gene_id" tag. In the case of CLC and Geneious SNP reports, this name must match that present in the SNP report. If a single coding element has multiple segments (e.g., exons) with different coordinates, simply enter one line for each segment, using the same product name; they will be concatenated. Finally, for sequences with genes on the reverse '-' strand, SNPGenie must be run twice, once for each strand, with the minus strand's own set of input files (i.e., the '-' strand FASTA, GTF, and SNP report); see A Note on Reverse Complement ('-' Strand) Records below. For more information about GTF, please visit The Brent Lab. To convert a GFF file to GTF format, use a tool like gffread. A short GTF example follows:
reference.gbk CLC CDS 5694 8369 . + 0 gene_id "ORF1";
reference.gbk CLC CDS 8203 8772 . + 0 gene_id "ORF2";
reference.gbk CLC CDS 1465 4485 . + 0 gene_id "ORF3";
reference.gbk CLC CDS 5621 5687 . + 0 gene_id "ORF4";
reference.gbk CLC CDS 7920 8167 . + 0 gene_id "ORF4";
reference.gbk CLC CDS 5395 5687 . + 0 gene_id "ORF5";
reference.gbk CLC CDS 7920 8016 . + 0 gene_id "ORF5";
reference.gbk CLC CDS 4439 5080 . + 0 gene_id "ORF6";
reference.gbk CLC CDS 5247 5549 . + 0 gene_id "ORF7";
reference.gbk CLC CDS 4911 5246 . + 0 gene_id "ORF8";
Each SNP report should contain variant calls for a single pooled-sequencing run (i.e., sample from population), or alternatively a summary of multiple individual sequences. Its site numbers should refer to the exact reference sequence present in the FASTA input. SNP reports must be provided in one of the following formats:
A CLC Genomics Workbench SNP report must include the following columns to be used with SNPGenie, with the unaltered CLC column headers:
In addition to containing the aforementioned columns, the SNP report should ideally be free of thousand separators (,) in the Reference Position, Count, and Coverage columns (default .txt output). The Frequency must remain a percentage (default .txt output). Finally, the user should verify that the reading frame in the CLC input file is correct.
A Geneious SNP report must include the following columns to be used with SNPGenie, with the unaltered Geneious column headers:
In addition to containing the aforementioned columns, the SNP report should ideally be free of extraneous characters such as thousand separators (,). The Variant Frequency must remain a percentage (default .txt output). Finally, the user should verify that the reading frame in the Geneious output is correct.
A VCF SNP report must include the following columns, with the unaltered VCF column headers:
Because VCF files vary widely in format, SNPGenie now requires users to specify exactly which VCF format is being used with the --vcfformat argument. New formats are being added on a case-by-case basis; users should contact the author to have new formats incorporated. FORMAT 4 is the most common, and necessary if your file contains multiple \<SAMPLE> columns. Note that these formats are specific to SNPGenie; they do not refer to VCF version numbers. All supported formats are:
FORMAT (1): --vcfformat=1. Multiple individual genomes have been sequenced separately, with SNPs summarized in the VCF file. SNPGenie will require the following in the INFO column:
FORMAT (2): --vcfformat=2. Variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). SNPGenie will require the following in the INFO column:
FORMAT (3): --vcfformat=3. Like format 2, variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). Unlike format 2, SNPGenie will require the following in the INFO column:
FORMAT (4): --vcfformat=4. THE MOST COMMON FORMAT. Like formats 2 and 3, variants have been called from a pooled deep-sequencing sample (genomes from multiple individuals sequenced together). For this format, SNPGenie requires AD and DP data in the FORMAT and \<SAMPLE> columns, where FORMAT is the final column header before the \<SAMPLE> column(s) begins. The order of the data keys in the FORMAT column and the data values in the \<SAMPLE> columns must match.
As usual, you will want to make sure to maintain the VCF file's features, such as TAB(\t)-delimited columns. Unlike some other formats, the allele frequency in VCF is a decimal.
Many large genomes have coding products on both strands. In this case, SNPGenie must be run twice: once for the '+' strand, and once for the '-' strand. This requires FASTA, GTF, and SNP report input for the '-' strand. I provide a script for converting input files to their reverse complement strand in the Additional Scripts below. Note that, regardless of the original SNP report format, the reverse complement SNP report is in a CLC-like format that SNPGenie will recognize. For both runs, the GTF should include all products for both strands, with products on the strand being analyzed labeled '+' and coordinates defined with respect to the beginning of that strand's FASTA sequence. Also note that a GTF file containing only '-' strand records will not run; SNPGenie does calculations only for the products on the current + strand, using the '-' strand products only to determine the number of overlapping reading frames for each variant.
To alter how SNPGenie works, the following options may be used:
For example, if you wanted to specify a minimum allele frequency of 1% and specify all three input files, you could run the following:
snpgenie.pl --minfreq=0.01 --snpreport=mySNPreport.txt --fastafile=myFASTA.fa --gtffile=myGTF.gtf
SNPGenie creates a new folder called SNPGenie_Results within the working directory. This contains the following TAB-delimited results files:
SNPGenie_parameters.txt, containing the input parameters and file names.
SNPGenie_LOG.txt, documenting any peculiarities or errors encountered. Warnings are also printed to the Terminal (shell) window.
site_results.txt, providing results for all polymorphic sites. Note that, if the population is genetically homogenous at a site, even if it differs from the reference or ancestral sequence, it will not be considered polymorphic. Also keep in mind that columns are sorted by product first, then site number, with noncoding sites at the end of the file. Columns are:
codon_results.txt, providing results for all polymorphic sites. Columns are:
product_results.txt, providing results for all CDS elements present in the GTF file for the '+' strand. Columns are:
population_summary.txt, providing summary results for each population's sample (SNP report) with respect to the '+' strand. Columns are:
sliding_window_length\<Length>_results.txt, containing codon-based results over a sliding window, with a default length of 9 codons.
The script snpgenie_within_group.pl estimates mean dN and dS within a group of aligned sequences in a single FASTA file. Designed for use with sequence data that outsizes what can be handled by other software platforms, this script invokes parallelism for codons and bootstrapping, and as a result has one dependency: the Parallel::ForkManager module. If this isn't already installed, please use CPAN to install it before running the script. Click here to learn how.
Note that within-group mean dN and dS are mathematically equivalent to πN and πS when groups contain sequences sampled from a single population.
Provide the script with the following arguments:
The format for using this script is:
snpgenie_within_group.pl --fasta_file_name=<aligned_sequences>.fa --gtf_file_name=<coding_annotations>.gtf --num_bootstraps=<bootstraps> --procs_per_node=<procs_per_node>
For example, on a node with access to 64 processors in a high performance computing environment, you might submit a job including the following (DON'T run this from the head node!):
snpgenie_within_group.pl --fasta_file_name=my_virus_genomes.fa --gtf_file_name=my_virus_genes.gtf --num_bootstraps=10000 --procs_per_node=64
Results files will be generated in the working directory. For a detailed explanation of the results, consult the output descriptions above (however, note that not all files and columns generated for the within-pool analysis will be present for the within-group analysis).
To execute bootstrapping and/or sliding window analyses, run the processing script SNPGenie_sliding_windows.R (see Sliding Windows and Bootstrapping).
SNPGenie_sliding_windows.R replaces the bootstrapping and sliding window script snpgenie_within_group_processor.pl, which is no longer supported. Its (deprecated) description is below:
- --codon_file: the name or path of the within_group_codon_results.txt file produced using snpgenie_within_group.pl.
- --sliding_window_size: the size of the sliding windows to be calculated, in codons.
- --sliding_window_step: the number of codons to advance forward for each subsequent sliding window.
- --num_bootstraps: OPTIONAL. the number of bootstrap replicates to be performed for calculating the standard error of dN-dS for genes and sliding windows. DEFAULT: 0.
The format for using the script is:
snpgenie_within_group_processor.pl --codon_file=within_group_codon_results.txt --sliding_window_size=10 --sliding_window_step=1 --num_bootstraps=10000
A typical workflow thus includes one run of snpgenie_within_group.pl to obtain codon results, followed a run of SNPGenie_sliding_windows.R (previously snpgenie_within_group_processor.pl) for each coding product to obtain sliding window results with bootstrap results.
The script snpgenie_between_group.pl estimates mean dN and dS between two or more groups of sequences in FASTA format. Designed for use with sequence data that outsizes what can be handled by other software platforms, this script invokes parallelism for codons and bootstrapping, and as a result has one dependency: the Parallel::ForkManager module. If this isn't already installed, please use CPAN to install it before running the script. Click here to learn how.
Just run the script in a directory containing two or more FASTA files, each containing a group of aligned sequences which are also aligned to one another across groups (files), ending with a .fa, .fas, or .fasta extension. In other words, each FASTA file contains one group. For example, one might wish to compare sequences of the West Nile Virus 2K gene derived from three different mosquito vector species. In such a case, all sequences would first be aligned. Next, all sequences from mosquito species 1 would be placed together in one FASTA file; all sequences from mosquito species 2 would be placed in a second FASTA file; and all sequences from mosquito species 3 would be placed in a third FASTA file.
SNPGenie automatically detects all FASTA files in the working directory. The GTF file (see Gene Transfer Format for details) must be specified at the command line, along with an optional the number of bootstraps and processes:
snpgenie_between_group.pl --gtf_file=<CDS_annotations>.gtf --num_bootstraps=<integer> --procs_per_node=<integer>
Results files will be generated in the working directory. For a detailed explanation of the results, consult the output descriptions above (however, note that not all files and columns generated for the within-pool analysis will be present for the within-group analysis).
To execute bootstrapping and/or sliding window analyses, run the processing script SNPGenie_sliding_windows.R (see Sliding Windows and Bootstrapping).
SNPGenie_sliding_windows.R replaces the bootstrapping and sliding window script snpgenie_between_group_processor.pl, which is no longer supported. Its (deprecated) description is below:
- --between_group_codon_file: the name or path of the between_group_codon_results.txt file produced using snpgenie_between_group.pl.
- --sliding_window_size: the size of the sliding windows to be calculated, in codons.
- --sliding_window_step: the number of codons to advance forward for each subsequent sliding window.
- --codon_min_taxa_total: the minimum number of total distinct sequences or taxa across both groups having defined (non-gap) codons at a codon position. If this criterion is not met for the groups being compared, the codon position is ignored.
- --codon_min_taxa_group: the minimum number of distinct sequences or taxa in a single groups having defined (non-gap) codons at a codon position. If this criterion is not met for either of the groups being compared, the codon position is ignored.
- --num_bootstraps: OPTIONAL. the number of bootstrap replicates to be performed for calculating the standard error of dN-dS for genes and sliding windows. DEFAULT: 0.
- --procs_per_node: OPTIONAL. The number of parallel processes to be invoked for processing codons and bootstraps (speeds up SNPGenie if high performance computing resources are available). DEFAULT: 1.
The format for using the script is:
snpgenie_between_group_processor.pl --between_group_codon_file=<between_group_codon_results>.txt --sliding_window_size=10 --sliding_window_step=1 --codon_min_taxa_total=5 --codon_min_taxa_group=2 --num_bootstraps=10000 --procs_per_node=8
A typical workflow thus includes one run of snpgenie_between_group.pl to obtain codon results, followed a run of SNPGenie_sliding_windows.R (previously snpgenie_between_group_processor.pl) for each coding product to obtain sliding window results with bootstrap results.
Bootstrapping and/or sliding window analyses can be computed for any codon results file produced by snpgenie.pl
, snpgenie_within_group.pl
, or snpgenie_between_group.pl
. The codon results file must (1) contain one codon per line; (2) contain the columns N_diffs, N_sites, S_diffs, and S_sites; and (3) contain results for only one gene/product (no redundant codon numbers).
To run this analysis, use processing script SNPGenie_sliding_windows.R with the following unnamed arguments:
For example, to compute dN/dS or πN/πS in sliding windows of 40 codons with a step size of 1 codon; 1,000 bootstrap replicates per window; requiring a minimum of 100 defined codons (e.g., unambiguous reads) per position; using no multiple hits correction; and parallelizing over 6 CPUs, call the following:
SNPGenie_sliding_windows.R codon_results_oneProduct.txt N S 40 1 1000 100 NONE 6 > SNPGenie_sliding_windows_oneProduct.out
The output of this script provides the following TAB-delimited columns:
SNPGenie calculates gene and nucleotide diversities for sites in a protein-coding sequence. Nucleotide diversity, π, is the average number of nucleotide variants per nucleotide site for all pairwise comparisons. To distinguish between nonsynonymous and synonymous differences and sites, it is necessary to consider the codon context of each nucleotide in a sequence. This is why the user must submit the starting and ending sites of the coding regions in the .gtf file, along with the reference FASTA sequence file, so that the numbers of nonsynonymous and synonymous sites for each codon may be accurately estimated by a derivation of the Nei-Gojobori (1986) method.
SNPGenie first splits the coding sequence into codons, each of which contains 3 sites. The software then determines which are nonsynonymous and synonymous by testing all possible polymorphisms present at each site of every codon in the sequence. Because different nucleotide variants at the same site may lead to both nonsynonymous and synonymous polymorphisms, fractional sites may occur (e.g., only 2 of 3 possible nucleotide substitutions at the third position of AGA cause an amino acid change; thus, that site is considered 2/3 nonsynonymous and 1/3 synonymous). Next, the SNP report is consulted for the presence of variants to produce a revised estimate of the number of sites based on the presence of variant alleles. Variants are incorporated through averaging, weighted by their frequency. Although it is relatively rare, high levels of sequence variation may alter the number of nonsynonymous and synonymous sites in a particular codon.
SNPGenie calculates the number of nucleotide differences for each codon in each ORF as follows. For every variant in the SNP Report, the number of variants is calculated as the product of the variant’s relative frequency and the coverage at that site. For each variant nucleotide (up to 3 non-reference nucleotides), the number of variants is stored, and their sum is subtracted from the coverage to yield the reference’s absolute frequency. Next, for each pairwise nucleotide comparison at the site, it is determined whether the comparison represents a nonsynonymous or synonymous change. If the former, the product of their absolute frequencies contributes to the number of nonsynonymous pairwise differences; if the latter, it contributes to the number of synonymous pairwise differences. When comparing codons with more than one nucleotide difference, all possible mutational pathways are considered, per the method of Nei & Gojobori (1986). The sum of pairwise differences is divided by the total number of pairwise comparisons at the codon (nC2, where n = coverage) to yield the mean number of differences per site of each type. This is calculated separately for nonsynonymous and synonymous comparisons. Calculating nucleotide diversity codon-by-codon enables sliding-window analyses that may help to pinpoint important nucleotide regions subject to varying forms of natural selection.
For further background, see Nelson & Hughes (2015) in the References.
Previous versions of the additional scripts, provided to help in the preparation of SNPGenie input, have moved to the Evolutionary Bioinformatics Toolkit (EBT) repository. Here, you can find the following scripts for use in converting files to the reverse complement strand, to be run as follows:
vcf2revcom.pl <file_name>.vcf <seq_length>
gtf2revcom.pl <file_name>.gtf <seq_length>
fasta2revcom.pl <seq>.fa
Tajima.D()
to calculate Tajima's D (Tajima 1989). (WARNING: not extensively tested!) For example, for n=1000 sequences with S=15 segregating sites and total nucleotide diversity π=0.05, run as follows:Tajima.D(1000, 15, 0.05)
"coverage is 0" WARNING. SNPGenie will run even when some sites lack sequencing coverage. Sometimes this presents no problems, e.g., if all sequences lack coverage in the same regions or at terminal sites. However, if coverage differs between samples, this will affect diversity estimates and will need to be accounted for: one sample might have relatively low diversity simply because numerous sites lack coverage. Such a difference is not biological, but rather a result of incomplete data. As always, it is up to the user to understand their own data, and to only supply samples that are biologically meaningful.
Using Windows? Unfortunately, SNPGenie was written for Unix systems, including Mac, which have Perl installed by default. Windows sometimes doesn't, but getting Perl installed is as simple as following these three-minute download instructions, and with any luck the program will work! Just open the Windows Command Prompt, and remember to type "perl" first when you run SNPGenie, i.e., type "perl snpgenie.pl". Unfortunately, if this does not work, we are unable to provide any further Windows support.
Try preceding the whole command line with "perl" to make sure SNPGenie is being treated as a script. For example:
perl snpgenie.pl --minfreq=0.01 --snpreport=mySNPreport.txt --fastafile=myFASTA.fa --gtffile=myGTF.gtf
Make sure that the first line of the script contains the correct path to your machine's copy of Perl. Right now SNPGenie assumes it's located at /usr/bin/perl. You can find out if this matches your machine by typing which perl at the command line. If it's a different path than the one above, copy and paste your machine's path after the #! at the beginning of the SNPGenie script.
Make sure that the script is executable at the command line, as follows:
chmod +x snpgenie.pl
Make sure end-of-line characters are in Unix LF (\n) format. Although an attempt has been made to ensure SNPGenie can accept Windows CRLF (\r\n) or Mac CR (\r) formats, these can sometimes introduce unexpected problems causing SNPGenie to crash or return all 0 values. This is often a problem when saving Excel spreadsheets on Mac, which defaults to a Mac newline format. Trying changing the newline character to Unix LF using a free program such as TextWrangler.
Make sure CLC files are tab (\t)-delimited.
Make sure Geneious files are comma-separated.
Make sure the SNP reports contain all necessary columns. (See sections on Input above.)
Make sure the Frequency (CLC) or Variant Frequency (Geneious) SNP report columns contain a percentage, not a decimal (e.g., 11.0% rather than 0.11).
Make sure the SNP calling frame correct (e.g., do the codons for a product in the SNP report begin with ATG and end with TAA, TAG, or TGA?).
Make sure the product coordinates in the GTF file are correct. (You might use a free program such as MEGA to check that the CDS coordinates begin with ATG and end with TAA, TAG, or TGA, if that's what you expected.)
When using this software, please refer to and cite:
Nelson CW, Moncla LH, Hughes AL (2015) SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Bioinformatics 31(22):3709-11. doi: 10.1093/bioinformatics/btv449.
and this page:
SNPGenie
was written with support from a University of South Carolina Presidential Fellowship (2011-2015), a National Science Foundation (NSF) Graduate Research Fellowship (GRF; 2013-2016), and an NSF East Asian and Pacific Summer Institutes (EAPSI; 2013) Fellowship. SNPGenie
has been further maintained by a Gerstner Scholars Fellowship from the Gerstner Family Foundation at the American Museum of Natural History (2016-2019). The logo image was designed by Elizabeth Ogle (2015).
If you have questions about using SNPGenie, please click on the "Issues" tab at the top of this page and begin a new thread, so that others might benefit from the discussion.
Other correspondence should be addressed to Chase W. Nelson: