chadlaing / Panseq

Pan-genomic sequence analysis
http://lfz.corefacility.ca/panseq
GNU General Public License v3.0
43 stars 14 forks source link
accessory-genome comparative-genomics core-genome genome-sequencing pan-genome snps

Master branch build status

OVERVIEW

Panseq determines the core and accessory regions among a collection of genomic sequences based on user-defined parameters. It readily extracts regions unique to a genome or group of genomes, identifies SNPs within shared core genomic regions, constructs files for use in phylogeny programs based on both the presence/absence of accessory regions and SNPs within core regions.

It also provides a loci selector that efficiently computes the most discriminatory loci from a tab-delimited dataset.

If you find Panseq useful please cite:

Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions. Laing C, Buchanan C, Taboada EN, Zhang Y, Kropinski A, Villegas A, Thomas JE, Gannon VP. BMC Bioinformatics. 2010 Sep 15;11:461.

USAGE

The Panseq standalone script can be accessed from:

lib/panseq.pl

The loci finder can be accessed from:

lib/loci_selector.pl

SETUP

Panseq requires Perl 5.10 or greater, and the following CPAN package to be installed:

Module::Build

This package can be installed from the command-line using cpan -i Module::Build. Following this, to install, and automatically retrieve the required CPAN packages for Panseq, do the following:

perl Build.PL
./Build installdeps

The following free, external programs must also be installed:

and optionally

Testing your installation

perl t/output.t

This will run a test suite against the included test data to ensure that Panseq is configured and working correctly. All tests should pass. The cd-hit tests will only be run if cd-hit is found. Panseq checks for the installed programs on the local path, but they can be optionally specified as follows:

perl t/output.t --blastDirectory '/home/user/local_blast/' --mummerDirectory '/home/user/local_mummer/' --cdhitDirectory '/home/user/cdhit' --muscleExecutable '/home/dir/muscle_executable'

Running Panseq

All the adjustments to Panseq are made by modifying a tab-delimited configuration file, which is specified as the only argument to the script.

perl lib/panseq.pl settings.txt

Below is an example configuration file for panseq.pl:

queryDirectory  /home/phac/panseq/queryLarge/
referenceDirectory  /home/phac/panseq/referenceLarge/
baseDirectory   /home/phac/panseq/output/panseq2test/
numberOfCores   22
mummerDirectory /home/phac/MUMmer3.23/
blastDirectory  /home/phac/ncbi-blast-2.2.29+/bin/
minimumNovelRegionSize  500
novelRegionFinderMode   no_duplicates
muscleExecutable    /usr/bin/muscle3.8.31_i86linux64
fragmentationSize   500
percentIdentityCutoff   85
coreGenomeThreshold 2
runMode     pan

Advanced Options

queryFile   /home/phac/fileOfQuerySequence.fasta
cdhitDirectory  /home/phac/cd-hit/
storeAlleles    1
allelesToKeep   2
nameOrId    name
frameshift  1
overwrite   1
maxNumberResultsInMemory    500
blastWordSize   11
nucB    200
nucC    65
nucD    0.12
nucG    90
nucL    20
cdhit   1
sha1    1

Settings and their [DEFAULTS]

Format of multi-fasta files

Panseq currently only accepts fasta or multi-fasta formatted files. More than one genome may be in a single file, but for all genomes consisting of more than one contig, a distinct identifier must be present in the fasta header of each contig belonging to the same genome. For example, you have just assembled a new genome and are eager to analyze it. Your file consists of a number of contigs, similar to:

>contig000001
ACTGTTT...

>contig000002
CGGGATT...

The unique identifier could be the strain name or anything else of your choosing, but it must be included using the "local" designation: lcl|unique_identifer|. To reformat the above contigs, find and replace all ">" characters in your multi-fasta file with >lcl|unique_identifer|. Thus, if the unique identifier were "strain1", the reformatted contigs would look as follows:

>lcl|strain1|contig000001
ACTGTTT...

>lcl|strain1|contig000002
CGGGATT...

Common database file formats are supported by default, such as ref|, gb|, emb|, dbj|, and gi| and do not need to be modified as described above. For legacy purposes, the name=|unique_identifier| is supported in addition to lcl|uniqueidentifier|. Please note that spaces are not permitted in the unique identifier. Only letters (A-Z, a-z), digits (0-9) and the underscore "" are valid characters.

Description of output files

Detailed explanation of Panseq

Novel Region Finder

The Novel Region Finder currently has two modes implemented: "no_duplicates" and "unique". The no_duplicates mode identifies any genomic regions present in any of the query sequences that are not present in any of the reference sequences, and returns these regions in multi-fasta format. The "unique" mode finds genomic regions that are unique to each of the query sequences and not present in any of the reference sequences.

These comparisons are done using the nucmer program from MUMmer 3, the parameters of which can be adjusted by the user prior to submitting an analysis. A sample of the output from novelRegions.fasta looks as follows:

>lcl|strain1|contig000001_(505..54222)
CCGTACGGGATTA...

Where the name of the contig containing the novel region is listed, followed by the nucleotide positions that were determined to be "novel" based on the comparison run. Lastly, the corresponding novel nucleotide sequence is included.

Core / Accessory Analysis

All of the query strains are used to determine a non-redundant pan-genome. This is done by choosing a seed sequence for the pan-genome and iteratively building the pan-genome by comparing non-seed sequences to the "pan-genome", using the Novel Region Finder described above. For each comparison, sequences not present in the "pan-genome" are added, and the expanded "pan-genome" is used for the comparison against the next sequence. This iteration continues until a non-redundant pan-genome has been constructed.

Following the creation of this pan-genome for the selected sequences, the pan-genome is segmented into fragments of user-defined size. These fragments are subsequently queried against all of the sequences in the query list using blastn. The presence or absence of each pan-genome fragment is determined for each query, based on the Sequence Identity threshold set by the user. Pan-genome fragments present in a minimum number of genomes (determined by the core genome threshold) are aligned using Muscle.

Single nucleotide polymorphisms (SNPs) in these alignments are determined and used to generate a Phylip formatted file of all SNPs for use in downstream phylogenetic analyses. A phylip formatted file of pan-genome fragment presence / absence is also created, as are tab-delimited tables for both SNP and pan-genome fragment presence / absence (snp_table.txt and binary_table.txt, respectively), and a detailed result file listing the names, positions and values of the SNP and binary data (core_snps.txt and pan_genome.txt).

The detailed results file provides data in six columns: Locus Id, Locus Name, Genome, Allele, Start bp and Contig. Locus Id will be a 10-digit number that is a unique identifier for the locus; this number is used in the tabular output for both the SNP and binary data, and can be used for cross-referencing. Locus Name and Genome provide the human-readable names for the locus and the Genome. The Allele columns lists the actual data for the comparison. For pan-genome fragment presence / absence this is binary "0" or "1" data. For the SNP table, this is "A", "C", "T", "G" or "-". Start bp refers to the nucleotide position of the locus in base pairs, for example "45933"; the nucleotide position information is for the start of the fragment for the binary data. Lastly, the Contig column lists the name of the contig the locus is found on, which may differ than the Genome column, for genomes comprised of multi-fasta files.

Running the loci selector

The loci selector takes two command line arguments, with an optional third:

perl loci_selector.pl input_file number_of_loci maximize_pod > output_file

The number_of_loci can be any positive integer, or 'best', which will find the minimum number of loci to provide a unique fingerprint for each data column, if possible. The default value of maximize_pod is 0, but can be set to 1 to disable masking of previously used locus pairs when calculating the points of discrimination. The default is recommended. Output is to STDOUT, and can be redirected to output_file.

Detailed explanation of loci selector

The loci selector constructs loci sets that are maximized with respect to the unique number of fingerprints produced among the input sequences as well as the discriminatory power of the loci among the input sequences. The final loci set is iteratively built, in the following steps, given a tab-delimited table with loci names in the first column, sequence names in the first row, and single character data filling the matrix. Missing data is denoted by the characters '?', '-', or '.' :