bioinfologics / satsuma2

FFT cross-correlation based synteny aligner, (re)designed to make full use of parallel computing
41 stars 13 forks source link

Satsuma2

Satsuma2 is an optimised version of Satsuma, a tool to reliably align large and complex DNA sequences providing maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accomodate the billions of base pairs in vertebrate genomes). Satsuma2 adresses these issues through three novel strategies:

Satsuma2 also interfaces with MizBee, a multi-scale synteny browser for exploring conservation relationships in comparative genomics data (http://www.cs.utah.edu/~miriah/mizbee/Overview.html).

Satsuma2 is implemented in C++ on Linux.

Licensing

Satsuma2 is free software: you can redistribute it and/or modify it under the terms of the Lesser GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Lesser GNU General Public License for more details.

Citing Satsuma2

Please contact us to discuss how you can cite Satsuma2.

Installation

Download the source code from https://github.com/bioinfologics/satsuma2.git and compile it using CMake v3.3+. To run, Satsuma2 requires GCC v5.2+. The binaries are generated in the bin/ directory.

NOTE: if you encounter the error "... undefined reference to `pthread_create'" during compilation, add the flag -pthread to CMakeLists.txt, i.e. change:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread -std=c++14 -O3 -w") to: set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -lpthread -pthread -std=c++14 -O3 -w")

We are experimenting with providing pre-compiled binaries which can be downloaded from the releases section. Once downloaded, you should be able to run these directly.

Quick start

  1. Set the SATSUMA2_PATH environment variable to point to the directory containing the binaries.
  2. Configure satsuma_run.sh to match your job submission system (LSF, PBS, SLURM etc.)
  3. Run SatsumaSynteny2 with default parameters (uses one single-threaded slave);
./SatsumaSynteny2 -q query.fa -t target.fa -o output_dir

You can test the installation using the test dataset provided (dog vs. human) by running the test_SatsumaSynteny2 script.

Running Satsuma2

As Satsuma2 calls other executables (HomologyByXCorr, MergeXCorrMatches etc.), you need to set an environment variable to tell the software where to find the binaries;

export SATSUMA2_PATH=/path/to/binaries

Running SatsumaSynteny2 with no command-line arguments shows the available parameters;

$ ./SatsumaSynteny2

./SatsumaSynteny2: Wrapper around high-sensitivity alignments.

Available arguments:

-q<string> : query fasta sequence
-t<string> : target fasta sequence
-o<string> : output directory
-l<int> : minimum alignment length (def=0)
-t_chunk<int> : target chunk size (def=4096)
-q_chunk<int> : query chunk size (def=4096)
-sl_mem<int> : memory requirement for slaves (Gb) (def=100)
-do_refine<bool> : refinement steps (def=0)
-min_prob<double> : minimum probability to keep match (def=0.99999)
-cutoff<double> : signal cutoff (def=1.8)
-prob_table<bool> : approximate match prob using a table lookup in slaves (def=0)
-min_matches<int> : minimum matches per target to keep iterating (def=20)
-m<int> : number of jobs per block (def=4)
-slaves<int> : number of processing slaves (def=1)
-threads<int> : number of working threads per processing slave (def=1)
-km_mem<int> : memory required for kmatch (Gb) (def=100)
-km_sync<bool> : run kmatch jobs synchronously (def=1)
-seed<string> : loads seeds and runs from there (kmatch files prefix) (def=)
-min_seed_length<int> : minimum length for kmatch seeds (after collapsing) (def=24)
-max_seed_kmer_freq<int> : maximum frequency for kmatch seed kmers (def=1)
-old_seed<string> : loads seeds and runs from there (xcorr*data) (def=)
-pixel<int> : number of blocks per pixel (def=24)
-nofilter<bool> : do not pre-filter seeds (slower runtime) (def=0)
-dups<bool> : allow for duplications in the query sequence (def=0)
-dump_cycle_matches<bool> : dump matches on each cycle (for debug/testing) (def=0)

Required parameters are query FASTA (-q), target FASTA (-t) and output directory (-o). The query and target sequences are chunked (based on the -t_chunk and -q_chunk parameters) then KMatch is used to detect aligning regions between chunks. The number of chunks generated depends on the length of your query and target sequences. The amount of memory reserved for KMatch can be modified using the -km_mem parameter which defaults to 100Gb.

SatsumaSynteny2 despatches slave processes to compare the chunks which run asynchronously. The number of slaves, threads per slave and memory limit per slave are specified using the -slaves, -threads and -sl_mem parameters. The default is to run one single-threaded slave using 100Gb of memory. Slaves can be run on a single machine or submitted via a job submission system such as LSF, PBS or SLURM and the satsuma_run.sh file is used by SatsumaSynteny2 to start the slaves. Before running SatsumaSynteny2, you need to modify this file to suit your HPC environment by commenting out the lines you don't need with #. You will also need to change 'QueueName' to a queue that exists on your system. For example, to run on SLURM your file should look like this;

#!/bin/sh

# Script for starting Satsuma jobs on different job submission environments
# One section only should be active, ie. not commented out

# Usage: satsuma_run.sh <current_path> <kmatch_cmd> <ncpus> <mem> <job_id> <run_synchronously>
# mem should be in Gb, ie. 100Gb = 100

# no submission system, processes are run locally either synchronously or asynchronously
#if [ "$6" -eq 1 ]; then
#  eval "$2"
#else
#  eval "$2" &
#fi

##############################################################################################################
## For the sections below you will need to change the queue name (QueueName) to one existing on your system ##
##############################################################################################################

# qsub (PBS systems)
#echo "cd $1; $2" | qsub -V -qQueueName -l ncpus=$3,mem=$4G -N $5

# bsub (LSF systems)
#mem=`expr $4 + 1000`
#bsub -o ${5}.log -J $5 -n $3 -q QueueName -R "rusage[mem=$mem]" "$2"

# SLURM systems
echo "#!/bin/sh" > slurm_tmp.sh
echo srun $2 >> slurm_tmp.sh
sbatch -p QueueName -c $3 -J $5 -o ${5}.log --mem ${4}G slurm_tmp.sh

Notes

Parameter choice, execution and data preparation

Making SatsumaSynteny2 converge (a temporary note)

Given a new and more exhaustive convergence model, which is still under active development, Satsuma2 may fail to converge into a single final result, and rather enter an iteration cycle, where lots of small (or not so small) changes are made to the general alignment search strategy. Instead of hiding this behaviour under a fixed cutoff after a number of iterations, we have chosen to expose it, and allow the user to examine and choose the intermediate result that best suits the biological question.

For this reason, we have introduced the parameter -dump_cycle_matches, which will produce an output file on each cycle. Because these output files are not particularly large and contain the whole information of the cycle, we recommend to turn this parameter on, unless you're running on datasets where you already know that the convergence setup will work correctly. You can then examine the convergence (probably using the MatchesByFeature tool if one of your genomes is annotated) and decide which solution(s) best suits your objectives.

We are working on generalising the convergence model so it behaves well under most circumstances, but still this will always be a recommendation when starting to run SatsumaSynteny2 in new scenarios.

Output files

/satsuma_summary.chained.out: final coordinates

Contents:  
    target sequence name  
    first target base  
    last target base  
    query sequence name  
    first query base  
    last query base  
    identity  
    orientation  

EXAMPLE:
chrX    5947    6164    chrX    9153    9360    0.626728    + 
chrX    6270    6452    chrX    9472    9654    0.576923    +

Note: ‘space’ in fasta names is permissible for alignment, but all spaces will be replaced with “_” in the output files.

/MergeXCorrMatches.chained.out: final readable alignments

EXAMPLE:
Query chr24 [29727636-29727834] vs target scaffold_24 [1206-1404] + length 198 check 198
Identity (w/ indel count): 52.5253 %
-------------------------------------------------------------------------------

TCCCCACTTCTAAAGTAAACTGCACATAGGGACTTCTTTCCAAAGAGCACAGTCTGGAAAGGAGGGAAAAACAATTTTAC
       ||  |||      |  | ||     ||   ||||||| |  || ||  | |||||    || || ||||| ||
ATATATTTTTAAAATATCTATTAAAATCAAACCTATGTTCCAAATATTACGGTACGAAAAGGGAAAAATAAGAATTTCAC

WYMYMWYTTYWAAAKWWMWMTKMAMATMRRRMCTWYKTTCCAAAKAKYACRGTMYGRAAAGGRRRRAAWAASAATTTYAC AMB
-------------------------------------------------------------------------------

AGTCTATAAACCTGATAAACACTACCTCAGCCAGGTGCTCAAGGGCAACATCAAGACTCGTAAGTCATGTTGATAGTAGA
||   | ||  |||  |  ||| |||| | ||| ||| ||||||  ||  |||    || | |||||  ||||||  |
AGCAAAGAAGTCTGGCAGTCACCACCTTAACCAAGTGATCAAGGTTAATGTCACTGATCATGAGTCACATTGATATAATG

AGYMWAKAARYCTGRYARWCACYACCTYARCCARGTGMTCAAGGKYAAYRTCAMKRMTCRTRAGTCAYRTTGATAKWAKR AMB
-------------------------------------------------------------------------------

TCCTATTGATATGCTTTGCAAGGACAGAGTAATTGACA
| |     ||||| | ||    ||  |     |   |
TACCCCCCATATGATGTGATGAGAAGGGCATTTCACCT

TMCYMYYSATATGMTKTGMWRRGAMRGRSWWWTYRMCW AMB

/xcorr_aligns_final.out - results in binary format for downstream analysis

Other Satsuma tools

Run each tool with no arguments to see available options.

Alignment tool

Visualisation tools

Scaffold synteny tools

Other useful tools

./MatchesByFeature GFF_file GFF_feature1 GFF_feature2 GFF_feature3 GFF_feature4 GFF_feature5 GFF_feature6 GFF_feature7 match_file1 match_file2

For example, the following command will show matches to exon and CDS features defined in the GFF file genome.gff using match files match1 and match2;

./MatchesByFeature genome.gff exon CDS - - - - - match1 match2