CMU-SAFARI / Apollo

Apollo is an assembly polishing algorithm that attempts to correct the errors in an assembly. It can take multiple sets of reads in a single run and polish the assemblies of genomes of any size. It is described in the Bioinformatics journal paper (2020) by Firtina et al., available at https://people.inf.ethz.ch/omutlu/pub/apollo-technology-independent-genome-assembly-polishing_bioinformatics20.pdf
GNU General Public License v3.0

Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm

Apollo is an assembly polishing algorithm that attempts to correct the errors in an assembly. It can take multiple sets of reads in a single run and polish the assemblies of genomes of any size.

Installing Apollo

git clone https://github.com/CMU-SAFARI/Apollo.git apollo
cd ./apollo
make
cd ./bin

Now you can copy this binary wherever you want (preferably into a directory that is included in your $PATH). Assuming that you are in the directory where the binary is located, you may run the command below to display the help message.

./apollo -h

Assembly polishing

Polishing using a single set of reads (i.e., non-hybrid):

Assume that you have 1) an assembly assembly.fasta, 2) a set of reads reads.fasta, 3) the alignment file alignment.bam that contains the alignment of the reads to the assembly, and 4) you would like to store the polished assembly as polished.fasta. The command below uses 30 threads while polishing the assembly:

./apollo -a assembly.fasta -r reads.fasta -m alignment.bam -t 30 -o polished.fasta

The resulting FASTA file polished.fasta is the final output of Apollo.

Polishing using a hybrid set of reads:

Assume that you have 1) an assembly assembly.fasta, 2) a hybrid set of reads reads1.fasta and reads2.fasta, 3) the alignments of these reads to the assembly stored in alignment1.bam and alignment2.bam, respectively, and 4) you would like to store the polished assembly as polished.fasta. The command below uses 30 threads while polishing the assembly:

./apollo -a assembly.fasta -r reads1.fasta -r reads2.fasta -m alignment1.bam -m alignment2.bam -t 30 -o polished.fasta

The resulting FASTA file polished.fasta is the final output of Apollo.
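The hybrid command pairs each read set with its alignment file in order (the first -r with the first -m, and so on). The sketch below shows one way to assemble such a command for any number of read sets. The function build_apollo_cmd is a hypothetical helper, not part of Apollo; it only prints the command, uses only the options shown above, and assumes file names without spaces.

```shell
# Hypothetical helper (not part of Apollo): print the Apollo command for one
# or more read sets, keeping every reads file paired with its alignment file.
# Usage: build_apollo_cmd ASSEMBLY OUTPUT THREADS READS1 BAM1 [READS2 BAM2 ...]
build_apollo_cmd() {
    assembly=$1; output=$2; threads=$3
    shift 3
    # The remaining arguments must be (reads, alignment) pairs
    if [ $# -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "error: read sets and alignment files must come in pairs" >&2
        return 1
    fi
    reads_args=""; bam_args=""
    while [ $# -gt 0 ]; do
        reads_args="$reads_args -r $1"   # i-th read set
        bam_args="$bam_args -m $2"       # i-th alignment file
        shift 2
    done
    echo "./apollo -a $assembly$reads_args$bam_args -t $threads -o $output"
}

# Prints the same hybrid command as above
build_apollo_cmd assembly.fasta polished.fasta 30 \
    reads1.fasta alignment1.bam reads2.fasta alignment2.bam
```

Printing the command instead of running it makes it easy to inspect the pairing before starting a long polishing run.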

Supported and Required Input Files

Alignment File

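#convert a SAM file to BAM
samtools view -hb input.sam > input.bam
#keep only the aligned reads (-F4) and sort them by coordinate
samtools view -h -F4 input.bam | samtools sort -m 16G -l0 > input_sorted.bam
#index the sorted BAM file
samtools index input_sorted.bam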
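As a quick check that an alignment file is coordinate sorted, its header can be inspected: the SAM specification records the sort order in the SO tag of the @HD line, and samtools sort writes SO:coordinate there. The sketch below is a minimal example of that check; is_coordinate_sorted is a hypothetical helper name, not part of Apollo or samtools.

```shell
# Hypothetical helper: read a SAM/BAM header from standard input and succeed
# only if its @HD line declares coordinate sorting (SO:coordinate).
is_coordinate_sorted() {
    awk '/^@HD/ && /SO:coordinate/ { found = 1 } END { exit !found }'
}

# With samtools installed, the header of a real file can be piped in:
#   samtools view -H input_sorted.bam | is_coordinate_sorted && echo "coordinate sorted"

# Demonstration on a literal header line
printf '@HD\tVN:1.6\tSO:coordinate\n' | is_coordinate_sorted && echo "coordinate sorted"
```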

Set of Reads

>read1
TAT
TAT
ATT
A

or in a single line:

>read1
TATTATATTA

The restriction on the number of characters per line (within a read, every sequence line except the last must have the same length) is required because Apollo constructs an index file (i.e., FAI file) for the input read set. Further information about indexing and its requirements can be found at: https://seqan.readthedocs.io/en/master/Tutorial/InputOutput/IndexedFastaIO.html
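If a read set mixes line lengths, one simple way to satisfy the restriction is to rewrite the file so that every read's sequence sits on a single line. The sketch below does this with awk; unwrap_fasta is a hypothetical helper name and the file names are placeholders, none of which come from Apollo itself.

```shell
# Hypothetical helper (not part of Apollo): rewrite a FASTA file so that the
# sequence of every read is on a single line.
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }  # header: flush previous sequence
             { seq = seq $0 }                                      # sequence line: append to current read
        END  { if (seq != "") print seq }                          # flush the last read
    ' "$1"
}

# Example on the multi-line record shown above
tmpdir=$(mktemp -d)
printf '>read1\nTAT\nTAT\nATT\nA\n' > "$tmpdir/wrapped.fasta"
unwrap_fasta "$tmpdir/wrapped.fasta" > "$tmpdir/single_line.fasta"
cat "$tmpdir/single_line.fasta"
```

For the record above, the rewritten file contains the header >read1 followed by the single sequence line TATTATATTA.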

./apollo -a assembly.fasta -r reads1.fasta -r reads2.fasta -m alignment1.bam -m alignment2.bam -t 30 -o polished.fasta -c 1000

Example run

You may use the following test run to check whether everything works as intended with Apollo. Note that you need curl to download the required files, minimap2 to map the reads to the assembly, and samtools to sort and index the alignment.

#create a test folder
mkdir test; cd test
#download a read set made publicly available by PacBio and keep only a small number of reads, as this is just a sanity check
curl -s http://datasets.pacb.com.s3.amazonaws.com/2014/c_elegans/additional_data/2590969/0002/Analysis_Results/m140928_104939_ethan_c100699582550000001823139903261541_s1_p0.3.subreads.fasta | head -5000 > pacbio.fasta
#download the already constructed assembly
curl -L -o assembly.fasta http://datasets.pacb.com.s3.amazonaws.com/2014/c_elegans/40X/polished_assembly/polished_assembly.fasta
#generate the read-to-assembly alignment file (aligned reads only, coordinate sorted)
minimap2 -x map-pb -a assembly.fasta pacbio.fasta | samtools view -h -F4 | samtools sort -m 16G -l0 > alignment.bam
#indexing the alignment file
samtools index alignment.bam
#polishing. Here we assume that "apollo" is in your $PATH. If not you should specify the exact path to "apollo"
apollo -a assembly.fasta -r pacbio.fasta -m alignment.bam -o polished.fasta -c 1000
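After the test run finishes, a quick way to inspect the result is to count the sequences in each FASTA file, which makes an empty or truncated output easy to spot. The sketch below is only an illustration: count_fasta_records is a hypothetical helper, and whether the polished file is expected to hold exactly as many sequences as the input assembly is an assumption not confirmed here.

```shell
# Hypothetical helper: print the number of sequences (header lines starting
# with ">") in a FASTA file. "|| true" keeps the exit status at 0 when the
# count is 0, since grep -c exits non-zero if nothing matches.
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# After the test run, the two files could be compared like this:
#   count_fasta_records assembly.fasta
#   count_fasta_records polished.fasta

# Demonstration on a small temporary file
demo=$(mktemp)
printf '>contig1\nACGT\n>contig2\nGGCC\n' > "$demo"
count_fasta_records "$demo"   # prints 2
rm -f "$demo"
```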

Problems You May Encounter

Input Format

Publication and citing Apollo

If you would like to cite Apollo, please cite the following publication:

Can Firtina, Jeremie S. Kim, Mohammed Alser, Damla Senol Cali, A. Ercument Cicek, Can Alkan, and Onur Mutlu, "Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm" Bioinformatics, btaa179, 2020. doi:10.1093/bioinformatics/btaa179