Modular ASSembly Improvement Framework using BLAST
Over the past 10 years, the quality of genome sequencing and assembly has improved with the advancement of both experimental and analysis techniques. However, badly assembled genomes are still used in research, particularly if the assembly is old or if the genome is complex.
We offer a suite of modules that quickly assess the quality of a genome assembly and improve it, all in one pipeline.
Currently, a user can run individual modules depending on their needs. Pipelining the modules so that users can run them in tandem will be added in the future. Other modules, such as an annotator of exogenous virus contamination or a polyploid detector, can be easily incorporated later on.
The system requires Linux with Docker support. Depending on the module, the size of the genome of interest, and the amount of supporting data, the pipeline can require significant computational resources. We recommend a computer with at least 4 processor cores, 16 GB of RAM, and approximately 3 GB of hard drive space.
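The recommended minimums above can be checked before launching a module. The following is a minimal sketch, assuming a Linux host where nproc, /proc/meminfo, and df are available; the check function and the variable names are illustrative and are not part of the framework.

```shell
#!/bin/sh
# Pre-flight check against the recommended minimums (4 cores, 16 GB RAM, 3 GB disk).
# Assumption: Linux host providing nproc, /proc/meminfo, and POSIX df.

MIN_CORES=4
MIN_RAM_KB=$((16 * 1024 * 1024))   # 16 GB expressed in kB
MIN_DISK_KB=$((3 * 1024 * 1024))   # 3 GB expressed in kB

# check NAME ACTUAL MINIMUM: print a verdict; return non-zero if below the minimum
check() {
    if [ "$2" -ge "$3" ]; then
        echo "ok: $1 ($2 >= $3)"
    else
        echo "low: $1 ($2 < $3)"
        return 1
    fi
}

cores=$(nproc)
ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# Column 4 of POSIX df output is the available space in 1024-byte blocks
disk_kb=$(df -Pk . | awk 'NR == 2 {print $4}')

status=0
check "processor cores" "$cores" "$MIN_CORES" || status=1
check "RAM (kB)" "$ram_kb" "$MIN_RAM_KB" || status=1
check "free disk (kB)" "$disk_kb" "$MIN_DISK_KB" || status=1

if [ "$status" -ne 0 ]; then
    echo "warning: this machine is below the recommended minimums" >&2
fi
```

MemTotal usually reports slightly less than the installed RAM, so a machine with nominally 16 GB may be flagged as low; a near miss on that line can reasonably be treated as a pass.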
Please see our Docker README.md.
This module takes a list of conserved genes, which can be at the phylum taxonomic level (or lower), and compares the protein sequences of a reference genome to those in the assembly of interest. Poor assemblies will likely exhibit frameshifts or incomplete protein sequences. The output is a tab-delimited file that compares protein quality and reports how many of the conserved genes had a poor protein sequence in the assembly of interest.
This module improves assemblies using DNA-sequencing data that can either be supplied by the user or pulled from NCBI by querying based on the species name or supplying a list of accession numbers. The output contains statistics about the improved assembly, such as percent of the genome changed and number of gaps and mismatches, as well as positional information about where these improvements were made.
To run module 2 with SRA accession numbers:
sh module2.sh --genome sample_genome.fna --acc ACC1 ACC2 ACC3
To run module 2 with local SRA data files:
sh module2.sh --genome sample_genome.fna --dnafile file1.fna file2.fna
An output directory can be specified with --outdir; otherwise, output is saved to the directory in which the program was run. To keep all intermediate data, use the '-k' or '--keep' option. A usage or help page is available with the '-h' or '--help' option.
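When the list of accession numbers is long, it may be easier to keep it in a text file and expand it on the command line. The sketch below assumes a hypothetical file with one accession per line; the function names and the DRY_RUN switch are illustrative, and the order of the flags (with the --acc list placed last, as in the example above) is an assumption rather than a documented requirement.

```shell
#!/bin/sh
# Feed accession numbers from a text file to module 2.
# Assumption: the accession file (hypothetical) lists one accession per line;
# blank lines and lines starting with '#' are ignored.

# read_accessions FILE: print the accessions on a single space-separated line
read_accessions() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | tr '\n' ' ' | sed 's/ *$//'
}

# run_module2 GENOME ACC_FILE OUTDIR: call module 2 with the flags documented above
run_module2() {
    accs=$(read_accessions "$2")
    # $accs is left unquoted on purpose so each accession becomes its own argument.
    # Set DRY_RUN=1 to print the command instead of running it.
    ${DRY_RUN:+echo} sh module2.sh --genome "$1" --outdir "$3" --acc $accs
}

# Example call (commented out so that sourcing this file runs nothing):
# run_module2 sample_genome.fna accessions.txt module2_results
```

Running with DRY_RUN=1 first is a cheap way to confirm that the expanded command looks right before any data is downloaded.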
This module improves assemblies using RNA-sequencing data that can either be supplied by the user or pulled from NCBI by querying based on the species name or supplying a list of accession numbers. The output contains statistics about the improved assembly, such as percent of the genome changed and number of gaps and mismatches, as well as positional information about where these improvements were made.
To run module 3 with SRA accession numbers:
sh module3.sh --genome sample_genome.fna --acc ACC1 ACC2 ACC3
To run module 3 with local SRA data files:
sh module3.sh --genome sample_genome.fna --rnafile file1.fna file2.fna
An output directory can be specified with --outdir; otherwise, output is saved to the directory in which the program was run. To keep all intermediate data, use the '-k' or '--keep' option. A usage or help page is available with the '-h' or '--help' option.
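Because output defaults to the current directory, running module 3 on several candidate assemblies of the same species can overwrite earlier results unless each run gets its own --outdir. The following sketch shows one way to do that; the results/ directory layout, the function names, and the DRY_RUN switch are assumptions made for illustration, and only the flags documented above are used.

```shell
#!/bin/sh
# Run module 3 once per assembly, writing each run to its own output directory.

# outdir_for GENOME: derive an output directory from the genome file name,
# e.g. data/sample_genome.fna -> results/sample_genome
outdir_for() {
    base=$(basename "$1")
    echo "results/${base%.*}"
}

# run_module3_batch RNAFILE GENOME...: run module 3 for every genome given
run_module3_batch() {
    rnafile=$1
    shift
    for genome in "$@"; do
        outdir=$(outdir_for "$genome")
        # Create the directory only for real runs; DRY_RUN=1 just prints commands
        [ -n "${DRY_RUN:-}" ] || mkdir -p "$outdir"
        ${DRY_RUN:+echo} sh module3.sh --genome "$genome" --outdir "$outdir" --keep --rnafile "$rnafile"
    done
}

# Example call (commented out so that sourcing this file runs nothing):
# run_module3_batch file1.fna assembly_v1.fna assembly_v2.fna
```

The --keep flag is included so that intermediate data from each run can be compared afterwards; drop it if disk space is tight.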
Here are some ideas for future modules!