NCBI-Hackathons / MASSIF-BLAST

A toolkit of pipelines to assess and repair badly assembled genomes
MIT License
MASSIF-BLAST

Modular ASSembly Improvement Framework using BLAST

Introduction

Over the past 10 years, the quality of genome sequencing and assembly has improved as both experimental and analysis techniques have advanced. However, badly assembled genomes are still used in research, particularly when the assembly is old or the genome is complex.

We offer a suite of modules that quickly assess the quality of a genome assembly and improve it, with the goal of doing both in one pipeline.

Currently, a user can run individual modules depending on their needs. Pipelining the modules so that users can run them in tandem will be added in the future. Other modules, such as an annotator of exogenous virus contamination or a polyploid detector, can be easily incorporated later on.

Setup

System

The system requires Linux with Docker support. Depending on the module, the size of the genome of interest, and the amount of supporting data, the pipeline can require a significant amount of computational resources. We recommend a computer with at least 4 processor cores, 16 GB of RAM, and approximately 3 GB of hard drive space.
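As a quick way to compare a machine against the recommendations above, here is a minimal sketch of a pre-flight check. It is not part of MASSIF-BLAST: the function names `at_least`, `report`, and `check_system` are made up for illustration, and the script assumes a Linux host where `nproc`, `/proc/meminfo`, and `df` are available.

```shell
#!/bin/sh
# Sketch only: these helpers are illustrative and not shipped with MASSIF-BLAST.

# at_least VALUE MINIMUM: succeed when VALUE >= MINIMUM (both integers).
at_least() {
    [ "$1" -ge "$2" ]
}

# report LABEL VALUE MINIMUM: print one OK/LOW line for a resource.
report() {
    if at_least "$2" "$3"; then
        echo "OK   $1: $2 (recommended >= $3)"
    else
        echo "LOW  $1: $2 (recommended >= $3)"
    fi
}

# check_system: measure this machine and compare it with the README minimums.
check_system() {
    cores=$(nproc)
    # MemTotal is given in kB; round to the nearest GB because the kernel
    # reports slightly less than the installed amount.
    ram_gb=$(awk '/^MemTotal:/ {printf "%.0f", $2 / 1024 / 1024}' /proc/meminfo)
    # Free space in the current directory, in whole GB (POSIX df -Pk layout).
    disk_gb=$(df -Pk . | awk 'NR == 2 {printf "%d", $4 / 1024 / 1024}')

    report "processor cores" "$cores" 4
    report "RAM (GB)" "$ram_gb" 16
    report "free disk (GB)" "$disk_gb" 3

    if command -v docker >/dev/null 2>&1; then
        echo "OK   docker found on PATH"
    else
        echo "LOW  docker not found on PATH"
    fi
}
```

Calling `check_system` from the directory where output will be written prints one line per resource. The thresholds are the recommended values stated above, not hard limits enforced by the modules.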

Use

Please see our docker README.md

Module 1 - Pipeline for the Quick Assessment of a Genome Assembly

This module takes a list of conserved genes, which can be defined at the phylum taxonomic level or lower, and compares the protein sequences of a reference genome to those in the assembly of interest. Poor assemblies will likely exhibit frameshifts or incomplete protein sequences. The output is a tab-delimited file that reports protein quality, including how many of the conserved genes had a poor protein sequence in the assembly of interest.
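A tab-delimited report like this is easy to summarize from the command line. The sketch below is only an illustration: the column layout of the real Module 1 output is not documented here, so the two-column format (gene name, then a status of `ok`, `frameshift`, or `incomplete`) and the function name `count_poor` are assumptions.

```shell
#!/bin/sh
# Sketch only. ASSUMED input format (hypothetical, not the documented output):
#   line 1   header
#   column 1 conserved gene name
#   column 2 status: ok, frameshift, or incomplete

# count_poor FILE: print how many conserved genes have a status other than "ok".
count_poor() {
    awk -F '\t' '
        NR == 1 { next }          # skip the header line
        { total++ }               # every remaining line is one conserved gene
        $2 != "ok" { poor++ }     # frameshift or incomplete protein sequence
        END { printf "%d of %d conserved genes flagged\n", poor, total }
    ' "$1"
}
```

A high flagged fraction would suggest that the assembly is a candidate for the improvement modules described below.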

Use Cases

Workflow

[Figure: Module 1 workflow]

Use

Module 2 - Assembly Improving Pipeline with DNA-sequencing data using Pilon

This module improves assemblies using DNA-sequencing data that can either be supplied by the user or pulled from NCBI by querying based on the species name or supplying a list of accession numbers. The output contains statistics about the improved assembly, such as percent of the genome changed and number of gaps and mismatches, as well as positional information about where these improvements were made.

Use Cases

Workflow

[Figure: Module 2 workflow]

Use

To run module 2 with SRA accession numbers:
sh module2.sh --genome sample_genome.fna --acc ACC1 ACC2 ACC3

To run module 2 with local SRA data files:
sh module2.sh --genome sample_genome.fna --dnafile file1.fna file2.fna

An output directory can be specified with --outdir; otherwise, output is saved to the directory in which the program was run. To keep all intermediate data, use the '-k' or '--keep' option. A usage or help page is available with the '-h' or '--help' option.
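The options above can be combined in a single call. The following sketch prints the resulting command line instead of running it, so it can be inspected first. The helper name `module2_cmd` is hypothetical, and placing --outdir and --keep after the accession list is an assumption about how module2.sh parses its arguments.

```shell
#!/bin/sh
# Sketch only: module2_cmd is an illustrative helper, not part of MASSIF-BLAST.

# module2_cmd GENOME OUTDIR ACC...: print a module 2 command line that uses
# SRA accession numbers, a chosen output directory, and keeps intermediates.
module2_cmd() {
    genome=$1
    outdir=$2
    shift 2    # the remaining arguments are the accession numbers
    # ASSUMPTION: option order after the accession list is accepted by module2.sh.
    printf '%s\n' "sh module2.sh --genome $genome --acc $* --outdir $outdir --keep"
}
```

For example, `module2_cmd sample_genome.fna results ACC1 ACC2` prints `sh module2.sh --genome sample_genome.fna --acc ACC1 ACC2 --outdir results --keep`. Only flags named in this README are used.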

Module 3 - Assembly Improving Pipeline with RNA-sequencing data using rascaf

This module improves assemblies using RNA-sequencing data that can either be supplied by the user or pulled from NCBI by querying based on the species name or supplying a list of accession numbers. The output contains statistics about the improved assembly, such as percent of the genome changed and number of gaps and mismatches, as well as positional information about where these improvements were made.

Use Cases

Workflow

[Figure: Module 3 workflow]

Use

To run module 3 with SRA accession numbers:
sh module3.sh --genome sample_genome.fna --acc ACC1 ACC2 ACC3

To run module 3 with local SRA data files:
sh module3.sh --genome sample_genome.fna --rnafile file1.fna file2.fna

An output directory can be specified with --outdir; otherwise, output is saved to the directory in which the program was run. To keep all intermediate data, use the '-k' or '--keep' option. A usage or help page is available with the '-h' or '--help' option.
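When there are many accession numbers, it can be convenient to keep them in a text file rather than typing them on the command line. This is a minimal sketch, assuming a file with one accession per line. The helper `read_accessions` and the file name `accessions.txt` are hypothetical and not part of MASSIF-BLAST.

```shell
#!/bin/sh
# Sketch only: read_accessions is an illustrative helper, not part of MASSIF-BLAST.

# read_accessions FILE: print the accessions in FILE on one line, separated by
# spaces. Blank lines and lines starting with '#' are skipped.
read_accessions() {
    awk '!/^#/ && NF { printf "%s%s", sep, $1; sep = " " } END { print "" }' "$1"
}
```

The result can then be passed to the --acc option shown above, for example `sh module3.sh --genome sample_genome.fna --acc $(read_accessions accessions.txt)`. The command substitution is left unquoted on purpose so that each accession becomes a separate argument.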

Testing

Future Directions

Here are some ideas for future modules!

People