CAFS-bioinformatics / LR_Gapcloser

use long sequenced reads to close gaps in assemblies
GNU General Public License v3.0

LR_Gapcloser

LR_Gapcloser is a gap-closing tool that uses long reads from the studied species. The long reads can be downloaded from a public read archive (for instance, the NCBI SRA database) or be your own data. They are fragmented and aligned to the scaffolds using the BWA-MEM algorithm in the BWA package. A compiled bwa is provided in the package, so the user does not need to install bwa. LR_Gapcloser uses the alignments to find the reads that bridge a gap, and then fills the original long-read sequence into the genomic gaps.
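The fragmentation step can be pictured with a short Python sketch. This is an illustration only, not the tool's own code: the function name split_into_tags is hypothetical, the 300 bp tag length is taken from the default of the -g option, and keeping a shorter final tag is an assumption about how the remainder of a read might be handled.

```python
def split_into_tags(read_seq, tag_len=300):
    """Cut a read into consecutive, non-overlapping tags of tag_len bases.

    Returns (offset, tag) pairs. The final tag may be shorter than tag_len
    (an assumption; the real tool may treat the remainder differently).
    """
    if tag_len <= 0:
        raise ValueError("tag_len must be positive")
    return [
        (start, read_seq[start:start + tag_len])
        for start in range(0, len(read_seq), tag_len)
    ]


if __name__ == "__main__":
    toy_read = "ACGT" * 200  # an 800 bp toy read
    for offset, tag in split_into_tags(toy_read):
        print(offset, len(tag))
```

Keeping the offset of each tag alongside its sequence is what would let a later step map an aligned tag back to its position in the original read.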

SYSTEM REQUIREMENTS

(1) Perl and BioPerl should be installed on the system.

(2) GLIBC 2.14 should be installed.
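A quick way to check the C library version from Python is sketched below. It assumes the requirement means glibc 2.14 or newer, which the text does not state explicitly. The helper names parse_version and glibc_is_sufficient are hypothetical; platform.libc_ver() is part of the Python standard library.

```python
import platform

REQUIRED_GLIBC = (2, 14)  # taken from the requirement above


def parse_version(text):
    """Turn a dotted version string such as '2.31' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_sufficient(found, required=REQUIRED_GLIBC):
    """Return True if the found version tuple is at least the required one."""
    return found >= required


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        ok = glibc_is_sufficient(parse_version(version))
        print(f"glibc {version}:", "OK" if ok else "too old")
    else:
        print("could not detect glibc on this system")
```

Versions are compared as integer tuples rather than strings, so that 2.9 is correctly treated as older than 2.14.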

INSTALLING

1) After downloading the software, type "tar -zxvf LR_Gapcloser.tar.gz" in the installation directory. The software is provided as portable precompiled software and does not require any compilation.

2) For convenience, you can then type "export PATH=$PATH:your_directory/LR_Gapcloser/" to add the software to the PATH environment variable.

INPUT FILES

(1) The scaffold file is required and should be in FASTA format. The description line (header line), which begins with '>', provides a unique name and/or identifier for the sequence. The name and/or identifier must not contain "(:", because "(:" is used as a delimiter during data processing.
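The header rule can be checked before a run with a sketch like the following. It is not part of LR_Gapcloser: the function name scaffold_name_problems is hypothetical, and the check treats "(:" as a literal two-character string, which is one reading of the rule above. It also assumes the sequence name is the first whitespace-delimited word of the header.

```python
def scaffold_name_problems(fasta_lines, forbidden="(:"):
    """Return (line_number, header, reason) tuples for problematic headers.

    Flags headers that contain the forbidden delimiter and names that
    occur more than once.
    """
    seen = set()
    problems = []
    for line_number, line in enumerate(fasta_lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        header = line[1:].strip()
        if forbidden in header:
            problems.append((line_number, header, "contains forbidden delimiter"))
        # Assumption: the name is the first word of the header line.
        name = header.split()[0] if header else ""
        if name in seen:
            problems.append((line_number, header, "duplicate name"))
        seen.add(name)
    return problems
```

An open file handle can be passed directly as fasta_lines; an empty result means no header was flagged.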

(2) The long-read file is also required. It should be in FASTA format, and the reads must be error-corrected. A file in FASTQ format should be converted to FASTA format before running the software.
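One possible way to do the FASTQ to FASTA conversion is sketched below. It assumes the common four-line FASTQ layout (header, sequence, '+' separator, quality) with no line-wrapped sequences. The function names fastq_to_fasta and convert_file are hypothetical and not part of LR_Gapcloser.

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of four-line FASTQ records."""
    lines = iter(fastq_lines)
    for header in lines:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = next(lines, None)
        separator = next(lines, None)
        quality = next(lines, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError(f"truncated FASTQ record: {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator after: {header!r}")
        yield ">" + header[1:]          # '@name' becomes '>name'
        yield sequence.rstrip("\r\n")   # quality line is dropped


def convert_file(fastq_path, fasta_path):
    """Convert a FASTQ file on disk to a FASTA file."""
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        for line in fastq_to_fasta(src):
            dst.write(line + "\n")
```

The conversion only changes the file format. It does not error-correct the reads, which has to be done separately.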

COMMANDS AND OPTIONS

LR_Gapcloser is run via the shell script LR_Gapcloser.sh, which can be found in the base installation directory.

Usage info is as follows:

On CentOS, use "sh LR_Gapcloser.sh -i Scaffold_file -l Corrected-PacBio-read_file -s p"

On Ubuntu, use "bash LR_Gapcloser.sh -i Scaffold_file -l Corrected-PacBio-read_file -s p"

Input options

-i the scaffold file that contains gaps, each represented by a run of N characters [ required ]

-l the raw and error-corrected long reads used to close gaps. The file should be in FASTA format. [ required ]

-s sequencing platform: pacbio [p] or nanopore [n] [ default: p ]

-t number of threads (for machines with multiple processors), used in the bwa mem alignment and the subsequent coverage filtering [ default: 5 ]

-c the coverage threshold to select high-quality alignments [ default: 0.8 ]

-a the deviation between gap length and filled sequence length [ default: 0.2 ]

-m to select the reliable tags for gap-closure, the maximal allowed distance from alignment region to gap boundary (bp) [ default: 600 ]

-n the number of files into which all tags are divided [ default: 5 ]

-g the length of the tags into which each long read is divided (bp) [ default: 300 ]

-v the minimal tag alignment length around each boundary of a gap (bp) [ default: 300 ]

-r number of iterations [ default: 3 ]

-o name of output directory [ default: ./Output ]
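When LR_Gapcloser is driven from a pipeline, the command line can be assembled programmatically. The sketch below uses only flags listed above with their documented defaults. The function name build_command and the example file names are hypothetical, and the script is assumed to be reachable from the working directory.

```python
def build_command(scaffolds, reads, platform="p", threads=5, iterations=3,
                  output_dir="./Output", shell="bash",
                  script="LR_Gapcloser.sh"):
    """Return the LR_Gapcloser command as a list of arguments."""
    if platform not in ("p", "n"):
        raise ValueError("platform must be 'p' (pacbio) or 'n' (nanopore)")
    return [
        shell, script,
        "-i", scaffolds,        # scaffold file with gaps
        "-l", reads,            # long reads in FASTA format
        "-s", platform,         # sequencing platform
        "-t", str(threads),     # number of threads
        "-r", str(iterations),  # number of iterations
        "-o", output_dir,       # output directory
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(build_command("scaffolds.fasta", "reads.fasta",
                                 platform="n", threads=8)))
```

The returned list can be passed to subprocess.run(..., check=True) from the Python standard library. Options not shown here (-c, -a, -m, -n, -g, -v) could be appended in the same way.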

OUTPUT FILES

LR_Gapcloser generates a file named gapclosed.fasta in the "iteration-3" subdirectory of the output directory.
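To see how many gaps were closed, the runs of N in the input scaffolds and in gapclosed.fasta can be counted and compared. The sketch below is not part of LR_Gapcloser, and the function name count_gaps is hypothetical. It treats every run of N or n as one gap.

```python
import re


def count_gaps(fasta_lines):
    """Return (number_of_gaps, total_gap_bases) over all sequences."""
    gaps = 0
    gap_bases = 0
    chunks = []  # sequence lines of the record currently being read

    def flush():
        # Join the wrapped lines first, so a gap that spans a line
        # break is counted once.
        nonlocal gaps, gap_bases
        sequence = "".join(chunks)
        for match in re.finditer(r"[Nn]+", sequence):
            gaps += 1
            gap_bases += len(match.group())
        chunks.clear()

    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            flush()  # a new record starts, finish the previous one
        elif line:
            chunks.append(line)
    flush()
    return gaps, gap_bases
```

Running count_gaps on an open handle of the input scaffold file and again on gapclosed.fasta gives a before-and-after comparison of gap number and total gap length.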