appliedbinf / URDO-SMOREd

Sequence Matching fOr REspiratory Diseases, SMORE'D, is a command-line sequence classification tool tailored to meet the needs of the Undiagnosed Respiratory Disease Outbreak (URDO) branch at CDC. SMORE'D is a k-mer based classification tool capable of rapidly classifying read sequences generated by multi-pathogen detection platforms.


Readme for SMORE'D

Overview

Sequence Matching fOr REspiratory Diseases, SMORE'D, is a command-line sequence classification tool tailored to meet the needs of the Undiagnosed Respiratory Disease Outbreak (URDO) branch at CDC. SMORE'D is a k-mer based classification tool capable of rapidly classifying read sequences generated by multi-pathogen detection platforms. These platforms use targeted amplification and whole-genome sequencing of bacterial and viral organisms in clinical samples, generating datasets of unidentified amplicon sequences. SMORE'D classifies these amplicon sequences at the level of annotation desired for each target-specific assay, whether that is identification or phenotypic characterization.

Using a complete and well-curated database of representative target sequences as input, SMORE'D works in two steps. First, it builds a k-mer database for the supplied representative target sequences (this is done only once for a given set of target sequences). Second, it classifies amplicons from paired-end reads and generates a report of all identified targets and the number of reads matching each target. Optionally, SMORE'D also creates a single-sample report in Excel format that summarizes the sample and provides read counts and relative abundance of all identified organisms.
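The two-step idea described above (index the target k-mers once, then assign each read to a target) can be illustrated with a toy sketch. This is not SMORE'D's actual implementation: the function names, the toy sequences, and the simple "most shared k-mers wins" rule are assumptions made for illustration, and paired-end handling and reverse complements are left out.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(targets, k):
    """Step 1: map each k-mer to the set of target names that contain it."""
    index = {}
    for name, seq in targets.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index

def classify_read(read, index, k):
    """Step 2: return the target sharing the most k-mers with the read, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical toy targets; real references would come from multi-FASTA files.
targets = {
    "targetA": "ACGTACGTTAGCCGATAGCTAGGCT",
    "targetB": "TTGACCGGTATATCCGGAACTTGCA",
}
k = 5  # tiny k for the toy data only
index = build_index(targets, k)

reads = ["CGTTAGCCGATA", "CGGTATATCCGG", "GGGGGGGGGGGG"]
counts = Counter(classify_read(read, index, k) for read in reads)
for target, n in counts.items():
    print(target, n)
```

The per-target counts at the end correspond loosely to the "number of reads matching each target" in the report; reads with no shared k-mer are counted under None.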

=============================================================================================

Usage

smored
[--buildDB]
[--predict]
[-1 filename_fastq1][--fastq1 filename_fastq1]
[-2 filename_fastq2][--fastq2 filename_fastq2]
[-d directory][--dir directory][--directory directory]
[-c][--config]
[-P][--prefix]
[-a]
[-k]
[-o output_filename][--output output_filename]
[-x][--overwrite]
[-r]
[-v]
[-h][--help]

==============================================================================================

There are two steps to sequence matching using smored.

  1. Create DB : smored --buildDB
  2. Predict : smored --predict

1. smored --buildDB

Synopsis: smored --buildDB -c <config file> -k <kmer length(optional)> -P <DB prefix(optional)>
config file : a tab-delimited file that lists the reference sequences (amplicons), their multi-FASTA files, and the profile definition file. Format :

[loci]  
amplicon    ampliconFile
[profile]
profile   profileFile

kmer length : the k-mer length for the db. Note that it should be smaller than the length of the reads to be processed.
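Why the k-mer length has to stay below the read length can be seen from simple arithmetic: a read of length L contains L - k + 1 overlapping k-mers, so a read no longer than k yields at most one k-mer, and a shorter read yields none. The helper below is a hypothetical illustration, not part of smored; it uses the default k = 35.

```python
def kmer_count(read_length, k):
    """Number of overlapping k-mers in a read of the given length."""
    return max(read_length - k + 1, 0)

# Compare a few read lengths against the default k-mer size of 35.
for read_length in (150, 75, 35, 30):
    print(read_length, kmer_count(read_length, 35))
# 150 116
# 75 41
# 35 1
# 30 0
```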

Required arguments
--buildDB
Identifier for build db module
-c,--config = <configuration file>
Config file in the format described above.

Optional arguments
-k = <kmer length>
Kmer size for which the db has to be formed (Default k = 35).
-P,--prefix = <prefix>
Prefix for db and log files to be created (Default = kmer). You can also specify the folder where you want the db to be created.
-a
File location to write build log
-h,--help
Prints the help manual for this application
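As a rough illustration of the config file format described for --buildDB, the sketch below writes a tab-delimited file with a [loci] and a [profile] section. The amplicon names and file paths are hypothetical. It also assumes that the first column under [loci] is an amplicon name and that the key under [profile] is the literal word "profile"; neither assumption is confirmed by this README.

```python
import os
import tempfile

def write_config(path, loci, profile_file):
    """Write a tab-delimited config with [loci] and [profile] sections."""
    lines = ["[loci]"]
    # One line per amplicon: name, a tab, then its multi-FASTA file.
    lines += [f"{name}\t{fasta}" for name, fasta in loci.items()]
    # Assumption: the profile entry uses the literal key "profile".
    lines += ["[profile]", f"profile\t{profile_file}"]
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")

# Hypothetical amplicon names and paths, for illustration only.
loci = {
    "ampliconA": "refs/ampliconA.fasta",
    "ampliconB": "refs/ampliconB.fasta",
}
config_path = os.path.join(tempfile.mkdtemp(), "config.txt")
write_config(config_path, loci, "refs/profile.txt")

with open(config_path) as handle:
    config_text = handle.read()
print(config_text)
```

The resulting file would then be passed to the build step with -c.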


2. smored --predict

smored --predict can run in two modes: 1) single sample (the default mode) and 2) multi-sample, which runs smored for all the samples in a folder.

Synopsis: smored --predict -1 <fastq file> -2 <fastq file> -d <directory location> -P <DB prefix(optional)> -k <kmer length(optional)> -o <output file> -x
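For scripted runs, the single-sample form of this synopsis could be assembled as an argument list, roughly as sketched here. The helper name and the file names are hypothetical, and only flags shown in the synopsis are used; any further arguments the tool requires are not covered. Nothing is executed.

```python
import shlex

def predict_command(fastq1, fastq2, output, prefix=None, k=None, overwrite=False):
    """Assemble the argument list for a single-sample `smored --predict` run."""
    cmd = ["smored", "--predict", "-1", fastq1, "-2", fastq2]
    if prefix is not None:
        cmd += ["-P", prefix]   # DB prefix (optional)
    if k is not None:
        cmd += ["-k", str(k)]   # k-mer length (optional)
    cmd += ["-o", output]
    if overwrite:
        cmd.append("-x")        # overwrite flag from the usage summary
    return cmd

# Hypothetical file names, for illustration only.
cmd = predict_command(
    "sample_R1.fastq", "sample_R2.fastq", "sample_results.txt",
    prefix="kmer", k=35, overwrite=True,
)
print(shlex.join(cmd))
```

The list could then be handed to `subprocess.run(cmd, check=True)` once smored is installed and a database with the matching prefix has been built.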

Required arguments
-c,--config = <configuration file>
Config file in the format described above.

-1,--fastq1 = <fastq1_filename>
Path to first fastq file for paired end sample.

Optional arguments
--predict
Identifier for predict module - this is the default function of SMORE'D
-1,--fastq1 = <fastq1_filename>
Path to first fastq file for paired end sample.