Lythimus / PARSES

PARSES (A Pipeline for Analysis of RNA-Sequencing Exogenous Sequences) is a pipeline constructed from existing sequence analysis tools. It lets the user interrogate RNA-Sequencing experiments for possible biological contamination or for exogenous sequences that may shed light on other factors influencing an organism's condition. Built upon Rake, PARSES allows simple analysis of RNA-Sequencing data in an automated and repeatable fashion. Usability, logging, and versatility are the key goals of this pipeline.

PARSES V0.45

For more information, see the wiki.

Requirements

PARSES is intended to be executed with Solexa data on a *NIX-based desktop computer. It may not be used for any sort of financial gain. Licenses for both MEGAN and Novoalign are strictly for non-profit research at non-profit institutions and academic usage.

Supported Data Types

System Requirements

PARSES's system requirements depend directly on the size of the data set being processed. It is recommended to run it on a 64-bit Linux or OS X machine with at least 4 GB of memory. You must also have root privileges on the machine if you are installing software.

Software Requirements

Installation

Execute all commands from the directory containing the data. Place all PARSES scripts in a single directory and mark them as executable. The latest versions of all programs are installed. If an error occurs during installation, a repository of the programs may be used instead, though it is not guaranteed to be up to date; activate it by passing the repo=true argument. Installation is performed automatically during any execution, but it can also be performed manually by invoking any of the following installation commands:

sudo rake -f /rake/file/location install #installs and indexes all resources.
sudo rake -f /rake/file/location novoalignInstall
sudo rake -f /rake/file/location bowtieInstall
sudo rake -f /rake/file/location hgInstall
sudo rake -f /rake/file/location novoIndex
sudo rake -f /rake/file/location bowtieIndex
sudo rake -f /rake/file/location samtoolsInstall
sudo rake -f /rake/file/location tophatInstall
sudo rake -f /rake/file/location abyssInstall
sudo rake -f /rake/file/location blastInstall
sudo rake -f /rake/file/location ntInstall
sudo rake -f /rake/file/location meganInstall
sudo rake -f /rake/file/location parallelIteratorInstall

Examples

Example executions.

First Execution

rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3

Subsequent Executions

rake -f /rake/file/location seq=NameYouGiveToYourSequence

Install using repository of links

rake -f /rake/file/location repo=true install

Run to Specified Point

rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3 localAlignContigs

Run truncated version of PARSES (only execute the specified task and ignore prerequisites)

rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3 truncate=true localAlignContigs

Run truncated version of PARSES (only execute the specified task and ignore prerequisites) and override the file naming schema

rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourFileToProcess type=illumina1.3 truncate=true forcefile=true localAlignContigs

List All Tasks

rake -f /rake/file/location -T

Output Files

datafile.fastq

datafile.fastq.novo

datafile.fastq.novo.NM.fasta

datafile.fastq.novo.NM.fasta.nospans

datafile.fastq.novo.NM.fasta.nospans.blast

datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma

datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa

datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast

datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.megan.rma
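
The names above appear to follow a simple rule: each stage appends its own suffix to the file name produced by the previous stage. The following is a minimal Ruby sketch of that naming scheme. The suffixes are copied from the list above, but STAGE_SUFFIXES and output_files are hypothetical names chosen for illustration and are not taken from the PARSES source.

```ruby
# Suffix appended by each stage, in the order the files are listed above.
# STAGE_SUFFIXES and output_files are illustrative names, not PARSES identifiers.
STAGE_SUFFIXES = [
  ".novo",
  ".NM.fasta",
  ".nospans",
  ".blast",
  ".megan.rma",
  ".kmerOptimized.fa",
  ".blast",
  ".megan.rma"
].freeze

# Returns the input file name followed by the name of every derived file.
def output_files(input_file)
  STAGE_SUFFIXES.each_with_object([input_file]) do |suffix, names|
    names << names.last + suffix
  end
end

# Print the expected file names for a given input, one per line.
puts output_files("datafile.fastq")
```

A helper like this can be handy for checking which intermediate files exist in the data directory before rerunning part of the pipeline.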


Task List

Basic Tasks

  1. alignSequence # Novoalign - Align reads to the base genome.
  2. removeHuman # Xnovotonm - Harvest non-base organism reads.
  3. removeSpans # Tophat - Align spanning reads in order to remove base organism reads.
  4. localAlignReads # BLAST - Associate reads with organisms.
  5. metaGenomeAnalyzeReads # MEGAN - Separate reads into taxonomies.
  6. denovoAssembleCluster # ABySS - Assemble reads associated with clusters of taxonomies.
  7. localAlignContigs # BLAST - Associate contigs with organisms.
  8. metaGenomeAnalyzeContigs # MEGAN - Separate contigs into taxonomies.
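
To make the relationship between this list and the "Run to Specified Point" and truncate=true examples concrete, here is a small Ruby sketch. It assumes the basic tasks form a simple linear chain in which each task depends on the one before it; that is an assumption for illustration, and the actual prerequisite graph is defined in the PARSES Rakefile. BASIC_TASKS and tasks_to_run are hypothetical names.

```ruby
# The eight basic tasks in pipeline order, paired with the tool each one wraps
# (copied from the task list above).
BASIC_TASKS = [
  ["alignSequence", "Novoalign"],
  ["removeHuman", "Xnovotonm"],
  ["removeSpans", "Tophat"],
  ["localAlignReads", "BLAST"],
  ["metaGenomeAnalyzeReads", "MEGAN"],
  ["denovoAssembleCluster", "ABySS"],
  ["localAlignContigs", "BLAST"],
  ["metaGenomeAnalyzeContigs", "MEGAN"]
].freeze

# Hypothetical helper: which tasks run for a given target task.
# Assumes a linear chain of prerequisites; with truncate set, only the
# target itself runs, mirroring the truncate=true examples.
def tasks_to_run(target, truncate: false)
  names = BASIC_TASKS.map(&:first)
  index = names.index(target)
  raise ArgumentError, "unknown task: #{target}" if index.nil?

  truncate ? [target] : names[0..index]
end

puts tasks_to_run("localAlignContigs").inspect
puts tasks_to_run("localAlignContigs", truncate: true).inspect
```

In a real run, Rake may also skip prerequisites whose outputs already exist, which this sketch does not model.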

Special Tasks

Installation Tasks


Analyzing Results

MEGAN

The primary way to view results is to browse the RMA file produced by MEGAN, though a PDF is also produced with standard LCA parameters. To familiarize yourself with MEGAN, you can watch the MEGAN Introduction YouTube video.

Your results will resemble the following:

For a detailed example of how changing the LCA parameters affects the results, click the image below (or here for a PDF version).

Report

Report information is currently stored in the log file.


Logging

A log file named sequenceName.log is produced in the data directory. It records every command executed and any errors returned, and it also contains report information on the results of the data analysis. Below is a list of all information logged at each step of PARSES.

Non-task specific

Initialization

alignSequence (Novoalign)

removeHuman (Xnovotonm)

removeSpans (Tophat)

localAlignReads (BLAST+)

metaGenomeAnalyzeReads (MEGAN)

denovoAssembleCluster (ABySS)

localAlignContigs (BLAST+)

metaGenomeAnalyzeContigs (MEGAN)

Below is an example of a log file:

# Logfile created on Mon Nov 29 20:21:37 -0600 2010 by logger.rb/22285
I, [2010-12-27T14:54:16.761320 #94508]  INFO -- : Begin run for seq=akata file=s_4_sequence_Akata.txt type=illumina1.3 task=-f
I, [2010-12-27T14:54:16.849927 #94508]  INFO -- : Executing PARSES v0.30 in a 0 environment with 64GB of memory and 24 cores with a 64-bit architecture.
I, [2010-12-27T17:56:20.865114 #94508]  INFO -- : tophat -p 24 --solexa1.3-quals --output-dir akata_tophat_out /usr/share/hgChrAll s_4_sequence_Akata.txt
I, [2010-12-27T17:56:20.865442 #94508]  INFO -- : samtools view -h -o akata_tophat_out/accepted_hits.sam akata_tophat_out/accepted_hits.bam
I, [2010-12-27T17:56:20.865475 #94508]  INFO -- : Xextractspans.pl akata_tophat_out/accepted_hits.sam
I, [2010-12-27T17:56:20.865506 #94508]  INFO -- : Xfilterspans.pl s_4_sequence_Akata.txt.novo.NM.fasta akata_tophat_out/accepted_hits.sam.spans
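
The example log uses the default line layout of Ruby's Logger class, so it can be split into fields with a single regular expression. The sketch below is one way to do that; LOG_LINE and parse_log_line are hypothetical names, and the sample lines are copied from the example above.

```ruby
require "time"

# Matches the Logger line layout seen in the example log:
#   I, [2010-12-27T14:54:16.761320 #94508]  INFO -- : message
LOG_LINE = /\A[A-Z], \[(?<time>\S+) #(?<pid>\d+)\]\s+(?<severity>[A-Z]+) -- [^:]*: (?<message>.*)\z/

# Returns a hash of fields for a log entry, or nil for lines that are not
# entries (such as the "# Logfile created on ..." header).
def parse_log_line(line)
  match = LOG_LINE.match(line.chomp)
  return nil if match.nil?

  {
    time: Time.parse(match[:time]),
    pid: match[:pid].to_i,
    severity: match[:severity],
    message: match[:message]
  }
end

sample = <<~'LOG'
  # Logfile created on Mon Nov 29 20:21:37 -0600 2010 by logger.rb/22285
  I, [2010-12-27T17:56:20.865442 #94508]  INFO -- : samtools view -h -o akata_tophat_out/accepted_hits.sam akata_tophat_out/accepted_hits.bam
  I, [2010-12-27T17:56:20.865475 #94508]  INFO -- : Xextractspans.pl akata_tophat_out/accepted_hits.sam
LOG

entries = sample.each_line.map { |line| parse_log_line(line) }.compact
entries.each { |entry| puts "#{entry[:severity]} #{entry[:message]}" }
```

To process a real log, replace the sample text with the contents of sequenceName.log from the data directory.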

Configuration

PARSES Configuration

A configuration file is produced for PARSES in $HOME/.PARSES. It contains the paths to the human genome and NT databases as well as the paths to the Bowtie and TopHat indices. It has the following form:

--- !ruby/object:Settings
bowtieIndex: /usr/share/hgChrAll
humanGenomeDatabase: /usr/share
novoIndex: /usr/share/hgChrAll.ndx
ntDatabase: /usr/share/nt/nt
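
Because the first line carries a Ruby object tag (!ruby/object:Settings), a plain YAML loader will not accept the file as-is unless the Settings class is available. The following sketch shows one way to inspect the settings from a separate script by dropping the tag and reading the rest as an ordinary mapping. The load_parses_config name is hypothetical, and this is not necessarily how PARSES itself reads the file.

```ruby
require "yaml"

# Hypothetical helper: parse a PARSES-style config into a plain Hash.
# The "!ruby/object:ClassName" tag on the first line is removed so that
# YAML.safe_load can treat the document as an ordinary mapping.
def load_parses_config(text)
  plain = text.sub(/\A---\s+!ruby\/object:\S+/, "---")
  YAML.safe_load(plain)
end

# Sample text copied from the example above. For a real file one might use
# File.read(File.join(Dir.home, ".PARSES")) instead.
sample = <<~CONFIG
  --- !ruby/object:Settings
  bowtieIndex: /usr/share/hgChrAll
  humanGenomeDatabase: /usr/share
  novoIndex: /usr/share/hgChrAll.ndx
  ntDatabase: /usr/share/nt/nt
CONFIG

config = load_parses_config(sample)
config.each { |key, value| puts "#{key} -> #{value}" }
```

Reading the file this way is read-only and leaves the original, including its tag, untouched.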

Sequence Configuration

In addition, a configuration file is produced for each sequence in $(pwd)/.sequenceName. Settings for each program are chosen by default but can be changed by editing the sequence file, which has the following form:

--- !ruby/object:Sequence
abyssPath: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma
abyssPathGlob: reads-*.fasta
blast1Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans
blast2Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa
blastOutputFormat: 0
blastPathGlob: reads-*.fasta.*.kmer.contigs.fa.kmerOptimized.fa
dataType: illumina1.3
eValue1: 1.0e-06
eValue2: 100
expansionNumber: 10
filePath: s_4_sequence_Akata.txt
imageFileType: jpg
maxKmerLength: 38
maxMatches: 0
megan1Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast
megan2Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast
minKmerLength: 7
minScoreByLength: 0
minSupport: 5
novoalignPath: s_4_sequence_Akata.txt.novo
pipeEndPath: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.megan.rma
readLength: 38
removeNonMappedPath: s_4_sequence_Akata.txt.novo.NM.fasta
topPercent: 10.0
useCogs: "false"
useGos: "false"
winScore: 0.0

It is recommended that the path variables not be changed. The remaining settings are straightforward, except for expansionNumber, which is the number of times the expansion command is executed when generating a picture from MEGAN.
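
A setting such as expansionNumber can be changed in any text editor. For scripted changes, the sketch below rewrites a single "key: value" line while leaving the tag line and every other setting as they are. The set_setting helper is hypothetical and works purely on the text; it does not validate that the new value is meaningful to the underlying tool.

```ruby
# Hypothetical helper: replace the value of one top-level setting in the
# text of a sequence configuration file. Raises KeyError if the key is absent.
def set_setting(text, key, value)
  pattern = /^#{Regexp.escape(key)}: .*$/
  raise KeyError, "no such setting: #{key}" unless text.match?(pattern)

  # Block form avoids special handling of backslashes in the replacement.
  text.sub(pattern) { "#{key}: #{value}" }
end

# A shortened sample based on the example above.
sample = <<~CONFIG
  --- !ruby/object:Sequence
  dataType: illumina1.3
  expansionNumber: 10
  maxKmerLength: 38
  minKmerLength: 7
CONFIG

updated = set_setting(sample, "expansionNumber", 5)
puts updated
```

Writing the result back with File.write on the sequence file would apply the change to the next execution; keeping a copy of the original file first is a sensible precaution.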

System Configuration

In addition, the amount of RAM, number of CPU cores, CPU architecture, operating system, default shell, and existence of a locate database are detected automatically on each execution but are not stored.
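
As a rough illustration of this kind of detection, the Ruby sketch below gathers most of the same facts using the standard library. It is an assumption-laden example and not the method PARSES uses: the system_summary and total_memory_gb names are hypothetical, memory detection is only attempted on Linux (via /proc/meminfo) and OS X (via sysctl), and the locate database check is omitted because its path varies between systems.

```ruby
require "etc"
require "rbconfig"

# Hypothetical helper: total physical memory in GB, or nil if it cannot be
# determined on this platform.
def total_memory_gb
  if File.readable?("/proc/meminfo")
    # Linux: MemTotal is reported in kB.
    kb = File.read("/proc/meminfo")[/^MemTotal:\s+(\d+)\s+kB/, 1]
    kb && (kb.to_i / 1024.0 / 1024.0).round(1)
  elsif RbConfig::CONFIG["host_os"] =~ /darwin/
    # OS X: hw.memsize is reported in bytes.
    (`sysctl -n hw.memsize`.to_i / 1024.0**3).round(1)
  end
end

# Hypothetical helper: a snapshot of the execution environment.
def system_summary
  {
    cores: Etc.nprocessors,
    architecture: RbConfig::CONFIG["host_cpu"],
    os: RbConfig::CONFIG["host_os"],
    shell: ENV["SHELL"],
    memory_gb: total_memory_gb
  }
end

system_summary.each { |name, value| puts "#{name}: #{value.inspect}" }
```

Values that cannot be determined are reported as nil, so a caller can fall back to conservative defaults.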