For more information, see the wiki.
PARSES is intended to be executed with Solexa data on a *NIX-based desktop computer. It may not be used for any sort of financial gain. Licenses for both MEGAN and Novoalign are strictly for non-profit research at non-profit institutions and academic usage.
PARSES's system requirements depend directly on the size of the data set being processed. A 64-bit Linux or OS X machine with at least 4 GB of memory is recommended. You must also have root privileges on the machine if you are installing software.
Execute all commands from the directory containing the data. Place all PARSES scripts in a single directory and mark them as executable. The latest versions of all programs are installed by default. If an error occurs during installation, a repository of the programs may be used instead, although it is not guaranteed to be up to date. The repository is activated by including the repo=true option. Installation is performed automatically during any execution, but it can also be performed manually by invoking any of the following installation commands:
sudo rake -f /rake/file/location install #installs and indexes all resources.
sudo rake -f /rake/file/location novoalignInstall
sudo rake -f /rake/file/location bowtieInstall
sudo rake -f /rake/file/location hgInstall
sudo rake -f /rake/file/location novoIndex
sudo rake -f /rake/file/location bowtieIndex
sudo rake -f /rake/file/location samtoolsInstall
sudo rake -f /rake/file/location tophatInstall
sudo rake -f /rake/file/location abyssInstall
sudo rake -f /rake/file/location blastInstall
sudo rake -f /rake/file/location ntInstall
sudo rake -f /rake/file/location meganInstall
sudo rake -f /rake/file/location parallelIteratorInstall
Example executions:
rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3
rake -f /rake/file/location seq=NameYouGiveToYourSequence
rake -f /rake/file/location repo=true install
rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3 localAlignContigs
rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourSequenceFileName.fastq type=illumina1.3 truncate=true localAlignContigs
rake -f /rake/file/location seq=NameYouGiveToYourSequence file=YourFileToProcess type=illumina1.3 truncate=true forcefile=true localAlignContigs
rake -f /rake/file/location -T
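The example invocations above share one shape: rake -f followed by the Rakefile location, a set of key=value options, and an optional task name. A minimal Ruby sketch of assembling such a command is shown here. The helper name parses_command is hypothetical and not part of PARSES; only option names that appear in the examples above are used.

```ruby
# Hypothetical helper (not part of PARSES): builds the argument list for a
# PARSES run from the options shown in the example executions.
def parses_command(rakefile, options = {}, task = nil)
  args = ['rake', '-f', rakefile]
  # Options are passed to rake as key=value pairs, e.g. seq=akata.
  options.each { |key, value| args << "#{key}=#{value}" }
  # The task name (e.g. localAlignContigs or install) comes last, if given.
  args << task if task
  args
end

cmd = parses_command('/rake/file/location',
                     { seq: 'akata', file: 's_4_sequence_Akata.txt',
                       type: 'illumina1.3', truncate: true },
                     'localAlignContigs')
puts cmd.join(' ')
```

Passing the array to system(*cmd) instead of joining it into a string avoids shell quoting problems when a file name contains spaces.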
datafile.fastq
datafile.fastq.novo
datafile.fastq.novo.NM.fasta
datafile.fastq.novo.NM.fasta.nospans
datafile.fastq.novo.NM.fasta.nospans.blast
datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma
datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa
datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast
datafile.fastq.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.megan.rma
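Each file name in the listing above is the previous name with one more suffix appended. The following Ruby sketch reproduces that chain for any input file name. The suffix list is read directly off the listing; the helper name expected_outputs is hypothetical and not part of PARSES.

```ruby
# Suffixes appended at each successive stage, taken from the listing above.
STAGE_SUFFIXES = %w[.novo .NM.fasta .nospans .blast .megan.rma
                    .kmerOptimized.fa .blast .megan.rma].freeze

# Hypothetical helper (not part of PARSES): returns the input file name
# followed by the name of each successive output file.
def expected_outputs(input)
  names = [input]
  STAGE_SUFFIXES.each { |suffix| names << names.last + suffix }
  names
end

puts expected_outputs('datafile.fastq')
```

A list like this can be used to check which files already exist in the data directory, for example with File.exist?.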
The primary way to view results is to open the RMA file produced by MEGAN, though a PDF is also produced with standard LCA parameters. To familiarize yourself with MEGAN, you can watch the MEGAN Introduction YouTube video.
Your results will resemble the following:
For a detailed example of how changing the LCA parameters affects the results, click the image below (or here for a PDF version).
Reporting is currently stored in logs.
A log file is produced in the data directory with the filename sequenceName.log. It logs every command executed and any errors returned, and it also contains report information on the results of the data analysis. Below is a list of all information logged at each step of PARSES.
Below is an example of a log file:
# Logfile created on Mon Nov 29 20:21:37 -0600 2010 by logger.rb/22285
I, [2010-12-27T14:54:16.761320 #94508] INFO -- : Begin run for seq=akata file=s_4_sequence_Akata.txt type=illumina1.3 task=-f
I, [2010-12-27T14:54:16.849927 #94508] INFO -- : Executing PARSES v0.30 in a 0 environment with 64GB of memory and 24 cores with a 64-bit architecture.
I, [2010-12-27T17:56:20.865114 #94508] INFO -- : tophat -p 24 --solexa1.3-quals --output-dir akata_tophat_out /usr/share/hgChrAll s_4_sequence_Akata.txt
I, [2010-12-27T17:56:20.865442 #94508] INFO -- : samtools view -h -o akata_tophat_out/accepted_hits.sam akata_tophat_out/accepted_hits.bam
I, [2010-12-27T17:56:20.865475 #94508] INFO -- : Xextractspans.pl akata_tophat_out/accepted_hits.sam
I, [2010-12-27T17:56:20.865506 #94508] INFO -- : Xfilterspans.pl s_4_sequence_Akata.txt.novo.NM.fasta akata_tophat_out/accepted_hits.sam.spans
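The entries in the sample above follow the default format of Ruby's Logger: a severity letter, a timestamp and process ID in brackets, the severity name, and the message. A minimal sketch of splitting such lines into fields follows. The helper name parse_log_line is hypothetical and not part of PARSES, and the pattern is written against the sample above rather than any format guarantee from PARSES.

```ruby
# Matches lines in the layout shown in the sample log, e.g.
#   I, [2010-12-27T14:54:16.761320 #94508] INFO -- : message
LOG_LINE = /\A[A-Z], \[(?<time>\S+)\s+#(?<pid>\d+)\]\s+(?<severity>[A-Z]+) -- [^:]*: (?<message>.*)\z/

# Hypothetical helper (not part of PARSES): returns a Hash of fields, or nil
# for lines that are not log entries (such as the "# Logfile created" header).
def parse_log_line(line)
  m = LOG_LINE.match(line.chomp)
  return nil unless m
  { time: m[:time], pid: m[:pid].to_i, severity: m[:severity], message: m[:message] }
end

sample = [
  '# Logfile created on Mon Nov 29 20:21:37 -0600 2010 by logger.rb/22285',
  'I, [2010-12-27T17:56:20.865114 #94508] INFO -- : tophat -p 24 --solexa1.3-quals --output-dir akata_tophat_out /usr/share/hgChrAll s_4_sequence_Akata.txt'
]
entries = sample.map { |line| parse_log_line(line) }.compact
entries.each { |e| puts "#{e[:time]} #{e[:severity]} #{e[:message]}" }
```

To process a real log, the sample array can be replaced with File.readlines('sequenceName.log').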
A configuration file is produced for PARSES in $HOME/.PARSES. It contains the paths to the human genome and NT databases as well as the paths to the Bowtie and TopHat indices. It has the following form:
--- !ruby/object:Settings
bowtieIndex: /usr/share/hgChrAll
humanGenomeDatabase: /usr/share
novoIndex: /usr/share/hgChrAll.ndx
ntDatabase: /usr/share/nt/nt
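The file above is YAML carrying a Ruby object tag, so it can be inspected outside PARSES. The sketch below reads it into a plain Hash. The helper name load_parses_config is hypothetical and not part of PARSES; stripping the tag before parsing is a choice made here so that the Settings class is not required, and it is not how PARSES itself loads the file.

```ruby
require 'yaml'

# Hypothetical helper (not part of PARSES): parses configuration text into a
# plain Hash. The leading "!ruby/object:..." tag is dropped so that
# YAML.safe_load can read the file without the original Ruby class.
def load_parses_config(text)
  YAML.safe_load(text.sub(/\A---\s*!ruby\/object:\S+/, '---'))
end

sample = <<~CONFIG
  --- !ruby/object:Settings
  bowtieIndex: /usr/share/hgChrAll
  humanGenomeDatabase: /usr/share
  novoIndex: /usr/share/hgChrAll.ndx
  ntDatabase: /usr/share/nt/nt
CONFIG

settings = load_parses_config(sample)
settings.each { |key, value| puts "#{key} => #{value}" }
```

For the real file, the text would come from File.read(File.join(Dir.home, '.PARSES')).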
In addition, configuration files are produced for each sequence in $(pwd)/.sequenceName. Settings for each program are chosen by default but can be changed via the sequence file, which has the following form:
--- !ruby/object:Sequence
abyssPath: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma
abyssPathGlob: reads-*.fasta
blast1Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans
blast2Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa
blastOutputFormat: 0
blastPathGlob: reads-*.fasta.*.kmer.contigs.fa.kmerOptimized.fa
dataType: illumina1.3
eValue1: 1.0e-06
eValue2: 100
expansionNumber: 10
filePath: s_4_sequence_Akata.txt
imageFileType: jpg
maxKmerLength: 38
maxMatches: 0
megan1Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast
megan2Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast
minKmerLength: 7
minScoreByLength: 0
minSupport: 5
novoalignPath: s_4_sequence_Akata.txt.novo
pipeEndPath: s_4_sequence_Akata.txt.novo.NM.fasta.nospans.blast.megan.rma.kmerOptimized.fa.blast.megan.rma
readLength: 38
removeNonMappedPath: s_4_sequence_Akata.txt.novo.NM.fasta
topPercent: 10.0
useCogs: "false"
useGos: "false"
winScore: 0.0
It is recommended that the path variables not be adjusted. All other settings are straightforward except expansionNumber, which is the number of times the expansion command is executed when generating a picture from MEGAN.
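As an illustration of changing one setting while leaving the path variables alone, the sketch below rewrites a single key: value line in the sequence file text. The helper name update_setting is hypothetical and not part of PARSES, and treating every key that contains "Path" as a path variable is an assumption based on the names in the listing above.

```ruby
# Hypothetical helper (not part of PARSES): returns the sequence-file text
# with one setting changed. Keys containing "Path" are refused, following the
# recommendation not to adjust path variables.
def update_setting(text, key, value)
  raise ArgumentError, "refusing to change path variable #{key}" if key.include?('Path')
  pattern = /^#{Regexp.escape(key)}:.*$/
  raise KeyError, "setting #{key} not found" unless text =~ pattern
  # The block form of sub inserts the replacement literally.
  text.sub(pattern) { "#{key}: #{value}" }
end

sample = <<~SEQUENCE
  --- !ruby/object:Sequence
  blast1Path: s_4_sequence_Akata.txt.novo.NM.fasta.nospans
  expansionNumber: 10
  minSupport: 5
SEQUENCE

puts update_setting(sample, 'expansionNumber', 20)
```

Because only the matching line is replaced, the !ruby/object:Sequence tag and all other settings are left byte-for-byte unchanged.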
In addition, the amount of RAM, number of CPU cores, CPU architecture, operating system, default shell, and existence of a locate database are detected automatically on each execution but are not stored.
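For readers curious how such facts can be gathered, the sketch below collects most of them with the Ruby standard library. It is not the PARSES implementation: the helper name system_facts is hypothetical, the memory lookup assumes /proc/meminfo on Linux or sysctl on OS X, and the locate database check is omitted.

```ruby
require 'etc'
require 'rbconfig'

# Hypothetical sketch (not the PARSES implementation): gathers system facts
# similar to those described above.
def system_facts
  host_os = RbConfig::CONFIG['host_os']
  memory_bytes =
    if File.readable?('/proc/meminfo')                 # Linux
      kb = File.read('/proc/meminfo')[/^MemTotal:\s+(\d+)/, 1]
      kb && kb.to_i * 1024
    elsif host_os =~ /darwin/                          # OS X
      `sysctl -n hw.memsize`.to_i
    end
  {
    memory_gb: memory_bytes && (memory_bytes / 1024.0**3).round(1), # nil if unknown
    cores: Etc.nprocessors,     # number of available processors
    bits: 1.size * 8,           # size of a machine word in bits
    os: host_os,
    shell: ENV['SHELL']         # login shell, if set in the environment
  }
end

facts = system_facts
facts.each { |name, value| puts "#{name}: #{value}" }
```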