VDBWRAIR / pathdiscov

Pathogen Discovery Pipeline

Feature Request Version 4.2: SNAP integration #182

Closed mmelendrez closed 9 years ago

mmelendrez commented 9 years ago

Dr. Gustavo Palacios asked about our ability to integrate SNAP into version 4.2 along with DIAMOND. I don't recall the details of that conversation, or even whether we had one with Michael and Jason about it, so we'll need to clarify with them the rationale for including or excluding this program in the pipeline before moving forward with this request. We can approach this after the Pathosphere release. I am just documenting the communication here.

demis001 commented 9 years ago

We talked about DIAMOND, and I am working on the DIAMOND implementation as a replacement for blastx. I am currently testing it.

mmelendrez commented 9 years ago

Yes, I remember DIAMOND being decided on for v4.2, but I don't remember SNAP. We'll clarify this in a future telecon.

bpfeffer commented 9 years ago

That was touched on briefly during the same telecon as DIAMOND. SNAP is a host-removal tool that is reported to be much faster than bowtie2, so the thought was it could be used instead to speed up that step.
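
For anyone following along, "host removal" here just means mapping the reads against the host genome and keeping only the pairs where neither mate aligned. A rough sketch of that step with bowtie2 + samtools (not the pipeline's actual code; the index path and output names are placeholders):

bowtie2 -p 12 -x /path/to/host_index -1 R1.fastq -2 R2.fastq \
    | samtools view -bS -f 12 -F 256 - \
    | samtools fastq -1 host_removed_R1.fastq -2 host_removed_R2.fastq -

The -f 12 flag keeps only pairs where both the read and its mate are unmapped, and -F 256 drops secondary alignments; swapping SNAP in for bowtie2 would only change the first command.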

mmelendrez commented 9 years ago

Ah, thanks Brad - now I do recall. Yes, we can discuss this for v4.2 as well. It's in as a feature request so Dereje and Tyghe can look into it.

demis001 commented 9 years ago
Too few parameters
Usage: 
snap paired <index-dir> <inputFile(s)> [<options>] where <input file(s)> is a list of files to process.

Options:
  -o   filename  output alignments to filename in SAM or BAM format, depending on the file extension or
       explicit type specifier (see below).  Use a dash with an explicit type specifier to write to
       stdout, so for example -o -sam - would write SAM output to stdout
  -d   maximum edit distance allowed per read or pair (default: 15)
  -n   number of seeds to use per read
  -sc  Seed coverage (i.e., readSize/seedSize).  Floating point.  Exclusive with -n.  (default: 0.000000)
  -h   maximum hits to consider per seed (default: 300)
  -ms  minimum seed matches per location (default: 1)
  -c   Deprecated parameter; this is ignored.  Consumes one extra arg.
  -a   Deprecated parameter; this is ignored.  Consumes one extra arg.
  -t   number of threads (default is one per core)
  -b   bind each thread to its processor (this is the default)
 --b   Don't bind each thread to its processor (note the double dash)
  -P   disables cache prefetching in the genome; may be helpful for machines
       with small caches or lots of cores/cache
  -so  sort output file by alignment location
  -sm  memory to use for sorting in Gb
  -x   explore some hits of overly popular seeds (useful for filtering)
  -f   stop on first match within edit distance limit (filtering mode)
  -F   filter output (a=aligned only, s=single hit only, u=unaligned only)
  -S   suppress additional processing (sorted BAM output only)
       i=index, d=duplicate marking
  -I   ignore IDs that don't match in the paired-end aligner
  -Cxx must be followed by two + or - symbols saying whether to clip low-quality
       bases from front and back of read respectively; default: back only (-C-+)
  -M   indicates that CIGAR strings in the generated SAM file should use M (alignment
       match) rather than = and X (sequence (mis-)match).  This is the default
  -=   use the new style CIGAR strings with = and X rather than M.  The opposite of -M
  -G   specify a gap penalty to use when generating CIGAR strings
  -pf  specify the name of a file to contain the run speed
  --hp Indicates not to use huge pages (this may speed up index load and slow down alignment)  This is the default
  -hp  Indicates to use huge pages (this may speed up alignment and slow down index load).
  -D   Specifies the extra search depth (the edit distance beyond the best hit that SNAP uses to compute MAPQ).  Default 2
  -rg  Specify the default read group if it is not specified in the input file
  -R   Specify the entire read group line for the SAM/BAM output.  This must include an ID tag.  If it doesn't start with
       '@RG' SNAP will add that.  Specify tabs by \t.  Two backslashes will generate a single backslash.
       backslash followed by anything else is illegal.  So, '-R @RG\tID:foo\tDS:my data' would generate reads
        with default tag foo, and an @RG line that also included the DS:my data field.
  -sa  Include reads from SAM or BAM files with the secondary (0x100) or supplementary (0x800) flag set; default is to drop them.
  -om  Output multiple alignments.  Takes as a parameter the maximum extra edit distance relative to the best alignment
       to allow for secondary alignments
 -omax Limit the number of alignments per read generated by -om.
  -pc  Preserve the soft clipping for reads coming from SAM or BAM files
  -xf  Increase expansion factor for BAM and GZ files (default 1.0)
  -hdp Use Hadoop-style prefixes (reporter:status:...) on error messages, and emit hadoop-style progress messages
  -mrl Specify the minimum read length to align, reads shorter than this (after clipping) stay unaligned.  This should be
       a good bit bigger than the seed length or you might get some questionable alignments.  Default 50
  -map Use file mapping to load the index rather than reading it.  This might speed up index loading in cases
        where SNAP is run repeatedly on the same index, and the index is larger than half of the memory size
       of the machine.  On some operating systems, loading an index with -map is much slower than without if the
       index is not in memory.  You might consider adding -pre to prefetch the index into system cache when loading
       with -map when you don't expect the index to be in cache.
  -pre Prefetch the index into system cache.  This is only meaningful with -map, and only helps if the index is not
       already in memory and your operating system is slow at reading mapped files (i.e., some versions of Linux,
       but not Windows).
  -nu  No Ukkonen: don't reduce edit distance search based on prior candidates. This option is purely for
        evaluating the performance effect of using Ukkonen's algorithm rather than Smith-Waterman, and specifying
       it will slow down execution without improving the alignments.
  -no  No Ordering: don't order the evaluation of reads so as to select more likely candidates first.  This option
       is purely for evaluating the performance effect of the read evaluation order, and specifying it will slow
       down execution without improving alignments.
  -nt  Don't truncate searches based on missed seed hits.  This option is purely for evaluating the performance effect
       of candidate truncation, and specifying it will slow down execution without improving alignments.

You may process more than one alignment without restarting SNAP, and if possible without reloading
the index.  In order to do this, list on the command line all of the parameters for the first
alignment, followed by a comma (separated by a space from the other parameters) followed by the
parameters for the next alignment (including single or paired).  You may have as many of these
as you please.  If two consecutive alignments use the same index, it will not be reloaded.
So, for example, you could do 'snap single hg19-20 foo.fq -o foo.sam , paired hg19-20 end1.fq end2.fq -o paired.sam'
and it would not reload the index between the single and paired alignments.
When specifying an input or output file, you can simply list the filename, in which case
SNAP will infer the type of the file from the file extension (.sam or .bam for example),
or you can explicitly specify the file type by preceding the filename with one of the
 following type specifiers (which are case sensitive):
    -fastq
    -compressedFastq
    -sam
    -bam
    -pairedFastq
    -pairedCompressedFastq
    -pairedInterleavedFastq
    -pairedCompressedInterleavedFastq

So, for example, you could specify -bam input.file to make SNAP treat input.file as a BAM file,
even though it would ordinarily assume a FASTQ file for input or a SAM file for output when it
doesn't recognize the file extension.
In order to use a file name that begins with a '-' and not have SNAP treat it as a switch, you must
explicitly specify the type.  But really, that's just confusing and you shouldn't do it.
Input and output may also be from/to stdin/stdout. To do that, use a - for the input or output file
name and give an explicit type specifier.  So, for example, 
snap single myIndex -fastq - -o -sam -
would read FASTQ from stdin and write SAM to stdout.

  -s   min and max spacing to allow between paired ends (default: 50 1000).
  -fs  force spacing to lie between min and max.
  -H   max hits for intersecting aligner (default: 2000).
  -mcp specifies the maximum candidate pool size (An internal data structure. 
       Only increase this if you get an error message saying to do so. If you're running
       out of memory, you may want to reduce it.  Default: 1000000)
  -F b additional option to -F to require both mates to satisfy filter (default is just one)
  -ku  Keep unpaired-looking reads in SAM/BAM input.  Ordinarily, if a read doesn't specify
       mate information (RNEXT field is * and/or PNEXT is 0) then the code that matches reads will immediately
       discard it.  Specifying this flag may cause large memory usage for some input files,
       but may be necessary for some strangely formatted input files.  You'll also need to specify this
       flag for SAM/BAM files that were aligned by a single-end aligner.
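
For reference, based on the usage above, a host-removal style run would look something like this (the index directory and file names are placeholders, and this is untested):

snap paired /path/to/host-snap-index R1.fastq R2.fastq -o filtered.sam -t 12 -d 15 -F u

Here -F u writes only the unaligned (i.e. non-host) reads, -t sets the thread count, and -d is the maximum edit distance per read or pair.
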
necrolyte2 commented 9 years ago

16GB RAM

$> snap paired /media/VD_Research/databases/humandna/hg38 testData/F.fastq testData/R.fastq -o out.sam
Welcome to SNAP version 1.0beta.17.

Loading index from directory... mmap: Cannot allocate memory
SNAP exited with exit code 1 from line 345 of file SNAPLib/BigAlloc.cpp

256GB RAM

snap paired /media/VD_Research/databases/humandna/hg38 testData/F.fastq testData/R.fastq -o out.sam
Welcome to SNAP version 1.0beta.17.

Loading index from directory... 337s.  3092848410 bases, seed size 20
MaxHits MaxDist %Used   %Unique %Multi  %!Found %Pairs  lvCalls NumReads    Reads/s
FASTQ using supplier queue
sched_setaffinity: Invalid argument
(repeated 23 times)
300 15  93.00%  0.43%   3.66%   97.42%  2.58%   99743   500 1120 (at: 415)

You can see that it took 337 seconds just to load the index once it finally had enough RAM to work with. I reran it and it only took 19 seconds to load (because Linux caches the reads).

The first time I ran bowtie2 it only took 30 seconds to load and align; the second time it took 2.5 seconds.
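
If we want a genuinely cold-cache number, one option (assuming root/sudo on the test box) is to drop the Linux page cache between runs, e.g.:

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes
time snap paired /media/VD_Research/databases/humandna/hg38 testData/F.fastq testData/R.fastq -o out.sam

Otherwise the second run is always measuring the cached case.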

$> time bowtie2 -q -x /media/VD_Research/databases/humandna/hg38 -1 testData/F.fastq -2 testData/R.fastq -S out.sam --local
250 reads; of these:
  250 (100.00%) were paired; of these:
    199 (79.60%) aligned concordantly 0 times
    1 (0.40%) aligned concordantly exactly 1 time
    50 (20.00%) aligned concordantly >1 times
    ----
    199 pairs aligned concordantly 0 times; of these:
      0 (0.00%) aligned discordantly 1 time
    ----
    199 pairs aligned 0 times concordantly or discordantly; of these:
      398 mates make up the pairs; of these:
        390 (97.99%) aligned 0 times
        1 (0.25%) aligned exactly 1 time
        7 (1.76%) aligned >1 times
22.00% overall alignment rate

real    0m30.958s
user    0m1.221s
sys 0m2.154s

My initial feeling is that, at this point, SNAP is not worth the trouble, as just loading the index takes longer than bowtie2 takes to run (at least on this test data set).

demis001 commented 9 years ago

I agree! I don't believe it runs faster than bowtie2. I will run some tests on the same data tomorrow as well.

necrolyte2 commented 9 years ago

@demis001 if you want to request a session on biocuda so you have more memory to work with, you can use this:

qsub -l nodes=1:biocuda,walltime=01:00:00 -I
bpfeffer commented 9 years ago

How big are the sample fastqs you guys are using to compare these? It may be a situation where throwing a larger number of reads at them would give a better comparison, since most of these tools don't scale linearly. We usually have files of 500,000+ reads (sometimes in the tens of millions).

necrolyte2 commented 9 years ago

Very small files. Tomorrow we will test with larger files. Regardless, a note will be placed in the documentation that you cannot even use SNAP unless you have at least 60GB of memory.
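
As a quick sanity check before launching SNAP, it's worth comparing the on-disk size of the index directory against available memory, since the whole index has to fit in RAM. Using the paths from the earlier runs:

du -sh /media/VD_Research/databases/humandna/hg38    # on-disk size of the SNAP index
free -g                                              # available memory in GB

If the index is bigger than (or close to) free memory, expect the mmap failure shown above.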

demis001 commented 9 years ago

Test data:

874M    7aS1058_S23_L001_R1_001.fastq
1.1G    7aS1058_S23_L001_R2_001.fastq

SNAP uses one thread per core by default (-t 12 on my current system).

time snap paired  ~/databases/humandna/hg38 7aS1058_S23_L001_R1_001.fastq 7aS1058_S23_L001_R2_001.fastq -o out_snap.sam
Welcome to SNAP version 1.0beta.17.

Loading index from directory... 304s.  3092848410 bases, seed size 20
MaxHits MaxDist %Used   %Unique %Multi  %!Found %Pairs  lvCalls NumReads    Reads/s
FASTQ using supplier queue

300 15  99.58%  0.14%   0.01%   100.27% 0.10%   4896351 3513558 23638 (at: 148024)

real    7m35.451s
user    1m18.175s
sys 1m36.969s
demis001 commented 9 years ago

Bowtie2 test; I passed -p 12 to tell bowtie2 to use 12 cores, similar to SNAP.

time bowtie2 -p 12 -q -x ~/databases/humandna/hg38 -1 7aS1058_S23_L001_R1_001.fastq -2 7aS1058_S23_L001_R2_001.fastq -S out_bowtie2.sam
1756779 reads; of these:
  1756779 (100.00%) were paired; of these:
    1756191 (99.97%) aligned concordantly 0 times
    491 (0.03%) aligned concordantly exactly 1 time
    97 (0.01%) aligned concordantly >1 times
    ----
    1756191 pairs aligned concordantly 0 times; of these:
      1424 (0.08%) aligned discordantly 1 time
    ----
    1754767 pairs aligned 0 times concordantly or discordantly; of these:
      3509534 mates make up the pairs; of these:
        3508336 (99.97%) aligned 0 times
        429 (0.01%) aligned exactly 1 time
        769 (0.02%) aligned >1 times
0.15% overall alignment rate

real    1m24.398s
user    4m27.328s
sys 2m52.841s
demis001 commented 9 years ago

Conclusions:

As you can see, bowtie2 is much faster than SNAP on this test data: bowtie2 = 1m24.398s real time and SNAP = 7m35.451s real time.

SNAP is a hash-table-based aligner. I don't see how a hash-based algorithm could be 3-20x faster than an FM-index (Burrows-Wheeler) aligner like bowtie2; generally, hash-table-based aligners are slower than Burrows-Wheeler-based aligners.

Recommendation: I suggest we hold off on implementing this until our next telecon.

necrolyte2 commented 9 years ago

SNAP is already integrated into host_map for v4.2-dev. We just need to merge pull request #204.

demis001 commented 9 years ago

Thanks, I hadn't seen it. You saved me time.

necrolyte2 commented 9 years ago

If you want to play around with just host_map, you can have the pipeline run it easily with this script. I created bowtie2_param.txt and snap_param.txt param files that contain only the host_map section. You can then change the R1 and R2 paths manually (I didn't get to the point of making R1 and R2 arguments).

#!/bin/bash

# Run only the host_map stage of the pipeline with the given aligner.
# Usage: pass "bowtie2" or "snap" as the first argument; the matching
# <aligner>_param.txt file (containing only the host_map section) is used.
aligner=$1
param="${aligner}_param.txt"
outdir="${aligner}_hostmap_only"

rm -rf "$outdir"
mkdir "$outdir"
pathogen.pl --sample "$aligner" --outputdir "$outdir" \
            --command host_map --paramfile "$param" \
            --R1 testData/F.fastq --R2 testData/R.fastq
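
Assuming the script above is saved as something like run_hostmap.sh (the name is arbitrary), you would run it as:

bash run_hostmap.sh bowtie2
bash run_hostmap.sh snap
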
necrolyte2 commented 9 years ago

So here are some results from some more rudimentary tests. The input read files are 381,216 reads each, for a total of 762,432 reads.

Indexed hg38 for BWA, Bowtie2 and Snap

I ran each aligner and then ran it again right away just to check the difference in index load times. These tests were done by running the aligner by itself (not through the pipeline).

All times listed are in seconds (first run, second run):

BWA: 57.48, 41.73
Snap: 42.66, 54.83
Bowtie2: 32.89, 31.84

Additional information about building the indexes: they were built using the instructions listed in the documentation for the pipeline, with modifications so that bwa was also used to index.

BWA took 118 minutes to index
Snap took 24 minutes to index
Bowtie2 took 193 minutes to index
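
For reference, the index builds were along these lines (a sketch, not the exact commands from the docs; hg38.fa and the output names are placeholders):

bwa index hg38.fa
bowtie2-build hg38.fa hg38
snap index hg38.fa hg38-snap-index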

I have a PBS submission script that can easily run any data set, so I would be willing to run a larger dataset if we could somehow get one from you ( @bpfeffer ). Additionally, I can hunt around for one of our 24-sample MiSeq runs, which would have more data in each sample as well.
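
Roughly what that submission script looks like (resource requests and paths here are illustrative, not the exact script):

#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l walltime=04:00:00
#PBS -N snap_vs_bowtie2

cd $PBS_O_WORKDIR
time snap paired /path/to/hg38-snap-index R1.fastq R2.fastq -o out_snap.sam -t 12
time bowtie2 -p 12 -q -x /path/to/hg38 -1 R1.fastq -2 R2.fastq -S out_bowtie2.sam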

As it stands, bowtie2 is still faster than Snap for a 96-sample MiSeq run.

bpfeffer commented 9 years ago

It does seem like SNAP isn't the speedup for this pipeline that we were hoping for, due to the time it takes to load the massive index into memory each time. In a situation where the index is memory-resident (such as cached from a previous run), it would theoretically be useful, but that's probably a very situational case.

While there do seem to be some parameters that can be passed to SNAP to speed up performance at a minor cost to sensitivity (things SURPI uses with SNAP, like -f to stop at the first hit), the real time bottleneck was BLAST, and that should be drastically improved with the implementation of DIAMOND. The index build time is fairly impressive for SNAP, at least; if you're doing a lot of different indexing, the speedup there alone would be worth it for other pipelines or projects, just not given the one-time nature of indexing in this pipeline.

demis001 commented 9 years ago

Indexing a genome is a one-time thing, so a performance difference there doesn't matter much.

necrolyte2 commented 9 years ago

I just re-ran snap using -f

Snap: 33.11, 39.62
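
For reference, that was just the earlier SNAP command with filtering mode added, something along the lines of (paths abbreviated):

time snap paired /path/to/hg38-snap-index R1.fastq R2.fastq -o out_snap.sam -t 12 -f

Per the usage text, -f stops on the first match within the edit distance limit, which should be fine for host filtering since we only care whether a read maps at all.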

It did get quite a bit closer to bowtie2 with that option; I'm not sure how it affects the output. I also realized that running a second time isn't really giving us a great indication of cached vs. uncached performance, as the job scheduler on the HPC was allocating the same node over and over, so the index was already cached.

So the true cached vs. uncached numbers for Snap are:

Snap: 67.88, 41.80

necrolyte2 commented 9 years ago

So I guess, regardless of whether Snap is faster or not in these tests, it is still an option that can be used. The code is there now to support Snap; we will just put a note in the documentation saying that bowtie2 is probably faster for most cases unless your dataset is very large (but how large is "very large"?).

bpfeffer commented 9 years ago

Tabling Snap right now seems to be the best option, based on all the testing you guys have done. We'll do a bit more research into their extraordinary claims of speedup and try to figure out where they are getting their numbers. Thanks!