amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
287 stars 66 forks source link

Odd output when building index from zcat command substitution #79

Closed taltman closed 3 years ago

taltman commented 7 years ago

To avoid decompressing the large nt.gz resource (~34 GB) from NCBI, I used Bash command substitution to put 'zcat nt.gz' instead of a FASTA file name. The problem is that the resulting database is 320K, which is surely an error. Command line and output below:

time nice -n 10 snap-aligner index <(zcat /dev/shm/taltman/nt.gz) /dev/shm/taltman/snap_db \
                                                -t60 \
                                                -B\| \
                                                -large \

Welcome to SNAP version 1.0beta.23.

Hash table slack 0.300000
Loading FASTA file '/dev/fd/63' into memory...1805s
Saving genome...0s
Computing bias table.
Bias computation: 0 / 500
Computed bias table in 0s
Allocating memory for hash tables...0s
Building hash tables.
0(0%) seeds occur more than once, total of 0(0%) genome locations are not unique, 0(0%) bad seeds, 0 both complements used 479 no string
Hash table build took 12s
Building overflow table.
Overflow table build and hash table save took 0s
Saving overflow table...0s
Index build and save took 1817s (0 bases/s)

real    30m17.463s
user    23m48.651s
sys     9m30.990s

taltman$ ls -lh /dev/shm/taltman/snap_db/
total 320K
-rw-r--r-- 1 taltman taltman  506 Jan 28 14:33 Genome
-rw-r--r-- 1 taltman taltman   29 Jan 28 14:33 GenomeIndex
-rw-r--r-- 1 taltman taltman 309K Jan 28 14:33 GenomeIndexHash
-rw-r--r-- 1 taltman taltman    0 Jan 28 14:33 OverflowTable
bolosky commented 7 years ago

Snap reads the fasta twice, which is probably why this doesn't work.

Sent from my Windows Phone


From: Tomer Altmanmailto:notifications@github.com Sent: ‎1/‎28/‎2017 2:48 PM To: amplab/snapmailto:snap@noreply.github.com Cc: Subscribedmailto:subscribed@noreply.github.com Subject: [amplab/snap] Odd output when building index from zcat command substitution (#79)

To avoid decompressing the large nt.gz resource (~34 GB) from NCBI, I used Bash command substitution to put 'zcat nt.gz' instead of a FASTA file name. The problem is that the resulting database is 320K, which is surely an error. Command line and output below:

time nice -n 10 snap-aligner index <(zcat /dev/shm/taltman/nt.gz) /dev/shm/taltman/snap_db \ -t60 \ -B| \ -large \

Welcome to SNAP version 1.0beta.23.

Hash table slack 0.300000 Loading FASTA file '/dev/fd/63' into memory...1805s Saving genome...0s Computing bias table. Bias computation: 0 / 500 Computed bias table in 0s Allocating memory for hash tables...0s Building hash tables. 0(0%) seeds occur more than once, total of 0(0%) genome locations are not unique, 0(0%) bad seeds, 0 both complements used 479 no string Hash table build took 12s Building overflow table. Overflow table build and hash table save took 0s Saving overflow table...0s Index build and save took 1817s (0 bases/s)

real 30m17.463s user 23m48.651s sys 9m30.990s

taltman$ ls -lh /dev/shm/taltman/snap_db/ total 320K -rw-r--r-- 1 taltman1 microbio 506 Jan 28 14:33 Genome -rw-r--r-- 1 taltman1 microbio 29 Jan 28 14:33 GenomeIndex -rw-r--r-- 1 taltman1 microbio 309K Jan 28 14:33 GenomeIndexHash -rw-r--r-- 1 taltman1 microbio 0 Jan 28 14:33 OverflowTable

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F79&data=02%7C01%7Cbolosky%40microsoft.com%7Cfa2c8fc47b1a42ae254508d447cfbca5%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636212404926016233&sdata=K5%2Fhz%2B9r4ECRnEhIGvAxt7brOIvOW5fg5hhk3NnXVr0%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752fLgltiHGtrZq6a9OnujxBPrsC7xks5rW8WogaJpZM4LwpMC&data=02%7C01%7Cbolosky%40microsoft.com%7Cfa2c8fc47b1a42ae254508d447cfbca5%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636212404926016233&sdata=%2F%2Fn5lqfKxSYFVvzzmyG4cEGe0QRiizdKBks1oQoQZDY%3D&reserved=0.

taltman commented 7 years ago

Thanks for your fast reply!

I suspected something like that was happening, where streaming input wouldn't suffice. I'd recommend putting some smarts into SNAP to detect file descriptors, and inform the user that the trick won't work. At the very least, some documentation about this restriction when running 'snap-aligner index --help' would be great. Thanks!

bolosky commented 7 years ago

I should just fix it properly and only read the file once. It was a shortcut to do what I did, but it’s inelegant, makes it impossible to read from streams like you tried, and is slow.

From: Tomer Altman [mailto:notifications@github.com] Sent: Saturday, January 28, 2017 3:07 PM To: amplab/snap snap@noreply.github.com Cc: Bill Bolosky bolosky@microsoft.com; Comment comment@noreply.github.com Subject: Re: [amplab/snap] Odd output when building index from zcat command substitution (#79)

Thanks for your fast reply!

I suspected something like that was happening, where streaming input wouldn't suffice. I'd recommend putting some smarts into SNAP to detect file descriptors, and inform the user that the trick won't work. At the very least, some documentation about this restriction when running 'snap-aligner index --help' would be great. Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F79%23issuecomment-275881240&data=02%7C01%7Cbolosky%40microsoft.com%7C9cf0bab5fcbf4712587608d447d25da6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636212416221542013&sdata=l6sgScPiSF3TyItOuWl3MdEGKCiNl1WOGl050efx4%2Bg%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752WLM7qKCdCjadR0sOvrUSlM-Tdb2ks5rW8oSgaJpZM4LwpMC&data=02%7C01%7Cbolosky%40microsoft.com%7C9cf0bab5fcbf4712587608d447d25da6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636212416221542013&sdata=naWfqEQkTcDbSoOStfjS%2Fgl916mqUl6yupdMLyVsIzw%3D&reserved=0.

andremrsantos commented 6 years ago

I found a similar issue when mapping with bzcat substituition. When running the following:

snap-aligner paired ~/rsc/hg38_snap -fastq \
    <(bzcat ../raw/ERR019684_1.fastq.bz2) \
    <(bzcat ERR019684_1.fastq.bz2) \
    -o ERR019684.bam

I got the following result:

Welcome to SNAP version 1.0beta.18.

Loading index from directory... 1101s.  3219030417 bases, seed size 20
Aligning.
Total Reads    Aligned, MAPQ >= 10    Aligned, MAPQ < 10     Unaligned              Too Short/Too Many Ns  %Pairs   Reads/s   Time in Aligner (s)
0              0 (-nan%)              0 (-nan%)              0 (-nan%)              0 (0.00%)              -nan%%   0         0

making look like it didn't read any read. I think its a similar issue on the processing.

bolosky commented 3 years ago

This is fixed in 1.0. It not only reads the index only once (which speeds up index build), but it also avoids calling GetFileSize, which doesn't do the right thing on a pipe.