Closed taltman closed 3 years ago
Snap reads the fasta twice, which is probably why this doesn't work.
Sent from my Windows Phone
From: Tomer Altmanmailto:notifications@github.com Sent: 1/28/2017 2:48 PM To: amplab/snapmailto:snap@noreply.github.com Cc: Subscribedmailto:subscribed@noreply.github.com Subject: [amplab/snap] Odd output when building index from zcat command substitution (#79)
To avoid decompressing the large nt.gz resource (~34 GB) from NCBI, I used Bash command substitution to put 'zcat nt.gz' instead of a FASTA file name. The problem is that the resulting database is 320K, which is surely an error. Command line and output below:
time nice -n 10 snap-aligner index <(zcat /dev/shm/taltman/nt.gz) /dev/shm/taltman/snap_db \ -t60 \ -B| \ -large \
Welcome to SNAP version 1.0beta.23.
Hash table slack 0.300000 Loading FASTA file '/dev/fd/63' into memory...1805s Saving genome...0s Computing bias table. Bias computation: 0 / 500 Computed bias table in 0s Allocating memory for hash tables...0s Building hash tables. 0(0%) seeds occur more than once, total of 0(0%) genome locations are not unique, 0(0%) bad seeds, 0 both complements used 479 no string Hash table build took 12s Building overflow table. Overflow table build and hash table save took 0s Saving overflow table...0s Index build and save took 1817s (0 bases/s)
real 30m17.463s user 23m48.651s sys 9m30.990s
taltman$ ls -lh /dev/shm/taltman/snap_db/ total 320K -rw-r--r-- 1 taltman1 microbio 506 Jan 28 14:33 Genome -rw-r--r-- 1 taltman1 microbio 29 Jan 28 14:33 GenomeIndex -rw-r--r-- 1 taltman1 microbio 309K Jan 28 14:33 GenomeIndexHash -rw-r--r-- 1 taltman1 microbio 0 Jan 28 14:33 OverflowTable
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F79&data=02%7C01%7Cbolosky%40microsoft.com%7Cfa2c8fc47b1a42ae254508d447cfbca5%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636212404926016233&sdata=K5%2Fhz%2B9r4ECRnEhIGvAxt7brOIvOW5fg5hhk3NnXVr0%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752fLgltiHGtrZq6a9OnujxBPrsC7xks5rW8WogaJpZM4LwpMC&data=02%7C01%7Cbolosky%40microsoft.com%7Cfa2c8fc47b1a42ae254508d447cfbca5%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636212404926016233&sdata=%2F%2Fn5lqfKxSYFVvzzmyG4cEGe0QRiizdKBks1oQoQZDY%3D&reserved=0.
Thanks for your fast reply!
I suspected something like that was happening, where streaming input wouldn't suffice. I'd recommend putting some smarts into SNAP to detect file descriptors, and inform the user that the trick won't work. At the very least, some documentation about this restriction when running 'snap-aligner index --help' would be great. Thanks!
I should just fix it properly and only read the file once. It was a shortcut to do what I did, but it’s inelegant, makes it impossible to read from streams like you tried, and is slow.
From: Tomer Altman [mailto:notifications@github.com] Sent: Saturday, January 28, 2017 3:07 PM To: amplab/snap snap@noreply.github.com Cc: Bill Bolosky bolosky@microsoft.com; Comment comment@noreply.github.com Subject: Re: [amplab/snap] Odd output when building index from zcat command substitution (#79)
Thanks for your fast reply!
I suspected something like that was happening, where streaming input wouldn't suffice. I'd recommend putting some smarts into SNAP to detect file descriptors, and inform the user that the trick won't work. At the very least, some documentation about this restriction when running 'snap-aligner index --help' would be great. Thanks!
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F79%23issuecomment-275881240&data=02%7C01%7Cbolosky%40microsoft.com%7C9cf0bab5fcbf4712587608d447d25da6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636212416221542013&sdata=l6sgScPiSF3TyItOuWl3MdEGKCiNl1WOGl050efx4%2Bg%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752WLM7qKCdCjadR0sOvrUSlM-Tdb2ks5rW8oSgaJpZM4LwpMC&data=02%7C01%7Cbolosky%40microsoft.com%7C9cf0bab5fcbf4712587608d447d25da6%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636212416221542013&sdata=naWfqEQkTcDbSoOStfjS%2Fgl916mqUl6yupdMLyVsIzw%3D&reserved=0.
I found a similar issue when mapping with bzcat substituition. When running the following:
snap-aligner paired ~/rsc/hg38_snap -fastq \
<(bzcat ../raw/ERR019684_1.fastq.bz2) \
<(bzcat ERR019684_1.fastq.bz2) \
-o ERR019684.bam
I got the following result:
Welcome to SNAP version 1.0beta.18.
Loading index from directory... 1101s. 3219030417 bases, seed size 20
Aligning.
Total Reads Aligned, MAPQ >= 10 Aligned, MAPQ < 10 Unaligned Too Short/Too Many Ns %Pairs Reads/s Time in Aligner (s)
0 0 (-nan%) 0 (-nan%) 0 (-nan%) 0 (0.00%) -nan%% 0 0
making look like it didn't read any read. I think its a similar issue on the processing.
This is fixed in 1.0. It not only reads the index only once (which speeds up index build), but it also avoids calling GetFileSize, which doesn't do the right thing on a pipe.
To avoid decompressing the large nt.gz resource (~34 GB) from NCBI, I used Bash command substitution to put 'zcat nt.gz' instead of a FASTA file name. The problem is that the resulting database is 320K, which is surely an error. Command line and output below: