amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
288 stars 66 forks source link

Compressed Input from process substitution (gzip or bz2) fails #31

Closed binarybana closed 9 years ago

binarybana commented 10 years ago

Using 1b12 (git master) I tried:

$ snap-dev single /mnt/datab/refs/grcm38/SNAP-bspace -fastq <(lbzcat xaa.fastq.bz2) -map -o xaa.bam
Welcome to SNAP version 1.0beta.12.

Loading index from directory... 5s.  2731034030 bases, seed size 20
MaxHits MaxDist %Used   %Unique %Multi  %!Found %Pairs  lvCalls NumReads        Reads/s
300     14      100.00% 0.00%   0.00%   0.00%   0.00%   0       0       200 (at: 5)

Whereas using the uncompressed version:

$ snap-dev single /mnt/datab/refs/grcm38/SNAP-bspace xaa.fastq -map -o xaa.bam
Welcome to SNAP version 1.0beta.12.

Loading index from directory... 5s.  2731034030 bases, seed size 20
MaxHits MaxDist %Used   %Unique %Multi  %!Found %Pairs  lvCalls NumReads        Reads/s
300     14      99.60%  77.51%  20.48%  2.01%   0.00%   0       250     8032 (at: 31)

I've uploaded the test file here and the reference is just the mouse Ensembl grcm38 with 96 ERCC transcripts added on at the end (each as a separate chromosome).

Let me know what else I can do to help.

bolosky commented 9 years ago

snap's not going to do well with what looks like a file, but into which it can't get the size or seek. The right way to do this is to pipe the output of lbzcat into snap, and then tell it to run from stdin:

lbzcat xaa.fastq.bz2 | snap-dev single /mnt/datab/refs/grcm38/SNAP-bspace -fastq - -map -o xaa.bam

A better question is why we don't have native decompress support for bz2, so you could just use it on the command line directly like with gzip.