amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

Streaming input causing FASTQ parsing errors due to incorrect line breaks #73

Closed chapmanb closed 8 years ago

chapmanb commented 8 years ago

Hi all; I've been using SNAP with streaming input: I build an interleaved FASTQ stream and fix problems in the FASTQ files with awk (adding /1 and /2 to read names, removing spaces, things like that), so the input is piped directly into SNAP. We regularly see error messages indicating that the input is being split at points other than line endings. Some examples:

FASTQ file - has invalid starting character at offset 962920112, line type 0, char 9
Line in question: '9929.102677888_HWI-ST211R_330:5:2106:18153:132057/2'
SNAP exited with exit code 1 from line 267 of file SNAPLib/FASTQ.cpp

FASTQ file - has invalid starting character at offset 2080744180, line type 0, char C
Line in question: 'CTTGATAAGGATTGGGGCTGGGGGGTTTCCTTAGGGACGACCTGGCCCAGCTGCCCTTCCTGACCATGTGCATTAAGGAGAGCCTG'
SNAP exited with exit code 1 from line 267 of file SNAPLib/FASTQ.cpp

FASTQ file - has invalid starting character at offset 598685188, line type 0, char G
Line in question: 'GATTTGAAGGTCTGATGATGCCACATTAGGAGGCGGGCGG'
SNAP exited with exit code 1 from line 267 of file SNAPLib/FASTQ.cpp
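In each of these messages the "line in question" does not begin with `@`, although a header line seems to be expected at that point. One way to rule out malformed input before suspecting the reader is to check the stream independently. The following is a minimal sketch, assuming strict four-line FASTQ records; the function name `first_invalid_record` is hypothetical and is not part of SNAP.

```python
def first_invalid_record(lines):
    """Return (line_number, reason) for the first malformed record, or None.

    Assumes strict 4-line FASTQ records: header, sequence, separator, quality.
    """
    record = []
    line_number = 0
    for line_number, line in enumerate(lines, start=1):
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        header, sequence, separator, quality = record
        start = line_number - 3  # line number of this record's header
        if not header.startswith("@"):
            return start, "header does not start with '@'"
        if not separator.startswith("+"):
            return start + 2, "separator does not start with '+'"
        if len(sequence) != len(quality):
            return start + 1, "sequence and quality lengths differ"
        record = []
    if record:
        return line_number, "stream ends in the middle of a record"
    return None
```

If this returns `None` for the same stream that SNAP rejects, the input itself is well formed and the problem is more likely in how the stream is being read.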

I tried to put together a minimal reproducing example, but I can't trigger the error without larger inputs. Roughly, I'm running the analysis like this:

cat input-interleaved.fastq | snap-aligner paired $REFDIR -pairedInterleavedFastq - -t 8 -M -o -sam - > /dev/null
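The awk cleanup step that feeds this pipeline is not shown in the report. As an illustration only, a hypothetical Python equivalent might look like the sketch below. It assumes strict four-line records with mates alternating (read 1, then read 2), and it assumes that "removing spaces" means replacing them with underscores; both are guesses, and `normalize_interleaved` is an invented name.

```python
def normalize_interleaved(lines):
    """Yield cleaned lines from an interleaved FASTQ stream.

    Assumptions (not from the original report): 4-line records, mates
    alternate 1, 2, 1, 2, ..., and spaces in names become underscores.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\r\n")
        if index % 4 == 0:  # header line of a record
            mate = (index // 4) % 2 + 1  # 1 for the first mate, 2 for the second
            name = line.replace(" ", "_")
            if name.endswith(("/1", "/2")):
                name = name[:-2]  # drop an existing suffix before re-adding it
            line = f"{name}/{mate}"
        yield line + "\n"
```

Used as a filter, this could be called as `sys.stdout.writelines(normalize_interleaved(sys.stdin))` and placed in the pipe ahead of `snap-aligner`.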

I've tried various buffering workarounds on the command line with limited success (sometimes they fix it, sometimes not), but I'm a bit stuck on how best to proceed. Is there anything that could be fixed on the SNAP side to handle streaming input?

Thanks for taking a look and happy to provide any other information that would help.
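For context on the streaming question above: the thread does not state the root cause, and the sketch below is not taken from SNAP's source. It only illustrates a general point about reading from a pipe in chunks. A chunk can end in the middle of a line, so a reader has to carry the partial line over to the next chunk instead of treating each chunk boundary as a line boundary. The name `iter_lines_from_chunks` and the default `chunk_size` are illustrative assumptions.

```python
import io


def iter_lines_from_chunks(stream, chunk_size=64 * 1024):
    """Yield complete lines (without the newline) from a binary stream.

    The stream is read in chunks; any trailing partial line is kept in
    `pending` and joined with the start of the next chunk.
    """
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        pending += chunk
        # Everything before the last newline is complete; the rest is carried over.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line
    if pending:  # final line with no trailing newline
        yield pending
```

For example, `list(iter_lines_from_chunks(io.BytesIO(data), chunk_size=3))` returns the same lines as `data.splitlines()` for newline-terminated input, even though almost every chunk ends mid-line.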

rnpandya commented 8 years ago

Fixed in 1.0dev.97