amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

improper parsing of FASTQ records #46

Closed blahah closed 9 years ago

blahah commented 9 years ago

I have some FASTQ files in which many records have a first base with a quality score of 0, represented as @ (a perfectly valid zero in phred+64). When I run SNAP with these files as input, I get an error saying that reads are unpaired. However, there are no unpaired reads. The alignment works fine, i.e. it doesn't throw the unpaired read error, if I replace every @ at the beginning of a quality line as follows:

sed -E 's/^@([^r])/A/g' random.left.fq > random.left.cleaned.fq
sed -E 's/^@([^r])/A/g' random.right.fq > random.right.cleaned.fq

Thus the FASTQ parser seems to be improperly parsing some records when the quality line starts with a @.
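To make the suspected ambiguity concrete, here is a minimal Python sketch (not SNAP's parser, which is C++). It assumes unwrapped four-line records; the function names `phred64_scores` and `read_fastq_records` are hypothetical. It finds record boundaries by line position rather than by a leading @, so a quality line beginning with @ cannot be mistaken for a header.

```python
def phred64_scores(quality):
    """Decode a phred+64 quality string into integer scores ('@' -> 0)."""
    return [ord(ch) - 64 for ch in quality]


def read_fastq_records(lines):
    """Yield (name, seq, qual) tuples, assuming strict four-line records.

    Boundaries come from line position, not from a leading '@', so a
    quality line that starts with '@' is never treated as a header.
    A trailing incomplete record is silently ignored in this sketch.
    """
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield header[1:], seq, qual


# Hypothetical record whose quality line starts with '@' (Q0 in phred+64).
example = ["@r1/1\n", "ACGT\n", "+\n", "@hhh\n"]
records = list(read_fastq_records(example))
```

With this input, `records` holds a single tuple for read `r1/1`, and `phred64_scores("@hhh")` gives a score of 0 for the first base.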

The un-cleaned files are here.

bolosky commented 9 years ago

Can you send me an example?

And I hope I never meet the person who wrote the fastq spec. 😊

From: Richard Smith-Unna (notifications@github.com)
Sent: 1/16/2015 8:33 AM
To: amplab/snap (snap@noreply.github.com)
Subject: [snap] improper parsing of FASTQ records (#46)

blahah commented 9 years ago

The FASTQ files are uploaded here: https://drive.google.com/folderview?id=0B6ChGXuXmOEDflRVaDFYTkZsOWV2c3dFcW5acGVGOFB6RURVbjkxMVB6V09FejZTVzNzZkE&usp=sharing

I think the reason you'll never meet them is the problem - there is no standard spec! It's all been reverse-engineered and gleaned from snippets of information given out by the various manufacturers of sequencing machines, and each machine and firmware version changes the spec. Format hell at its finest.

blahah commented 9 years ago

I noticed there was a problem with my sed command above - it actually replaced the @ and the following character with A, so it was reducing the length of the quality string by 1. This (for some reason) made SNAP skip the read and its pair, no longer throwing the error. When I realised my mistake, I found that it was actually lower case 'n' in the sequences that was causing the unpaired error. Fixed in a PR.
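The sed slip described above can be reproduced with the same pattern in Python's `re` module. This is only an illustration on a made-up quality string, not the actual data: the capture group consumes the character after the @, so unless it is put back with a back-reference the line loses one character.

```python
import re

# Hypothetical phred+64 quality line whose first base has Q0 ('@').
quality = "@hhhh"

# Same substitution as the original sed command: the captured character
# is matched but not written back, so the string gets one shorter.
buggy = re.sub(r"^@([^r])", "A", quality)

# Writing the captured character back keeps the original length.
fixed = re.sub(r"^@([^r])", r"A\1", quality)

print(len(quality), len(buggy), len(fixed))
```

In sed the analogous change is to use `A\1` as the replacement. Note that either version still changes the first score, since A is Q1 in phred+64 rather than Q0.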

bolosky commented 9 years ago

I think that lower case bases in input should be working now. And having quality and SEQ strings of different lengths is legitimately a format error, so SNAP should reject it.
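As a rough sketch of the two behaviours mentioned here (this is not SNAP's code, and `normalize_record` is a hypothetical name): reject a record whose sequence and quality strings differ in length, and accept lower case bases. Upper-casing the bases is an assumption made for illustration; the thread does not say how SNAP handles them internally.

```python
def normalize_record(name, seq, qual):
    """Validate one FASTQ record and return it with upper-cased bases.

    Raises ValueError when the sequence and quality strings differ in
    length, which is treated as a format error.
    """
    if len(seq) != len(qual):
        raise ValueError(
            f"{name}: sequence length {len(seq)} != quality length {len(qual)}"
        )
    # Assumption for this sketch: lower case bases ('n', 'acgt') are
    # accepted by folding them to upper case.
    return name, seq.upper(), qual


# Hypothetical record with lower case bases and a Q0 first base.
record = normalize_record("r1", "acgn", "@hhh")
```

A pre-flight check like this over both files of a pair would flag a shortened quality string before alignment, instead of having the read silently skipped.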