Closed blahah closed 9 years ago
Can you send me an example?
And I hope I never meet the person who wrote the fastq spec.😊
From: Richard Smith-Unna <notifications@github.com>
Sent: 1/16/2015 8:33 AM
To: amplab/snap <snap@noreply.github.com>
Subject: [snap] improper parsing of FASTQ records (#46)
I have some FASTQ files which have a lot of records where the first base has a 0 quality score, represented as @. When I run SNAP with these files as input, I get an error that reads are unpaired. However, there are no unpaired reads. The alignment works fine, i.e. it doesn't throw the unpaired read error, if I replace all the @s at the beginning of the quality lines of the records as follows:
sed -E 's/^@([^r])/A/g' random.left.fq > random.left.cleaned.fq
sed -E 's/^@([^r])/A/g' random.right.fq > random.right.cleaned.fq
The FASTQ files are uploaded here: https://drive.google.com/folderview?id=0B6ChGXuXmOEDflRVaDFYTkZsOWV2c3dFcW5acGVGOFB6RURVbjkxMVB6V09FejZTVzNzZkE&usp=sharing
I think the reason you'll never meet them is the problem - there is no standard spec! It's all been reverse-engineered and gleaned from snippets of information given out by the various manufacturers of sequencing machines, and each machine and firmware version changes the spec. Format hell at its finest.
I noticed there was a problem with my sed command above - it actually replaced the @ and the following character with A, so it was reducing the length of the quality string by 1. This (for some reason) made SNAP skip the read and its pair, no longer throwing the error. When I realised my mistake, I found that it was actually lower case 'n' in the sequences that was causing the unpaired error. Fixed in a PR.
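To make the slip concrete, here is a small Python sketch of the same substitution using re.sub, which handles this particular pattern the same way the sed expression does. The sample quality string and header are made up for illustration, and the corrected version (putting the captured character back with \1) is just one possible fix, not necessarily what was run afterwards.

```python
import re

# Same pattern as the sed command: '@' at the start of a line followed by
# any character other than 'r' (presumably to skip headers like '@r...').
PATTERN = re.compile(r"^@([^r])")

def buggy_clean(line):
    # Replaces BOTH matched characters with 'A', so the line loses one character.
    return PATTERN.sub("A", line)

def fixed_clean(line):
    # Puts the captured character back, so the length is preserved.
    return PATTERN.sub(r"A\1", line)

quality = "@BCDEFGH"  # hypothetical quality line starting with '@'
header = "@r1/1"      # hypothetical header; the pattern does not match it

print(buggy_clean(quality), len(buggy_clean(quality)))  # ACDEFGH 7
print(fixed_clean(quality), len(fixed_clean(quality)))  # ABCDEFGH 8
print(buggy_clean(header))                              # @r1/1
```

The buggy form leaves the quality string one character shorter than the sequence, which is the length mismatch described above.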
I think that lowercase bases in the input should be working now. And having quality and SEQ strings of different lengths is legitimately a format error, so SNAP should reject it.
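As a rough illustration of those two rules (accept lowercase bases, reject mismatched lengths), here is a minimal Python sketch of a FASTQ reader. It is not SNAP's implementation: read_fastq is a hypothetical name, and the sketch assumes unwrapped records of exactly four lines each, so it never needs to guess whether a leading @ starts a header or a quality line.

```python
import io

def read_fastq(handle):
    """Yield (name, seq, qual) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\n")
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"SEQ/quality length mismatch in {header!r}")
        # Normalise case so lowercase bases such as 'n' are accepted.
        yield header[1:], seq.upper(), qual

# Hypothetical two-record input; the first quality line starts with '@'.
data = "@r1/1\nacgn\n+\n@IIH\n@r2/1\nACGT\n+\nIIII\n"
records = list(read_fastq(io.StringIO(data)))
print(records)
```

Reading by position rather than scanning for @ means a quality line that begins with @ is never mistaken for a new record.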
I have some FASTQ files which have a lot of records where the first base has a 0 quality score, represented as @ (a perfectly valid zero in phred+64). When I run SNAP with these files as input, I get an error that reads are unpaired. However, there are no unpaired reads. The alignment works fine, i.e. it doesn't throw the unpaired read error, if I replace all the @s at the beginning of the quality lines of the records as follows:
sed -E 's/^@([^r])/A/g' random.left.fq > random.left.cleaned.fq
sed -E 's/^@([^r])/A/g' random.right.fq > random.right.cleaned.fq
Thus the FASTQ parser seems to be improperly parsing some records when the quality line starts with a @.
The un-cleaned files are here.
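For anyone wondering why @ counts as a zero: in phred+64 each quality character encodes its score as the ASCII code minus 64, and @ is ASCII 64. A small Python sketch (phred_scores is a hypothetical helper name, and the quality string is made up):

```python
def phred_scores(quality, offset=64):
    # Each character encodes one score: its ASCII code minus the offset.
    return [ord(ch) - offset for ch in quality]

print(phred_scores("@ABh"))             # [0, 1, 2, 40]
print(phred_scores("@", offset=33)[0])  # 31, the same character read as phred+33
```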