ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/

Bayeshammer FASTQs do not obey FASTQ spec #414

Open ijhoskins opened 4 years ago

ijhoskins commented 4 years ago

I recently opened an issue, https://github.com/ablab/spades/issues/413, which I closed because I thought the problem was tabs in the seqname. I hadn't seen the full error at that point. It turns out the actual issue is that the Bayeshammer FASTQs do not always repeat exactly the same seqname on the "+" line (before the quals) as on the "@" line (before the sequence):

@COOPER:281:H2HJ3BBXY:6:1101:1184:44377 RG:Z:CBS1_31_S10_L006_R BH:changed:7
GCTTCAGTGGGCACGGGCGGCACCATCACTGGCATTGCCAGGAAGCTGAAGGAGAAGTGTCCTGGATGCACGATCATTAGGGTGCATCCCGAAGGGTCCATCCTCGCAGAGCCCGAAGAGCTGAACCAGACGCAGCAGACAACCTAC
+COOPER:281:H2HJ3BBXY:6:1101:1184:44377 RG:Z:CBS1_31_S10_L006_R BH:changed:7
SSSSSSSSSSSSSSSSSSJFJJJFJ-FFJ#AFJJJ7-FJJF<JF<FFFJFJJJJ7<FJ7AJ<A7AJ--AF-A--A-77#A-7-7##FFF#F#J<<-77FF7AJ7J)AJFA)A<-7F)7--7--A#7<<7-7)SSSSSSSSSSSSSSS

@COOPER:281:H2HJ3BBXY:6:1101:1184:44377 RG:Z:CBS1_31_S10_L006_R BH:changed:6
CCTCGCAGGTTGTCTGCTCCGTCTGGTTCAGCCCCTCCCCCTCTGCGAGGATGGACCCTTCGGGATCCACCCCAATGATCCTGCATCCAGGACACTCCCCCTTCACCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+COOPER:281:H2HJ3BBXY:6:1101:1184:44377 RG:Z:CBS1_31_S10_L006_R BH:changed:6 rtrim=35
SSSSSSSSSSSSSSSSSSSFA#A---<--F-#-7--7#7-7-F#7FA--77<--AFAF7-F-7-7<FFAJ--AAJ<-<-AA-----7A-7--77--77-AJ--7<-FBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Note that in the R2 record (the second one), the "+" line carries an extra rtrim=35 suffix, so it does not repeat the seqname from the "@" line exactly. This violates the FASTQ spec: http://maq.sourceforge.net/fastq.shtml
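To make the mismatch easy to spot across a whole file, here is a minimal Python sketch of the check I have in mind. It assumes plain 4-line records (no wrapped sequences) and treats a "+" line as valid only if it is bare or repeats the "@" line's name exactly, which is my reading of the spec linked above. The function name `find_plus_line_mismatches` is made up for illustration and is not part of SPAdes.

```python
from typing import Iterable, Iterator, Tuple


def find_plus_line_mismatches(
    lines: Iterable[str],
) -> Iterator[Tuple[int, str, str]]:
    """Yield (record_index, header_name, plus_name) for offending records.

    Assumption: each record is exactly 4 lines; blank lines between
    records are ignored. A record is reported when its "+" line carries
    a name that differs from the name on its "@" line.
    """
    # Drop newlines and skip blank separator lines.
    kept = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(kept) - 3, 4):
        header, _seq, plus, _qual = kept[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at kept line {i + 1}")
        header_name, plus_name = header[1:], plus[1:]
        # A bare "+" is allowed; a non-empty name must match exactly.
        if plus_name and plus_name != header_name:
            yield (i // 4, header_name, plus_name)


if __name__ == "__main__":
    import sys

    # Usage: python check_fastq.py reads.fastq  (file name is hypothetical)
    with open(sys.argv[1]) as handle:
        for index, name, plus in find_plus_line_mismatches(handle):
            print(f"record {index}: '@{name}' vs '+{plus}'")
```

On the two Bayeshammer records above, this should flag only the second one, since the first repeats its name exactly.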

ijhoskins commented 4 years ago

I should note that I ran Bayeshammer on FASTQs in which I had edited the quals at the read termini to be identical. The issue may be specific to this kind of data, as I never hit this error when running on unmodified FASTQs.

ijhoskins commented 4 years ago

For completeness, here are the FASTQ records before input to Bayeshammer:

@COOPER:281:H2HJ3BBXY:6:1101:1184:44377 RG:Z:CBS1_31_S10_L006_R
GCTTCAGTGGGCACGGGCGGCACCATCACNGGCATTGCCAGGAAGCTGAAGGAGAAGTGTCCTGGATGCACGATCATTNGGGTGNNTCCNGNAGGGTCCATCCTCGCAGAGCCCGAAGAGCTGANCCAGACGCAGCAGACAACCTAC
+
SSSSSSSSSSSSSSSSSSJFJJJFJ-FFJ#AFJJJ7-FJJF<JF<FFFJFJJJJ7<FJ7AJ<A7AJ--AF-A--A-77#A-7-7##FFF#F#J<<-77FF7AJ7J)AJFA)A<-7F)7--7--A#7<<7-7)SSSSSSSSSSSSSSS

@COOPER:281:H2HJ3BBXY:6:1101:1184:44377 RG:Z:CBS1_31_S10_L006_R
CCTCGCAGGTTGTCTNCTCCGNCTGGTTCAGNCCCTCNCCCTCNGCGAGGATGGACCCTTCGGGATCCACCCCAATGATCCTGCATCCAGGACACTCCCCCTCCACCCTCCTGGCAACCCCCGCCACAGCCCCCCCCGCCCC
+
SSSSSSSSSSSSSSSSSSSFA#A---<--F-#-7--7#7-7-F#7FA--77<--AFAF7-F-7-7<FFAJ--AAJ<-<-AA-----7A-7--77--77-AJ--7<-F--7A---7----7A7----7--7--A-)7A---AA

ijhoskins commented 4 years ago

@andrewprzh this seems like a fatal error for users who run only Bayeshammer. Can you reproduce it? Let me know if I can provide any files.
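In the meantime, a possible user-side stopgap is to reset every "+" line to a bare "+", which the spec permits and which matches my input files. This is only a sketch under the assumption of 4-line records. It does not touch Bayeshammer itself, and the function name `reset_plus_lines` is hypothetical.

```python
from typing import Iterable, Iterator


def reset_plus_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with every separator line reduced to a bare '+'.

    Assumption: each record is exactly 4 lines; blank lines are dropped.
    The position counter is used instead of a prefix check because a
    quality string may itself begin with '+'.
    """
    position = 0  # index of the current line within its 4-line record
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # skip blank lines between records
        if position == 2:
            if not text.startswith("+"):
                raise ValueError(f"expected a '+' line, got: {text!r}")
            text = "+"  # discard any repeated name or extra suffix
        yield text + "\n"
        position = (position + 1) % 4
```

The output can be written back with something like `out.writelines(reset_plus_lines(inp))` on two open file handles.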