I think I've found an issue in the implementation of QualityEncoding that affects the handling of quality scores in the Solexa/Illumina 1.0 format.
QualityEncoding doesn't support negative quality scores:
The Solexa/Illumina 1.0 format uses ASCII 59 through 126 to represent scores between -5 and 62, i.e. an offset of 64. However, the current QualityEncoding constructor throws an error when the low end of the quality encoding range is less than the offset. This can be seen in src/FASTQ/quality.jl:
    elseif low < offset
        error("Low end of in quality encoding range cannot be less than offset")
This leads to errors when reading Solexa/Illumina 1.0 files that contain negative quality scores. I suggest removing these two lines: the check is an unnecessary constraint that only applies when constructing a QualityEncoding, and it prevents us from creating encodings for formats with negative scores and therefore from decoding them properly.
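To make the arithmetic concrete, here is a minimal sketch of the mapping described above. It does not use the FASTQ module; `SOLEXA_OFFSET` and `solexa_score` are hypothetical names introduced only for illustration.

```julia
# Hypothetical helper, not part of the FASTQ module: decode a single Solexa
# quality character by subtracting the offset of 64.
const SOLEXA_OFFSET = 64

solexa_score(c::Char) = Int(c) - SOLEXA_OFFSET

solexa_score(';')  # ASCII 59  -> -5 (low end, below the offset)
solexa_score('@')  # ASCII 64  ->  0
solexa_score('~')  # ASCII 126 -> 62 (high end)
```

The low end of the range (59) is below the offset (64), which is exactly the case the constructor check rejects.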
Incorrect first character in SOLEXA_QUAL_ENCODING:
I also noticed that the first character of the built-in Solexa encoding is ASCII 64 '@', when it should be ASCII 59 ';'. The definition is further down in src/FASTQ/quality.jl.
If the ';' character had been used there, the current constructor would have thrown the low < offset error described above.
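As a rough sketch of what a constructor without the check would allow, the snippet below uses a stand-in type rather than the real QualityEncoding. `SketchEncoding`, its field names, and `sketch_decode` are all assumptions made for illustration; the actual struct in src/FASTQ/quality.jl may be laid out differently.

```julia
# Hypothetical stand-in for QualityEncoding; the field names are assumptions.
struct SketchEncoding
    low::Int     # lowest valid ASCII code
    high::Int    # highest valid ASCII code
    offset::Int  # value subtracted from the ASCII code to get the score
end

# Build from a character range. There is deliberately no `low < offset` check.
SketchEncoding(r::AbstractRange{Char}, offset::Integer) =
    SketchEncoding(Int(first(r)), Int(last(r)), Int(offset))

# Decode one ASCII code, still rejecting codes outside the encoding range.
function sketch_decode(enc::SketchEncoding, q::Integer)
    enc.low <= q <= enc.high ||
        error("Quality $q not in encoding range $(enc.low):$(enc.high)")
    return q - enc.offset
end

solexa = SketchEncoding(';':'~', 64)
sketch_decode(solexa, Int(';'))                      # -5
scores = [sketch_decode(solexa, Int(c)) for c in ";?@A"]  # [-5, -1, 0, 1]
```

The range check on decoding is kept, so out-of-range characters are still rejected; only the construction-time restriction is dropped.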
Examples
Current behavior:
julia> FASTQ.decode_quality(QualityEncoding('@':'~', 64), Int(';')) # current built-in Solexa encoding breaks
ERROR: Quality 59 not in encoding range 64:126
julia> FASTQ.decode_quality(QualityEncoding(';':'~', 64), Int(';')) # current constructor doesn't allow creating the correct encoding
ERROR: Low end of in quality encoding range cannot be less than offset
julia> collect(quality_scores(FASTQRecord("","ACGT",";?@A"), :solexa))
ERROR: Quality 59 not in encoding range 64:126
After removing the low < offset check and changing the first character of the Solexa encoding:
Cheers!