BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License
61 stars 20 forks source link

Issue with QualityEncoding for the Solexa format #104

Closed AntonOresten closed 1 year ago

AntonOresten commented 1 year ago

Helloo!

I think I've discovered an issue with the implementation of the QualityEncoding that could affect the handling of quality scores in the Solexa/Illumina 1.0 format.

QualityEncoding doesn't support negative quality scores:

The Solexa/Illumina 1.0 format uses ASCII 59 through 126 to represent scores between -5 and 62. However, the current QualityEncoding constructor throws an error when the low end of the quality encoding range is less than the offset: This can be seen in src/FASTQ/quality.jl:

elseif low < offset
    error("Low end of in quality encoding range cannot be less than offset")

This may lead to errors when reading Solexa/Illumina 1.0 files with negative quality scores. I suggest removing these two lines, as it's an unnecessary contraint that only applies when creating a QualityEncoding, and actually prevents us from decoding certain formats properly and creating encodings for them.

Incorrect first character in SOLEXA_QUAL_ENCODING:

I also noticed that the first character for Solexa encoding is ASCII 64 '@', when it should be ASCII 59 ';'. This can be seen further down in src/FASTQ/quality.jl:

"Solexa (Solexa+64) quality score encoding"
const SOLEXA_QUAL_ENCODING     = QualityEncoding('@':'~', 64)

"Illumina 1.3 (Phred+64) quality score encoding"
const ILLUMINA13_QUAL_ENCODING = QualityEncoding('@':'~', 64)

If the ';' character was used, the current version would've thrown that low < offset error.

Examples

Current behavior:

julia> FASTQ.decode_quality(QualityEncoding('@':'~', 64), Int(';')) # current built-in Solexa encoding breaks
ERROR: Quality 59 not in encoding range 64:126

julia> FASTQ.decode_quality(QualityEncoding(';':'~', 64), Int(';')) # current constructor doesn't allow creating the correct encoding
ERROR: Low end of in quality encoding range cannot be less than offset

julia> collect(quality_scores(FASTQRecord("","ACGT",";?@A"), :solexa))
ERROR: Quality 59 not in encoding range 64:126

After removing the low < offset check and changing the first character of the Solexa encoding:

julia> FASTQ.decode_quality(QualityEncoding(';':'~', 64), Int(';'))

julia> collect(quality_scores(FASTQRecord("","ACGT",";?@A"), :solexa))
4-element Vector{Int64}:
 -5
 -1
  0
  1

Cheers!

jakobnissen commented 1 year ago

Thank you for the bug report.

You're right - I had no idea that Solexa didn't use PHRED scores, and so I thought any score < 0 must have been a mistake. Fixed in an upcoming PR.

AntonOresten commented 1 year ago

Awesome! I appreciate it!