duncanca / mosaik-aligner

Automatically exported from code.google.com/p/mosaik-aligner

Quality scores in unaligned reads in FASTQ file #26

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

MosaikBuild -fr test.solexa.fasta -fq test.solexa.qual -out test.solexa.fasta.mosaik.dat -st illumina
MosaikAligner -in test.solexa.fasta.mosaik.dat -out test.mosaik.aligned.dat -ia fasta.ref.dat -hs 15 -mm 6 -mhp 100 -act 13 -rur test.unalinged.fastq

What is the expected output? What do you see instead?

Expected output:
>LANLGAIIX_6_1000055:8:1:0:1821
-3 1 -2 -2 2 -3 -5 11 -3 -5 -3 -3 1 -1 -5 20 4 5 -2 -5 5 4 -1 8 0 5 -2 -3
-2 -3 -2 1 4 6 -1 4 2 1 -2 -4 9 2 -1 -3 4 0 14 -4 -3
Observed:
@LANLGAIIX_6_1000055:8:1:0:1821 (mate 1, length=49)
CTGCCTNACNAGAANCGAANCTCCGCCGCCCCACTAACTCCTTTACCGC
+
^^"^_^_#^^^\,^^^\^^^^" ^\5%&^_^\&% )!&^_^^^_^^^_"%' %#"^_^]*# ^^%!/^]^^

The ASCII quality string is longer than the sequence (49 bases).

What version of the product are you using? On what operating system?

version:1.0.1367
OS: x86_64 GNU/Linux

Please provide any additional information below.

I am trying to understand the ASCII encoding of negative quality scores. In
FASTQ, each quality score should map to exactly one ASCII character, but here
each negative score appears as two characters:
-5 => ^\
-4 => ^]
-3 => ^^
-2 => ^_
-1 => space

Do you have any suggestions to convert it back to decimal quality score?
Thanks,
Chien-Chi 

Original issue reported on code.google.com by Lo.chien...@gmail.com on 12 Jan 2010 at 9:01

GoogleCodeExporter commented 9 years ago
Hi there,

When MOSAIK imports files using MosaikBuild, it converts all of the qualities
into the standard phred score definition. For most data sets, this means that
the qualities are taken as is.

Illumina decided to be smart and created a new base quality definition. Above
base quality 10 the two scales are approximately equivalent. However, whereas
the phred scale offers poor resolution at base qualities under 10, Illumina
offers higher resolution by allowing negative base qualities, a sort of
log-odds approach.
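
For reference, the commonly cited relationship between the two scales can be
sketched as below. This is the standard Solexa-to-phred conversion formula; the
thread does not say whether MosaikBuild applies exactly this formula, and the
function name is only illustrative.

```python
import math


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa (log-odds) quality to the phred scale.

    Uses Q_phred = 10 * log10(10^(Q_solexa / 10) + 1).
    """
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


# Low scores differ noticeably; high scores are nearly identical.
for q in (-5, 0, 10, 40):
    print(q, round(solexa_to_phred(q), 2))
```

A Solexa score of 0 corresponds to a phred score of about 3, while at 40 the two
scales agree to within a hundredth of a unit.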

In essence, all bases that have a base quality of less than 10 are crap anyway,
so it's a bit of an academic discussion.

The unaligned reads you see in MosaikAligner fastq output use the fastq
specification developed at the Sanger, which means that BQ + 33 = ASCII code
for the base quality.

For example, to parse fastq files, all you have to do is subtract 33 from each
ASCII code in the base quality line to get all of the base qualities for that
read.

For some extra trivia, when Illumina creates fastq files in the Gerald
directory they use BQ + 64 = ASCII code since they use negative numbers.
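
A minimal sketch of the subtraction described above, with the offset as a
parameter so the same helper covers both the Sanger (+33) and Gerald (+64)
conventions. The function name is hypothetical and not part of MOSAIK.

```python
def decode_qualities(quality_line: str, offset: int = 33) -> list[int]:
    """Return one base quality per character: ASCII code minus the offset."""
    return [ord(ch) - offset for ch in quality_line]


# Sanger-style line (offset 33): '5' -> 20, '%' -> 4, '&' -> 5
print(decode_qualities("5%&"))

# Gerald-style line (offset 64): ';' -> -5, 'h' -> 40
print(decode_qualities(";h", offset=64))
```

With offset 33, any quality below 0 lands on a code below 33, which is a space
or a non-printing control character rather than a visible symbol.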

Hope this helps,

// Michael

Original comment by snowneb...@gmail.com on 16 Jan 2010 at 2:04

GoogleCodeExporter commented 9 years ago
Thanks,

I understand that those negative scores are crap, but I would like to understand
why the unaligned reads cannot be mapped to the reference genome. Is it a
quality issue or a sequence problem? What are they?

For the example that I posted, the positive quality scores can be recovered by
simply subtracting 33 from the ASCII codes. For negative quality scores,
though, it is not just a matter of subtracting 33 from each ASCII code, because
the mapping is not one ASCII character to one number (the lengths of the
sequence and quality strings do not match). I think I can write a parser based
on the rule I observed, but then the unaligned reads output from MosaikAligner
is no longer in FASTQ format.

ASCII => quality score
^\ => -5
^] => -4
^^ => -3
^_ => -2
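
One possible reading, not confirmed in the thread: the two-character pairs are
caret notation, the way viewers such as `less` or `cat -v` display a single
control byte (`^\` is byte 28, `^]` is 29, `^^` is 30, `^_` is 31), and a space
is byte 32. Under that assumption the file holds one byte per base, and
subtracting 33 reproduces the table above. A sketch of a parser for text copied
from such a display (the function name is hypothetical):

```python
def decode_caret_quality(displayed: str, offset: int = 33) -> list[int]:
    """Decode a quality string shown in caret notation into integer scores.

    Assumes every "^X" pair with X in '@'..'_' stands for one control byte
    (ord(X) - 64). A literal '^' quality character followed by such a
    character would be misread, so this is only safe for low-quality reads.
    """
    scores = []
    i = 0
    while i < len(displayed):
        ch = displayed[i]
        if ch == "^" and i + 1 < len(displayed) and "@" <= displayed[i + 1] <= "_":
            byte = ord(displayed[i + 1]) - 64  # control byte behind "^X"
            i += 2
        else:
            byte = ord(ch)  # ordinary printable character or space
            i += 1
        scores.append(byte - offset)
    return scores


# First eight qualities of the read posted above
print(decode_caret_quality(r'^^"^_^_#^^^\,'))
# [-3, 1, -2, -2, 2, -3, -5, 11]
```

Applied to the full quality line from the original post, this yields 49 scores
that match the expected values from the .qual file. Reading the raw bytes of the
file directly and subtracting 33 avoids the caret-notation step altogether.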

Original comment by Lo.chien...@gmail.com on 16 Jan 2010 at 4:54