Gaius-Augustus / Augustus

Genome annotation with AUGUSTUS
http://bioinf.uni-greifswald.de/webaugustus/

What is the meaning of the score in the 6th column in the output? #423

Closed laurahum closed 1 month ago

laurahum commented 1 month ago

I am running augustus like this:

augustus sequence.fasta \
--proteinprofile=proteinprofile.prf1 \
--softmasking=0 \
--gff3=on  \
--species=human > output.gff3

and I get this kind of output:

sequence AUGUSTUS gene 1553155 1602091 1 + . ID=g28;score=1
sequence AUGUSTUS CDS 1553155 1553245 . + 0 Parent=g28.t1
sequence AUGUSTUS CDS 1555225 1555323 . + 0 Parent=g28.t1
sequence AUGUSTUS CDS 1557037 1557132 . + 0 Parent=g28.t1
sequence AUGUSTUS CDS 1597926 1598077 . + 0 Parent=g28.t1
sequence AUGUSTUS CDS 1598173 1598181 . + 0 Parent=g28.t1
sequence AUGUSTUS CDS 1599530 1599624 . + 0 Parent=g28.t1
sequence AUGUSTUS CDS 1602055 1602091 . + 0 ID=g28.t1.cds;Parent=g28.t1
sequence AUGUSTUS start_codon 1553155 1553157 . + 0 Parent=g28.t1
sequence AUGUSTUS interblock_region 1553155 1553245 . + 0 ID=pp.g28.t1.iBR0
sequence AUGUSTUS transcript 1553155 1602091 . + . ID=g28.t1;Parent=g28
sequence AUGUSTUS interblock_region 1555225 1555247 . + 2 ID=pp.g28.t1.iBR0
sequence AUGUSTUS protein_match 1555248 1555307 1.78 + 0 ID=pp.g28.t1.unknown_B;Target=unknown_B 1 20;target_start=38;score=1.78
sequence AUGUSTUS interblock_region 1555308 1555323 . + 0 ID=pp.g28.t1.iBR1
sequence AUGUSTUS interblock_region 1557037 1557128 . + 2 ID=pp.g28.t1.iBR1
sequence AUGUSTUS protein_match 1557129 1557132 0.665 + 0 ID=pp.g28.t1.unknown_D;Target=unknown_D 1 2;target_start=94;score=0.665
sequence AUGUSTUS protein_match 1597926 1598077 2.45 + 1 ID=pp.g28.t1.unknown_D;Target=unknown_D 2 53;target_start=93;score=2.45
sequence AUGUSTUS protein_match 1598173 1598175 0.824 + 0 ID=pp.g28.t1.unknown_D;Target=unknown_D 53 53;target_start=94;score=0.824
sequence AUGUSTUS protein_match 1598176 1598181 1.3 + 0 ID=pp.g28.t1.unknown_E;Target=unknown_E 1 2;target_start=147;score=1.3
sequence AUGUSTUS protein_match 1599530 1599604 1.89 + 0 ID=pp.g28.t1.unknown_E;Target=unknown_E 3 27;target_start=147;score=1.89
sequence AUGUSTUS interblock_region 1599605 1599624 . + 0 ID=pp.g28.t1.iBR3
sequence AUGUSTUS interblock_region 1602055 1602091 . + 1 ID=pp.g28.t1.iBR3
sequence AUGUSTUS stop_codon 1602089 1602091 . + 0 Parent=g28.t1

I would like to know what the value in the 6th column means. In every gene feature predicted in my output the score is 1, while in the protein_match features the score varies, for example score=0.665 or score=2.45.
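To make the pattern described above easier to inspect, here is a minimal sketch that groups the 6th-column values by feature type (3rd column). The function name `scores_by_feature` is hypothetical and not part of AUGUSTUS; the sketch assumes nine-column GFF3 lines and falls back to whitespace splitting for text pasted without tabs.

```python
from collections import defaultdict


def scores_by_feature(gff_lines):
    """Group the 6th-column scores by feature type (3rd column).

    Hypothetical helper for illustration; not part of AUGUSTUS.
    """
    grouped = defaultdict(list)
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF3 comments/directives
        # Real GFF3 is tab-separated; pasted text may only contain spaces.
        # maxsplit=8 keeps the attribute column (which may hold spaces) whole.
        fields = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(fields) < 8:
            continue  # not a feature line
        feature, score = fields[2], fields[5]
        if score != ".":  # "." means no score was reported
            grouped[feature].append(float(score))
    return dict(grouped)


example = [
    "sequence AUGUSTUS gene 1553155 1602091 1 + . ID=g28;score=1",
    "sequence AUGUSTUS CDS 1553155 1553245 . + 0 Parent=g28.t1",
    "sequence AUGUSTUS protein_match 1555248 1555307 1.78 + 0 "
    "ID=pp.g28.t1.unknown_B;Target=unknown_B 1 20;target_start=38;score=1.78",
    "sequence AUGUSTUS protein_match 1557129 1557132 0.665 + 0 "
    "ID=pp.g28.t1.unknown_D;Target=unknown_D 1 2;target_start=94;score=0.665",
]
print(scores_by_feature(example))
```

On these four lines only the gene and protein_match features carry a score; the CDS line has "." in the 6th column and is skipped.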

I read in the documentation that a score is generated if the 'sample' flag is selected, but I did not use that flag. The documentation also mentions a score produced when using hints, but I did not provide any hints to the run.

So I'm a bit confused about what this score means. If you could explain what this score means and how Augustus calculates it, that would be really helpful!

Thanks!

MarioStanke commented 1 month ago

The scores of protein_match lines have a completely different interpretation from the scores of predicted gene structures, e.g. CDS lines. The latter are probabilities, if present.

protein_match scores measure the similarity of the input protein profile to the predicted protein sequence in a region. They are computed from an odds ratio:

P (proteinseq | profile) / P(proteinseq | background model)

The value is normalized by the length of the sequence: when the match has length r, the r-th root of the odds ratio is taken. Values larger than 1 mean that the profile matches better than the background model, and such values should be predominant.
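As a rough illustration of the normalization described above, the following sketch takes the r-th root of an odds ratio. It is not the AUGUSTUS implementation: the function name `normalized_odds_score` is hypothetical, and it assumes, purely for illustration, that both probabilities factorize into per-position terms, which may not match how pp_scoring.hh computes them.

```python
import math


def normalized_odds_score(profile_probs, background_probs):
    """Return the r-th root of P(seq | profile) / P(seq | background).

    Illustrative assumption: each model's probability is the product of
    per-position probabilities, given here as two equal-length lists.
    """
    if not profile_probs or len(profile_probs) != len(background_probs):
        raise ValueError("need two non-empty lists of equal length")
    r = len(profile_probs)
    # Sum log-ratios instead of multiplying to avoid underflow on long matches.
    log_odds = sum(
        math.log(p) - math.log(q)
        for p, q in zip(profile_probs, background_probs)
    )
    # exp(log_odds / r) is the r-th root of the odds ratio.
    return math.exp(log_odds / r)


# Profile favours each residue 4x over the background: score is about 4.
print(normalized_odds_score([0.2, 0.2, 0.2], [0.05, 0.05, 0.05]))
# Profile and background agree: score is about 1.
print(normalized_odds_score([0.05, 0.05], [0.05, 0.05]))
```

Under this reading the score is a geometric mean of per-position odds, so a reported score s for a match of length r corresponds to an unnormalized odds ratio of s to the power r. This also makes scores comparable between short and long matches.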

This has been implemented by Oliver Keller and details are described in his dissertation:

https://ediss.uni-goettingen.de/handle/11858/00-1735-0000-0006-B6A7-D

"The odds score reflects the ratio of probabilities in the background model to the model defined by the mapping"

Here is the code with the root-taking: https://github.com/Gaius-Augustus/Augustus/blob/955ce1731e9bdd1c670216ed3c514f978c033891/include/pp_scoring.hh#L247