Difficulties with codeml output

etetoolkit / ete

Python package for building, comparing, annotating, manipulating and visualising trees. It provides a comprehensive API and a collection of command line tools, including utilities to work with the NCBI taxonomy tree.

http://etetoolkit.org

GNU General Public License v3.0


Difficulties with codeml output #614

Open pedro-mmartins opened 2 years ago

pedro-mmartins commented 2 years ago

Hello! Hope this message finds you well.

I've just used ete3 to run codeml on a series of alignments. I ran models bsA and bsA1 to test for positive selection, but I'm having trouble interpreting the output. I have some doubts about the table in the output that reports whether a codon is positively selected (or conserved, etc.). Sometimes there is a single number for the codon position, but sometimes there are two numbers separated by a hyphen (as in the example below). I assumed that all codons within that range were positively selected; am I correct? I also noticed that sometimes not all codons are represented in the table, and that a different number of codons appears for each model. Does that have any meaning?

Thanks a lot!

fransua commented 2 years ago

Hi Pedro, I am missing the "example below".

pedro-mmartins commented 2 years ago

Hello, again!

I'm sorry, I changed some parts of my message and ended up forgetting it. I put it here now:

    68   | Conserved (probability > 0.95)
69- 70   | Positively-selected (probability > 0.99)
72- 73   | Conserved (probability > 0.95)

You can see the hyphen, and also that codon #71 is missing.

Thanks

fransua commented 2 years ago

Hi, OK, I see. The codons are grouped: codons 69 to 70 belong to the category where omega is above 1 (positively selected) with a very high probability (> 0.99), there is nothing significant to report about codon 71, and so on.

I group codons because otherwise the list can be very long. Also, this output is a summary of the original CODEML output, which you can find in the main CODEML output file named "out".

hope this helps
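To work with individual codons rather than grouped ranges, the summary rows can be expanded back into one entry per position. The following is only a minimal sketch: it assumes every row follows the layout of the excerpt quoted in this thread (`start[- end] | category`), and the function name `expand_site_summary` is made up for illustration, not part of ete3.

```python
import re

# Assumed row layout, taken from the excerpt quoted in this thread:
#   "<start>[- <end>] | <category> (probability > <threshold>)"
LINE_RE = re.compile(r"^\s*(\d+)\s*(?:-\s*(\d+))?\s*\|\s*(.+?)\s*$")


def expand_site_summary(text):
    """Map each codon position to the category label reported for it."""
    sites = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines and anything that is not a table row
        start = int(match.group(1))
        # A single position has no second number; treat it as a range of one.
        end = int(match.group(2)) if match.group(2) else start
        for codon in range(start, end + 1):
            sites[codon] = match.group(3)
    return sites


example = """
   68   | Conserved (probability > 0.95)
69- 70 | Positively-selected (probability > 0.99)
72- 73 | Conserved (probability > 0.95)
"""
sites = expand_site_summary(example)
```

Codons absent from the resulting dictionary (71 in this excerpt) are the ones for which nothing significant was reported.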

pedro-mmartins commented 2 years ago

Thanks a lot!

pedro-mmartins commented 2 years ago

Hello again!

I've come across another question. When comparing the models mentioned earlier (bsA and bsA1), I found some genes that had a significant p-value but no positively selected codons within their sequence (all of them are considered conserved by the model). So, just to confirm that I'm not interpreting this the wrong way: when all codons within a gene are under purifying selection, would the alternative model be found significant because that is also a deviation from the null expectation? Is this right?

Thanks a lot again!

fransua commented 2 years ago

Hi again :) It is actually possible for bsA to be the better model: globally, a class of sites with omega > 1 explains the alignment better, but individually no site has the statistical power to pass the 0.95 probability threshold. However, most of the time this happens when bsA1 is not a really good fit either... Did you test bsA1 vs M1? Generally in these cases M1 is the best fit (all branches evolving at the same rate, two classes of sites).

Summary of the steps: http://etetoolkit.org/cookbook/ete_evol_lysozyme_branch-site.ipynb
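The model comparisons discussed here are likelihood-ratio tests between nested models. ete3 reports these comparisons itself, so the sketch below only illustrates the arithmetic for the one-degree-of-freedom case (such as bsA vs bsA1, where the null fixes omega at 1). The function name `lrt_pvalue` and the log-likelihood values are hypothetical, not taken from any real run.

```python
import math


def lrt_pvalue(lnl_alt, lnl_null):
    """P-value of a likelihood-ratio test with one degree of freedom."""
    # Test statistic: twice the log-likelihood gain of the alternative model.
    # Clamped at zero in case numerical noise makes the null slightly better.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods for one gene (made-up numbers).
lnl_bsA = -2345.10   # alternative model
lnl_bsA1 = -2349.60  # null model
p_value = lrt_pvalue(lnl_bsA, lnl_bsA1)
significant = p_value < 0.05
```

A significant result here only says that the model with a class of sites with omega > 1 fits better overall; it does not by itself identify which sites are selected.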

pedro-mmartins commented 2 years ago

Oh, I see! I haven't tried that. Thanks a lot for the explanation, I'll try it out. :)

pedro-mmartins commented 2 years ago

Hello again! :)

Our cluster is not working very well, so it took me a while to run all the models. Now that I have my results, I have one (or two) more questions that I could not solve by reading the documentation.

Regarding the branch-site models we were discussing earlier, what would be the best approach to find the genes with positively selected sites? A colleague and I thought about selecting the genes with significant p-values when comparing bsA and M1, and then selecting from this subset the ones with significant p-values when comparing bsA and bsA1. Would that be right?
We're also testing branch models (M0, b_neut, and b_free). We thought about comparing M0 and b_free first and then, within the significant subset, comparing b_free and b_neut to get our final set of genes that are not evolving neutrally. Is that OK?

Thanks!