davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

cds vs aa runs - help interpreting results #509

Open jcerca opened 3 years ago

jcerca commented 3 years ago

Dear David,

OrthoFinder is such a key part of my work, and I am really happy with this brilliant piece of software. I've recently run OrthoFinder with 8 .faa sets and 8 cds.fa sets (all downloaded from NCBI, with the exception of my study species). They represent the same dataset, so the inputs are very similar. However, the two runs give very different results.

orthofinder -f . -t 50 # aa

orthofinder -f . -d -t 50 # cds

Run with amino acids: 755 single-copy orthologs

Run with cds: 180 single-copy orthologs

Tree with amino acids: (ARABI:0.215088,(CONYZ:0.147142,((CYNAR:0.160904,LSATI:0.110784)0.382882:0.029275,(MIKAN:0.121807,(HELIA:0.0975381,(SCSGB:0.08229,SCSGA:0.0799066)0.397024:0.0479484)0.631114:0.0587112)0.506629:0.0377795)0.219063:0.0283187)1:0.215088);

Tree with cds: (ARABI.:0.627765,((HELIA.cds:0.127897,(SCSGB.cds:0.168748,SCSGA.cds:0.159182)0.341639:0.129129)0.472928:0.181023,(MIKAN.cds:0.266436,(CONYZ.:0.390434,(CYNAR.:0.252489,LSATI.cds:0.331545)0.423575:0.135364)0.366076:0.138636)0.291806:0.153261)1:0.627765);
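The two species trees above can be compared clade by clade. What follows is a minimal Python sketch, not part of OrthoFinder: the helper names `normalize` and `clades` are hypothetical, and it assumes that labels such as `HELIA.cds` and `ARABI.` refer to the same species as `HELIA` and `ARABI`. It treats both trees as rooted and ignores branch lengths and support values.

```python
import re


def normalize(label):
    # Assumption: "HELIA.cds" and "ARABI." name the same species as "HELIA" and "ARABI".
    return label.strip().split(".")[0]


def clades(newick):
    """Return the set of clades (frozensets of leaf names) of a rooted Newick tree."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    found = set()
    stack = []   # one set of leaf names per currently open parenthesis
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.add(frozenset(done))
            if stack:
                stack[-1] |= done   # leaves also belong to the enclosing clade
        elif tok in ",;":
            pass
        elif prev != ")":
            # A leaf such as "HELIA.cds:0.127897"; a token that directly
            # follows ")" is a support value / branch length and is skipped.
            stack[-1].add(normalize(tok.split(":")[0]))
        prev = tok
    return found


# Toy example with made-up trees (not the trees from this thread).
tree_a = "((A:1,B:1)0.9:1,C:1);"
tree_b = "((A.cds:1,C.cds:1)0.8:1,B.:1);"
only_a = clades(tree_a) - clades(tree_b)
print(sorted(sorted(c) for c in only_a))  # [['A', 'B']]
```

Calling `clades()` on each of the two Newick strings and taking the set differences lists the groupings found in only one run. For the two trees posted here, the clades that differ involve the placement of CONYZ and MIKAN, while the CYNAR+LSATI and HELIA+SCSGA+SCSGB groupings appear in both.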

Duplications_per_Species_Tree_Node.tsv - amino acids

| Species Tree Node | Duplications (all) | Duplications (50% support) |
|---|---|---|
| ARABI | 9389 | 9389 |
| CONYZ | 21454 | 21454 |
| CYNAR | 3126 | 3126 |
| HELIA | 14403 | 14403 |
| LSATI | 10057 | 10057 |
| MIKAN | 18692 | 18692 |
| SCSGA | 1602 | 1602 |
| SCSGB | 1045 | 1045 |

Duplications_per_Species_Tree_Node.tsv - cds

| Species Tree Node | Duplications (all) | Duplications (50% support) |
|---|---|---|
| ARABI | 3049 | 3049 |
| CONYZ | 22043 | 22043 |
| CYNAR | 3785 | 3785 |
| HELIA | 11793 | 11793 |
| LSATI | 9163 | 9163 |
| MIKAN | 19242 | 19242 |
| SCSGA | 1440 | 1440 |
| SCSGB | 912 | 912 |

Orthogroups_SpeciesOverlaps.tsv - aa

| | ARABI | CONYZ | CYNAR | HELIA | LSATI | MIKAN | SCSGA | SCSGB |
|---|---|---|---|---|---|---|---|---|
| ARABI | 13600 | 11375 | 11860 | 10240 | 12099 | 11495 | 9476 | 9283 |
| CONYZ | 11375 | 15585 | 13035 | 11729 | 13524 | 13189 | 10391 | 10162 |
| CYNAR | 11860 | 13035 | 15281 | 11947 | 14022 | 13166 | 10627 | 10436 |
| HELIA | 10240 | 11729 | 11947 | 15907 | 12296 | 12123 | 10096 | 9782 |
| LSATI | 12099 | 13524 | 14022 | 12296 | 16829 | 13739 | 10850 | 10632 |
| MIKAN | 11495 | 13189 | 13166 | 12123 | 13739 | 16591 | 10468 | 10205 |
| SCSGA | 9476 | 10391 | 10627 | 10096 | 10850 | 10468 | 12169 | 10347 |
| SCSGB | 9283 | 10162 | 10436 | 9782 | 10632 | 10205 | 10347 | 11605 |

Orthogroups_SpeciesOverlaps.tsv - cds

| | ARABI | CONYZ | CYNAR | HELIA | LSATI | MIKAN | SCSGA | SCSGB |
|---|---|---|---|---|---|---|---|---|
| ARABI | 6432 | 3001 | 3107 | 2916 | 3185 | 3031 | 2751 | 2703 |
| CONYZ | 3001 | 16218 | 13270 | 11692 | 13061 | 12667 | 10661 | 10463 |
| CYNAR | 3107 | 13270 | 17592 | 12636 | 15474 | 13999 | 11545 | 11307 |
| HELIA | 2916 | 11692 | 12636 | 18362 | 12491 | 12942 | 11920 | 11536 |
| LSATI | 3185 | 13061 | 15474 | 12491 | 18575 | 13805 | 11334 | 11133 |
| MIKAN | 3031 | 12667 | 13999 | 12942 | 13805 | 18546 | 11556 | 11311 |
| SCSGA | 2751 | 10661 | 11545 | 11920 | 11334 | 11556 | 14643 | 12445 |
| SCSGB | 2703 | 10463 | 11307 | 11536 | 11133 | 11311 | 12445 | 13989 |

Statistics_Overall.tsv - aa

| Statistic | Value |
|---|---|
| Number of species | 8 |
| Number of genes | 270888 |
| Number of genes in orthogroups | 251594 |
| Number of unassigned genes | 19294 |
| Percentage of genes in orthogroups | 92.9 |
| Percentage of unassigned genes | 7.1 |
| Number of orthogroups | 25469 |
| Number of species-specific orthogroups | 6983 |
| Number of genes in species-specific orthogroups | 43464 |
| Percentage of genes in species-specific orthogroups | 16.0 |
| Mean orthogroup size | 9.9 |
| Median orthogroup size | 7.0 |
| G50 (assigned genes) | 13 |
| G50 (all genes) | 13 |
| O50 (assigned genes) | 4988 |
| O50 (all genes) | 5730 |
| Number of orthogroups with all species present | 6788 |
| Number of single-copy orthogroups | 755 |

Statistics - cds

| Statistic | Value |
|---|---|
| Number of species | 8 |
| Number of genes | 270867 |
| Number of genes in orthogroups | 232718 |
| Number of unassigned genes | 38149 |
| Percentage of genes in orthogroups | 85.9 |
| Percentage of unassigned genes | 14.1 |
| Number of orthogroups | 33328 |
| Number of species-specific orthogroups | 11111 |
| Number of genes in species-specific orthogroups | 48533 |
| Percentage of genes in species-specific orthogroups | 17.9 |
| Mean orthogroup size | 7.0 |
| Median orthogroup size | 5.0 |
| G50 (assigned genes) | 9 |
| G50 (all genes) | 8 |
| O50 (assigned genes) | 6899 |
| O50 (all genes) | 9136 |
| Number of orthogroups with all species present | 2087 |
| Number of single-copy orthogroups | 180 |

Judging from the statistics, it seems that the matching of orthologues worked better with the *cds files. Is this expected? Should I expect striking differences between the aa and cds runs?

I believe the next thing I'll try is to generate my own aa and cds files from the gff3 files instead of relying on those uploaded to NCBI. But a simple grep -c ">" on the faa and cds files recovers similar numbers of sequences for both.
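A header count like `grep -c ">"` shows only how many records each file has. As a hedged illustration, the Python sketch below adds two further checks that a count cannot give: whether any record IDs are duplicated, and whether any CDS has a length that is not a multiple of three. The function names are hypothetical, and the assumption that a record ID is the header text up to the first whitespace may not hold for every file.

```python
def count_headers(text):
    """Count FASTA header lines, equivalent to grep -c ">" when ">" only starts headers."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}.

    Assumption: the record ID is the header up to the first whitespace.
    Duplicate IDs overwrite each other, so len(result) < count_headers(text)
    signals duplicated IDs.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = (line[1:].split() or [""])[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def cds_not_multiple_of_three(cds_records):
    """Return the sorted IDs of CDS records whose length is not divisible by 3."""
    return sorted(name for name, seq in cds_records.items() if len(seq) % 3 != 0)


# Toy input, not data from this thread.
example = ">g1 some description\nATGAAA\nTAA\n>g2\nATGA\n"
records = read_fasta(example)
print(count_headers(example), len(records), cds_not_multiple_of_three(records))
# prints: 2 2 ['g2']
```

To use it on real files, read each file's contents with `open(path).read()` and pass the text to these functions. Comparing the set of IDs from the .faa file with the set from the cds file would need a mapping between protein and CDS identifiers, which this sketch does not attempt.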

Thank you for your time! José

davidemms commented 3 years ago

Hi José

This sounds like a great dataset to dive into. I haven't done any side-by-side comparisons myself. I guess the most interesting thing would be cases where one method says a pair of genes are orthologs, and this is supported by the gene tree, whereas the other method says they are not. The orthogroup numbers you've posted are from before the phylogenetic analyses have been performed. What are the statistics like for orthologs? And are the N0.tsv orthogroups (the phylogenetically derived orthogroups) more similar to one another than the original clustering-based Orthogroups.tsv files?
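One way to put a number on "more similar to one another" is to compare the gene sets of the groups from the two runs. The sketch below is only an illustration under several assumptions: the function names are hypothetical; the tables are assumed to be tab-separated with a header row, with genes listed comma-separated in the species columns; the number of leading non-gene columns is passed in as `n_meta_cols` (1 is assumed here for Orthogroups.tsv and 3 for N0.tsv, so check the header row of your own files); and gene IDs must already be identical between the aa and cds runs, which may require renaming first.

```python
def read_groups(tsv_text, n_meta_cols=1):
    """Parse an orthogroup table into a list of frozensets of gene IDs."""
    groups = []
    for row in tsv_text.splitlines()[1:]:            # skip the header row
        cells = row.split("\t")[n_meta_cols:]        # drop ID/metadata columns
        genes = {g.strip() for cell in cells for g in cell.split(",") if g.strip()}
        if genes:
            groups.append(frozenset(genes))
    return groups


def identical_fraction(groups_a, groups_b):
    """Fraction of groups in A that occur in B with exactly the same members."""
    if not groups_a:
        return 0.0
    set_b = set(groups_b)
    return sum(g in set_b for g in groups_a) / len(groups_a)


def mean_best_jaccard(groups_a, groups_b):
    """Mean, over groups in A, of the best Jaccard index against any group in B."""
    owner = {gene: g for g in groups_b for gene in g}   # gene -> its group in B
    scores = []
    for ga in groups_a:
        candidates = {owner[gene] for gene in ga if gene in owner}
        best = max((len(ga & gb) / len(ga | gb) for gb in candidates), default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


# Toy tables with made-up gene IDs.
run_a = "Orthogroup\tsp1\tsp2\nOG0\ta1, a2\tb1\nOG1\ta3\tb2\n"
run_b = "Orthogroup\tsp1\tsp2\nOG0\ta1, a2\tb1\nOG1\ta3\t\nOG2\t\tb2\n"
ga, gb = read_groups(run_a), read_groups(run_b)
print(identical_fraction(ga, gb), mean_best_jaccard(ga, gb))  # 0.5 0.75
```

Running the same comparison once on the two Orthogroups.tsv files and once on the two N0.tsv files would give a rough answer to the question of which pair agrees more closely.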

All the best David