davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0
706 stars 188 forks source link

Single variant per gene? #426

Closed nat2bee closed 4 years ago

nat2bee commented 4 years ago

Hi Davi,

I'm using OrthoFinder for the first time to compare some ant genomes and it is a very nice tool! I just have a question (maybe silly) about how not using a single variant per gene could affect the analyses.

I'm using the genes provided along the genome on NCBI for the comparisons. When I look at the fasta header of these sequences I can see we there are different isoforms for the same gene, like "gene X isoform Y1" and "gene X isoform Y2".
I tried to use the script you provide to deal with it (primary_transcript.py) but it does not remove any gene from these files. So I used a pipeline a colleague sent me based on the features table from NCBI, something like that:

for i in `cat list`; do less "$i".feature | fgrep "protein_coding" | awk '{print $12}' > tmp;
less "$i".feature | fgrep -f tmp | awk 'BEGIN{FS="\t";OFS="\t"}($1=="CDS"){print $11,$13,$16,$19}' | sort -k3,3 -k4,4nr | awk 'BEGIN{FS="\t";OFS="\t";l=0;g=0}{if($3==g) {if ($4>l) {print $0; g=$3; l=$4}} else {print $0; g=$3; l=$4} }' | awk '{print $1}' > tmp2;
fgrep -A 1 -f tmp2 "$i".faa | awk '{print $1}' > "$i".longestiso.fa ; done

This apparently works, but when I run Orthofinder in these filtered gene files the tree returned is not the expected one. When I run using the full gene files the tree is correct. Have you seen that before? I found in the tutorial that using the full gene files the run would take "longer than necessary and could lower the accuracy". I understand why it would take longer to run, but how does it affect accuracy? At first, I thought that this could increase the number of estimate gene duplications. Is that so?

Thanks a lot!

davidemms commented 4 years ago

Hi Natalia

I think the biggest issue accuracy wise is how to interpret the results. Generally we'd expect the set of isoforms for a gene to have the same evolutionary history and so for the orthogroups/orthologs they should all be grouped together. If there are cases when they aren't you might want to follow the majority of the isoforms for each gene. You will get the multiple isoforms labelled as coming from gene duplication events, most likely as terminal duplications (within an individual species rather than internal branches in the species tree). Beyond that I wouldn't normally think it would affect the accuracy too much but I haven't really tested using it that way.

In terms of the species trees, are you using the default tree inference option or the "-M msa" option? With the MSA option it could be a lot harder to infer the species tree when the multiple isoforms are present because there will be far fewer single-copy orthogroups. With the default option I think it would be more robust to these extra isoforms, but I haven't tested that. I can't think of reasons why it would be less accurate with just the longest coding variant. That's provided you are confident that you've selected them correctly? Have you got a link to one of the proteomes on NCBI that you ran the bash script on?

I guess the question for the species tree is how large is the difference between the two trees and what support values do the bipartitions have that are different between the two trees. It's also worth looking at how much data was used for the species tree inference. E.g. on the Example Dataset the default method prints:

269 trees had all species present and will be used by STAG to infer the species tree

And the MSA version:

Species tree: Using 246 orthogroups with minimum of 100.0% of species having single-copy genes in any orthogroup

Thanks also for alerting me to the issue of the NCBI files, I'll have a look at the primary transcripts script and to try and extend it to selecting the primary transcripts from these files too.

All the best David

nat2bee commented 4 years ago

Hi David, thanks for the response!

I see... So, if all species have the same number of isoforms that wouldn't be too much of problem right? I mean it wouldn't be a terminal duplication. Then, I guess the problem would be how well annotated all isoforms are for the different species being compared. Using all isoforms would make the comparison more sensitive to annotation bias.

Fo the tree, I actually see the opposite; when I use the "-M msa" option I got the tree I expected, but not with the default method. This is the tree I expect:

((Ooceraea_biroi:115.580959,((Linepithema_humile:100.556808,Pseudomyrmex_gracilis:100.556808):7.065545,(((Camponotus_floridanus:47.844355,(Cataglyphys_hispanica:27.788608,Formica_exsecta:27.788608):20.055747):9.119168,(Lasius_niger:46.774628,Nylanderia_fulva:46.774628):10.188895):42.560401,(Pogonomyrmex_barbatus:76.892943,((Temnothorax_curvispinosus:54.364423,Vollenhovia_emeryi:54.364423):12.038387,((Monomorium_pharaonis:48.307845,Solenopsis_invicta:48.307845):16.224257,(Wasmannia_auropunctata:58.533107,(Cyphomyrmex_costatus:33.869721,(Trachymyrmex_cornetzi:19.565510,(Atta_cephalotes:15.683135,Acromyrmex_echinatior:15.683135):3.882375):14.304211):24.663387):5.998995):1.870708):10.490133):22.630981):8.098430):7.958605):34.419041,(Harpegnathos_saltator:76.485454,Odontomachus_brunneus:76.485454):73.514546)root;

This is the tree I got using only the longest isoforms, with default:

((((Camponotus_floridanus:0.0617965,(Formica_exsecta:0.0402856,Cataglyphys_hispanica:0.0527318)0.661397:0.0240783)0.406033:0.0148713,(Lasius_niger:0.163545,Nylanderia_fulva:0.0735503)0.338764:0.0210637)0.570253:0.0431041,((Pseudomyrmex_gracilis:0.151211,Linepithema_humile:0.107771)0.18503:0.0125544,(Ooceraea_biroi:0.123907,(Harpegnathos_saltator:0.084069,Odontomachus_brunneus:0.0823861)0.899177:0.0829498)0.230198:0.0139497)0.186966:0.0112059)0.505243:0.0147961,((((Temnothorax_curvispinosus:0.0686123,Vollenhovia_emeryi:0.0776034)0.38974:0.0142743,(Wasmannia_auropunctata:0.0729127,(Cyphomyrmex_costatus:0.0548612,(Trachymyrmex_cornetzi:0.0285164,(Acromyrmex_echinatior:0.0285931,Atta_cephalotes:0.0387518)0.492821:0.0138892)0.666398:0.0239013)0.687046:0.0350294)0.223584:0.0109954)0.102759:0.00672683,(Monomorium_pharaonis:0.0725508,Solenopsis_invicta:0.065952)0.543313:0.0211609)0.35909:0.0171238,Pogonomyrmex_barbatus:0.0960639)0.505243:0.0147961);

And this is the species tree I got using the -M msa option, with the longest isoforms too:

((Harpegnathos_saltator:0.0562411,Odontomachus_brunneus:0.0530005)1:0.0411155,(Ooceraea_biroi:0.0988904,((Pseudomyrmex_gracilis:0.127034,Linepithema_humile:0.0816332)1:0.0115455,((Pogonomyrmex_barbatus:0.066228,((Vollenhovia_emeryi:0.0562465,Temnothorax_curvispinosus:0.0385064)1:0.0112041,((Wasmannia_auropunctata:0.0496312,(Cyphomyrmex_costatus:0.0349674,(Trachymyrmex_cornetzi:0.0140884,(Acromyrmex_echinatior:0.0143472,Atta_cephalotes:0.0175427)1:0.00456519)1:0.014257)1:0.029415)1:0.00768813,(Monomorium_pharaonis:0.0509217,Solenopsis_invicta:0.0441911)1:0.0188179)1:0.0033157)1:0.0156025)1:0.0310489,(((Formica_exsecta:0.0194831,Cataglyphys_hispanica:0.0286259)1:0.0170213,Camponotus_floridanus:0.0457785)1:0.00881532,(Lasius_niger:0.0331037,Nylanderia_fulva:0.0496313)1:0.0100505)1:0.0436927)1:0.00903953)1:0.0125881)1:0.0411155);

With the full protein set, using the default (didn't even tried the msa) I got the expected species tree.

I believe the filtering is working fine because it matches the annotation report. I'm giving you the example of Wasmannia_auropunctata; the full set have 24,085 proteins in the fasta file, after the filtering it remains 13,668 proteins, which is the number of protein coding genes according to the annotation report. This also exclude pseudogenes and non coding genes.

Indeed the two methods use a very different amount of data! Using the default:

6199 trees had all species present and will be used by STAG to infer the species tree

Using the msa:

Species tree: Using 3365 orthogroups with minimum of 100.0% of species having single-copy genes in any orthogroup

The number of orthogroups is the same in both:

OrthoFinder assigned 250595 genes (95.1% of total) to 13792 orthogroups. Fifty percent of all genes were in orthogroups with 20 or more genes (G50 was 20) and were contained in the largest 4529 orthogroups (O50 was 4529). There were 6199 orthogroups with all species present and 3365 of these consisted entirely of single-copy genes.

Finally, they differ when choosing the best outgroups for the tree. Default:

2020-07-12 18:47:57 : Done STRIDE
Observed 52 well-supported, non-terminal duplications. 52 support the best root and 0 contradict it.
Best outgroup for species tree:
  Odontomachus_brunneus, Nylanderia_fulva, Ooceraea_biroi, Pseudomyrmex_gracilis, Cataglyphys_hispanica, Camponotus_floridanus, Formica_exsecta, Lasius_niger, Harpegnathos_saltator, Linepithema_humile

Msa:

2020-07-12 23:33:36 : Done STRIDE
Observed 84 well-supported, non-terminal duplications. 84 support the best root and 0 contradict it.
Best outgroup for species tree:
  Odontomachus_brunneus, Harpegnathos_saltator

Just for comparison, using the full protein set and default:

Writing orthogroups to file
---------------------------
OrthoFinder assigned 226704 genes (95.6% of total) to 14167 orthogroups. Fifty percent of all genes were in orthogroups with 20 or more genes (G50 was 20) and were contained in the largest 3535 orthogroups (O50 was 3535). There were 8004 orthogroups with all species present and 851 of these consisted entirely of single-copy genes.
Inferring gene and species trees
--------------------------------

....

Inferring gene and species trees
--------------------------------
8004 trees had all species present and will be used by STAG to infer the species tree

Best outgroup(s) for species tree
---------------------------------
2020-06-26 15:51:58 : Starting STRIDE
2020-06-26 15:52:01 : Done STRIDE
Observed 109 well-supported, non-terminal duplications. 99 support the best root and 10 contradict it.
Best outgroup for species tree:
  Harpegnathos_saltator

All best, Natalia

davidemms commented 4 years ago

Hi Natalia

The number of isoforms won't matter too much as OrthoFinder infers gene trees. The isoforms should just all be sorted into their correct position (most likely terminal, which is fine). You could pick an example tree and take a look how this works out (the orthogroups are sorted in size order so something like orthogroup 1000 is normally big enough to be interesting while not being too large to understand). So it should be fairly robust to differences in annotation quality.

The two species trees you've inferred are actually a lot more similar than they first appear. Tree inference methods infer an unrooted species tree, it's then necessary to identify where the root is, which is a separate step, usually done manually. If you reroot the "longest isoforms, with default" species tree on the pair (Harpegnathos_saltator, Odontomachus_brunneus) you will see it now looks very similar to what you believe the true tree to be. The choice of root for the species tree can be quite difficult as there is usually very little information with which to do it (that's why not tree inference software attempts to do it).

All the best David

nat2bee commented 4 years ago

Thank you, David. That's all much clearer now!

All best, Natalia