davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

what is the Orthogroup column in N*.tsv? #910

Open aguang opened 1 week ago

aguang commented 1 week ago

I'm just wondering, what does the Orthogroup column in N*.tsv mean? I think the distinct orthogroups should be the HOGs, but sometimes different rows have the same Orthogroup ID, and I'm not sure what, if anything, that is supposed to mean. Are they Orthogroup IDs for a given gene tree, and otherwise unrelated?

Example (I am looking at these files in R):

  HOG           OG        Gene_Tree_Parent_Clade S1                          S2                          S3    S4    S5          S6
  <chr>         <chr>     <chr>                  <chr>                       <chr>                       <chr> <chr> <chr>       <chr>
1 N0.HOG0025976 OG0019568 n1                     NA                          TRINITY_DN37348_c1_g7_i1.p1 NA    NA    TRINITY_DN… TRIN…
2 N0.HOG0025977 OG0019568 n3                     TRINITY_DN55820_c0_g1_i1.p1 NA                          NA    NA    TRINITY_DN… TRIN…

lauriebelch commented 1 week ago

Hi aguang,

The HOGs (1st column) are the distinct orthogroups. In the example you have shown, two HOGs have the same Orthogroup ID (OG0019568). This is because those two HOGs were mistakenly merged into one orthogroup in the clustering step. OrthoFinder sees that there is a duplication at the root, and so correctly separates them into two HOGs in the N0 file.
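If you want to see how often this happens in your data, here is a minimal sketch in Python (standard library only) that lists every OG ID shared by more than one HOG. It assumes a tab-separated table whose header contains `HOG` and `OG` columns, as in the example above. The function names `hogs_per_og` and `split_ogs` are made up for illustration and are not part of OrthoFinder, and the third toy row is invented for contrast.

```python
import csv
import io
from collections import defaultdict


def hogs_per_og(tsv_file):
    """Map each OG ID to the list of HOG IDs that carry it.

    Assumption: the table is tab-separated and its header has
    'HOG' and 'OG' columns, as in the example in this thread.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        groups[row["OG"]].append(row["HOG"])
    return dict(groups)


def split_ogs(groups):
    """Keep only the OGs that were divided into more than one HOG."""
    return {og: hogs for og, hogs in groups.items() if len(hogs) > 1}


# Toy input mirroring the two rows from the question, plus one
# invented row (N0.HOG0000001 / OG0000001) that was not split.
example = io.StringIO(
    "HOG\tOG\tGene Tree Parent Clade\n"
    "N0.HOG0025976\tOG0019568\tn1\n"
    "N0.HOG0025977\tOG0019568\tn3\n"
    "N0.HOG0000001\tOG0000001\tn0\n"
)
result = split_ogs(hogs_per_og(example))
print(result)
```

For a real file you would pass an open file handle instead of the toy string, for example `with open("N0.tsv", newline="") as fh: split_ogs(hogs_per_og(fh))`.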

There is some discussion of this in https://github.com/davidemms/OrthoFinder/issues/367

Hope this is useful!

Thanks,

Laurie