davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Number of genes appearing in N0.tsv, Orthogroups.GeneCounts.tsv and Statistics_perspecies does not agree #600

Open cgrootcrego opened 3 years ago

cgrootcrego commented 3 years ago

Hello,

Apologies if this issue was raised before, I couldn't find it after a search through existing issues.

Since I am very interested in the number of genes per species and orthogroup, I have been looking into this, but I am quite confused about which output file reflects the most accurate and up-to-date result. Since the /Orthogroups directory is deprecated, I turned to using N0.tsv for my analyses and realized that the total number of genes in orthogroups per species differs between the two approaches, which makes sense. However, the statistics in /Comparative_genomics_Statistics report the gene numbers from Orthogroups.GeneCounts.tsv and not from N0.tsv.

To give you an example:

Number of genes in Orthogroups.GeneCounts.tsv:

Number of genes in orthogroups according to N0.tsv:

Number of genes in orthogroups reported in Statistics_perSpecies:

So, does this mean the Statistics_PerSpecies file is also deprecated and does not represent accurate statistics, or is the N0.tsv file not reporting all genes in orthogroups? I eventually want to know which orthogroup each gene belongs to and what the per-species count is in each orthogroup. Originally I compiled this info from Orthogroups.GeneCount.tsv and Orthogroups.tsv. Since there is no count file associated with N0.tsv, I have made a script to create such a file from scratch. But now I am no longer sure which file is reporting the correct numbers.

Thanks a lot and sorry for my confusion!

Clara

matrs commented 3 years ago

I'm wondering the same. It seems that all the files inside Comparative_genomics_Statistics/ are made from the files in Orthogroups/, but I'm not 100% sure.

davidemms commented 3 years ago

Hi Clara

That's right, the Statistics_PerSpecies file hasn't been updated yet to reflect the numbers in the N*.tsv files. That will happen in OrthoFinder 3. You can run the "tools/orthogroup_gene_count.py" script on the N0.tsv file to get the gene counts for that or any other hierarchical orthogroup file.

Best wishes,
David
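For readers who want to see what such a count amounts to, here is a minimal Python sketch of deriving per-species gene counts from a hierarchical orthogroup file. It is not the "tools/orthogroup_gene_count.py" script mentioned above, and it rests on an assumption about the file layout: tab-separated, with three leading metadata columns (HOG, OG, Gene Tree Parent Clade) followed by one column per species, each cell holding a comma-separated gene list. The function names and the sample data are hypothetical and only there for illustration.

```python
import csv
import io

# Assumption: the first three columns are metadata (HOG, OG, Gene Tree Parent Clade)
# and every remaining column is a species. Adjust if your file differs.
META_COLUMNS = 3


def count_genes(handle):
    """Return (species, counts); counts maps each HOG id to per-species gene counts."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    species = header[META_COLUMNS:]
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        cells = row[META_COLUMNS:]
        # Pad rows whose trailing empty cells were dropped.
        cells += [""] * (len(species) - len(cells))
        counts[row[0]] = [
            len([gene for gene in cell.split(",") if gene.strip()])
            for cell in cells
        ]
    return species, counts


def totals_per_species(species, counts):
    """Sum the per-HOG counts into one total per species."""
    totals = [0] * len(species)
    for row in counts.values():
        for i, n in enumerate(row):
            totals[i] += n
    return dict(zip(species, totals))


# Made-up sample in the assumed layout, for demonstration only.
SAMPLE = (
    "HOG\tOG\tGene Tree Parent Clade\tSpeciesA\tSpeciesB\n"
    "N0.HOG0000000\tOG0000000\tn0\ta1, a2\tb1\n"
    "N0.HOG0000001\tOG0000000\tn5\ta3\t\n"
)

species, counts = count_genes(io.StringIO(SAMPLE))
print(species)
print(counts)
print(totals_per_species(species, counts))
```

To run it on real output, open the file and pass the handle instead, for example `with open("N0.tsv", newline="") as fh: species, counts = count_genes(fh)`. The per-species totals are the numbers to compare against Statistics_PerSpecies, bearing in mind that the latter is still derived from the files in Orthogroups/.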