davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Some single copy OGs from N0.tsv have more than one gene per species in Orthogroup_Sequences #605

Open matrs opened 3 years ago

matrs commented 3 years ago

Hello. Using N0.tsv, I'm trying to define the set of single-copy orthogroups present in all species. In my case, using 198 bacterial genomes, I get 135 single-copy orthogroups. When I check the sequences of these 135 orthogroups in Orthogroup_Sequences, I get a few OGs that have more than one gene per species:

OG0000338
GCF_000723745.2_PRJEB6027_Urmite_genomic ['CDPOOEEH_02301', 'CDPOOEEH_02302']
GCF_004015005.1_ASM401500v1_genomic ['ENBNGPOG_02495', 'ENBNGPOG_02496']
OG0000347
GCF_008423175.1_ASM842317v1_genomic ['FJFNHMKP_00227', 'FJFNHMKP_01063']
OG0000375
GCF_000723745.2_PRJEB6027_Urmite_genomic ['CDPOOEEH_02123', 'CDPOOEEH_02124']
OG0000402
GCF_004015285.1_ASM401528v1_genomic ['LHHNCKFK_02445', 'LHHNCKFK_02446']
OG0000408
GCF_004015305.1_ASM401530v1_genomic ['LIBLEENJ_01444', 'LIBLEENJ_01445']
OG0000435
GCF_010229695.1_ASM1022969v1_genomic ['ADBMNDNC_01402', 'ADBMNDNC_01403']

My first question: Is it expected to find single-copy orthogenes from N0.tsv that have more than one gene per species in the OGxxxxx.fa files?

When I check the folder Single_Copy_Orthologue_Sequences, I see 129 sequences, which are exactly the 135 OGs obtained from N0.tsv minus these six OGs with more than one sequence. My second question is:

Is the Single_Copy_Orthologue_Sequences the recommended way to define single copy orthogenes present in all the species? (As I understand it, this folder comes from the old way to define OGs).

When I searched for information about the above genes in Putative_Xenologs/ and Phylogenetically_Misplaced_Genes/, in some cases I see both of them, and in others only one of them. If I use the Nx.tsv files to define sets of single-copy OGs, what do I have to check to avoid problems like the one described here?

Not directly related, but while searching for information about this, I saw a couple of posts talking about Putative_Horizontal_Gene_Transfer.txt, but none of my runs has such a file.

I'm running OrthoFinder v2.5.4 (the latest git version, which addresses #602).

All the above is valid for two runs:

 -ft  Results_Jul29 -y 
 -ft  Results_Jul29 -s tree -y 

Regards,

Jose Luis

ViriatoII commented 3 years ago

Hi, that's right. In my runs I get messages like this:

Analysing Orthogroups
=====================
2021-08-14 10:48:11 : Starting MSA/Trees
Species tree: Using 102 orthogroups with minimum of 63.3% of species having single-copy genes in any orthogroup

So these "single copy orthogroups" can have up to 36.7% of species (in my case that means 9 species) where genes are not in a single copy. They are either missing or in multiple copies, which is understandable because I have many fragmented assemblies where genes are missing or wrongly duplicated.

matrs commented 3 years ago

Hi @ViriatoII, thanks for your reply. I'm aware of that fact, but my point is about using N0.tsv (or another Nx.tsv) to define single-copy genes. My N0.tsv tells me that x OGs are single copy, but when I check those OGxxxxx.fa files, I see a few with more than one copy. That is unexpected to me, given that my N0 file is telling me that those OGs have only single-copy genes in all the species.

davidemms commented 3 years ago

Hi Jose Luis

The OG files are from the initial OGs, calculated using MCL clustering, described in Orthogroups/Orthogroups.tsv. The N0.tsv is determined using phylogenetic means to split fused orthogroups from the initial MCL analysis. You can use the tool OrthoFinder/tools/create_files_for_hogs.py to write out the corresponding fasta files for these HOGs.

The Single_Copy_Orthologue_Sequences folder comes from the old way to define OGs, as you say. If you want to calculate them for the N0.tsv file, you can use the tool tools/orthogroup_gene_count.py to get a count of genes per orthogroup per species. This can be analysed in Excel or a scripting language to find those orthogroups that are single-copy in all species.
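The "scripting language" route can be sketched in a few lines of Python. This is only an illustration, not the logic of tools/orthogroup_gene_count.py: it assumes N0.tsv is tab-separated, starts with the three columns `HOG`, `OG` and `Gene Tree Parent Clade`, has one column per species after that, and lists genes in a cell separated by commas. The table and the function names are made up for the example.

```python
import csv
import io

# Assumed non-species columns at the start of N0.tsv
META_COLUMNS = ("HOG", "OG", "Gene Tree Parent Clade")

def count_genes(cell):
    """Count comma-separated gene IDs in one table cell (empty cell -> 0)."""
    return len([g for g in cell.split(",") if g.strip()])

def single_copy_hogs(handle):
    """Return the IDs of HOGs with exactly one gene in every species column."""
    reader = csv.DictReader(handle, delimiter="\t")
    species = [c for c in reader.fieldnames if c not in META_COLUMNS]
    hits = []
    for row in reader:
        # 'or ""' guards against short rows, where DictReader yields None
        if all(count_genes(row[sp] or "") == 1 for sp in species):
            hits.append(row["HOG"])
    return hits

# Tiny made-up table standing in for a real N0.tsv
example = (
    "HOG\tOG\tGene Tree Parent Clade\tspA\tspB\n"
    "N0.HOG0000001\tOG0000001\tn0\ta1\tb1\n"      # single copy everywhere
    "N0.HOG0000002\tOG0000002\tn0\ta2, a3\tb2\n"  # two genes in spA
    "N0.HOG0000003\tOG0000003\tn0\ta4\t\n"        # missing in spB
)
print(single_copy_hogs(io.StringIO(example)))  # ['N0.HOG0000001']
```

For a real run, pass an open file handle for N0.tsv instead of the `io.StringIO` stand-in.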

Are you saying that you are seeing cases of genes listed in "Phylogenetically_Misplaced_Genes/" that are present and single copy in all species in the N0.tsv file? If so, could you send an example (gene tree and the name of the affected gene), as this isn't something I'd expect.

In general, genes in "Phylogenetically_Misplaced_Genes/" are those that appear to be out of place in the gene tree and would otherwise negatively affect orthology analysis if not identified. They are identified algorithmically, so if you're concerned about them with respect to single-copy genes, it might be worth doing some quality control checks in the gene trees for a few cases to see if you need to exclude them.

All the best David

matrs commented 3 years ago

Hi David, thanks for your thorough reply. I think I now understand better how this works. When I use create_files_for_hogs.py and use N0.tsv to define genes that are present and single copy in all the species, I get HOGs that effectively have only one gene in all the species. I wanted to define HOGs of genes that are present and single-copy in all the species, but I was using the fasta files in Orthogroups/. Related to this:

Are you saying that you are seeing cases of genes listed in "Phylogenetically_Misplaced_Genes/" that are present and single copy in all species in the N0.tsv file? If so, could you send an example (gene tree and the name of the affected gene), as this isn't something I'd expect.

As I tried to explain in the first post, when I use OGs that are single-copy and present in all the species according to N0.tsv, I do see a few of those OGs in the Orthogroups/ directory which have more than one copy per species (which didn't make sense to me, and is why I started this thread). In these particular cases, i.e. single copy according to N0 but not single copy according to the OG files inside Orthogroups/, the genes are present in Phylogenetically_Misplaced_Genes/. As I understand it now, this procedure is wrong: when defining orthogroups which are single-copy and present in all species using the Nx.tsv files, I have to use the HOGs generated by create_files_for_hogs.py and not the OGs in the Orthogroups/ directory. Is this right?
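The mismatch described above can be listed directly by comparing each single-copy HOG with its parent OG. The following is a rough sketch under assumptions: N0.tsv has the leading columns `HOG`, `OG` and `Gene Tree Parent Clade`, Orthogroups.tsv has a leading `Orthogroup` column, both use the same species column names, and genes in a cell are comma-separated. The tables and function names are invented for illustration.

```python
import csv
import io

def genes(cell):
    """Split a table cell into a list of gene IDs (None or empty -> [])."""
    return [g.strip() for g in (cell or "").split(",") if g.strip()]

def multi_copy_in_parent_og(n0_handle, og_handle):
    """For each HOG that is single-copy in all species, report the species
    in which its parent OG holds more than one gene."""
    og_rows = {r["Orthogroup"]: r for r in csv.DictReader(og_handle, delimiter="\t")}
    reader = csv.DictReader(n0_handle, delimiter="\t")
    meta = ("HOG", "OG", "Gene Tree Parent Clade")  # assumed N0.tsv layout
    species = [c for c in reader.fieldnames if c not in meta]
    report = {}
    for row in reader:
        if not all(len(genes(row[sp])) == 1 for sp in species):
            continue  # not single-copy in every species
        parent = og_rows.get(row["OG"])
        if parent is None:
            continue  # parent OG not found in the OG table
        extra = {sp: genes(parent.get(sp)) for sp in species
                 if len(genes(parent.get(sp))) > 1}
        if extra:
            report[row["HOG"]] = extra
    return report

# Made-up stand-ins for N0.tsv and Orthogroups/Orthogroups.tsv
n0_text = (
    "HOG\tOG\tGene Tree Parent Clade\tspA\tspB\n"
    "N0.HOG0000001\tOG0000001\tn0\ta1\tb1\n"
    "N0.HOG0000002\tOG0000002\tn0\ta5\tb5\n"
)
og_text = (
    "Orthogroup\tspA\tspB\n"
    "OG0000001\ta1, a2\tb1\n"   # extra spA gene that the HOG no longer contains
    "OG0000002\ta5\tb5\n"
)
print(multi_copy_in_parent_og(io.StringIO(n0_text), io.StringIO(og_text)))
# {'N0.HOG0000001': {'spA': ['a1', 'a2']}}
```

Genes reported this way are the ones worth looking up in the gene trees before deciding whether to keep the HOG.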

I found a couple of errors in create_files_for_hogs.py. On line 72 it failed because my runs took out a gene, so one line from Log.txt started with #. Skipping such lines fixes it:

i_species = [int(l.split(":")[0]) for l in species_ids_lines if l != "" and not l.startswith('#')]
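A small sketch of why the extra condition matters, using made-up input lines (the real file contents are not shown in this thread). Without the `startswith('#')` check, `int()` is handed a string such as `"#1"` and raises `ValueError`.

```python
# Made-up lines; the second mimics an entry that was commented out with '#'
species_ids_lines = ["0: speciesA.fa", "#1: speciesB.fa", "2: speciesC.fa", ""]

def parse_species_indices(lines):
    """Parse the integer before ':' on each line, skipping blank and '#' lines."""
    return [int(l.split(":")[0]) for l in lines
            if l != "" and not l.startswith("#")]

print(parse_species_indices(species_ids_lines))  # [0, 2]

# The unfiltered version fails on the commented line
try:
    [int(l.split(":")[0]) for l in species_ids_lines if l != ""]
except ValueError as err:
    print("unfiltered parse failed:", err)
```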

Also, on line 238, args.orthofinder_results_dir doesn't exist (Namespace error), but args.orthofinder_results does:

orthofinder_results_dir = args.orthofinder_results
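The Namespace error follows from how `argparse` names attributes: the parsed value is stored under the argument's own name (its `dest`), so any other attribute name is simply absent. The argument definition below is hypothetical, since the script's actual `add_argument` call is not shown in this thread.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical definition: a positional argument called "orthofinder_results"
parser.add_argument("orthofinder_results")

args = parser.parse_args(["Results_Jul29"])

print(args.orthofinder_results)                  # Results_Jul29
print(hasattr(args, "orthofinder_results_dir"))  # False

# Reading the missing attribute raises AttributeError
try:
    args.orthofinder_results_dir
except AttributeError as err:
    print("AttributeError:", err)
```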