Open matrs opened 3 years ago
Hi, that's right. In my runs I get messages like this:
Analysing Orthogroups
=====================
2021-08-14 10:48:11 : Starting MSA/Trees
Species tree: Using 102 orthogroups with minimum of 63.3% of species having single-copy genes in any orthogroup
So these "single copy orthogroups" can have up to 36.7% of species (in my case that means 9 species) where genes are not in a single copy. They are either missing or in multiple copies, which is understandable because I have many fragmented assemblies where genes are missing or wrongly duplicated.
Hi @ViriatoII , thanks for your reply. I'm aware of that fact, but my point is related to using the N0.tsv (or other Nx.tsv) to define single copy genes. My N0.tsv tells me that x ogs are single copy, but when I check those ogxxxx.fa files, I see a few with more than one copy, which is unexpected to me given that my N0 file is telling me that those OGs have only single-copy genes in all the species.
Hi Jose Luis
The OG files are from the initial OGs, calculated using MCL clustering, described in Orthogroups/Orthogroups.tsv. The N0.tsv is determined using phylogenetic means to split fused orthogroups from the initial MCL analysis. You can use the tool OrthoFinder/tools/create_files_for_hogs.py to write out the corresponding fasta files for these HOGs.
The Single_Copy_Orthologue_Sequences comes from the old way to define OGs, as you say. If you want to calculate them for the N0.tsv file, you can use the tool tools/orthogroup_gene_count.py to get a count of genes per orthogroup per species. This can be analysed in Excel or a scripting language to find those orthogroups that are single-copy in all species.
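For readers who prefer a scripting language over Excel, here is a minimal sketch of that counting step done directly on N0.tsv. It does not use tools/orthogroup_gene_count.py. It assumes the N0.tsv layout is three leading columns (HOG, OG, Gene Tree Parent Clade) followed by one column per species with comma-separated gene names; the function name and the demo data are made up for illustration.

```python
import csv
import io

# Assumed N0.tsv layout: three metadata columns, then one column per species.
META_COLS = ("HOG", "OG", "Gene Tree Parent Clade")


def single_copy_hogs(handle):
    """Return the IDs of HOGs with exactly one gene in every species column."""
    reader = csv.DictReader(handle, delimiter="\t")
    species = [c for c in reader.fieldnames if c not in META_COLS]
    result = []
    for row in reader:
        # Count the non-empty, comma-separated gene names in each species cell.
        counts = [
            len([g for g in (row[s] or "").split(",") if g.strip()])
            for s in species
        ]
        if all(c == 1 for c in counts):
            result.append(row["HOG"])
    return result


# Hypothetical three-row table: only the first HOG is single-copy in both species.
demo = (
    "HOG\tOG\tGene Tree Parent Clade\tsp_A\tsp_B\n"
    "N0.HOG0000000\tOG0000000\tn0\tA_1\tB_1\n"
    "N0.HOG0000001\tOG0000001\tn0\tA_2, A_3\tB_2\n"
    "N0.HOG0000002\tOG0000002\tn0\tA_4\t\n"
)
print(single_copy_hogs(io.StringIO(demo)))
```

To run it on real output, pass an open file handle for N0.tsv instead of the in-memory demo string.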
Are you saying that you are seeing cases of genes listed in "Phylogenetically_Misplaced_Genes/" that are present and single copy in all species in the N0.tsv file? If so, could you send an example (gene tree and the name of the affected gene), as this isn't something I'd expect.
In general, genes in "Phylogenetically_Misplaced_Genes/" are those that appear to be out of place in the gene tree and would otherwise negatively affect orthology analysis if not identified. They are identified algorithmically, so if you're concerned about them with respect to single copy genes, it might be worth doing some quality control checks in the gene trees for a few cases to see if you need to exclude them.
All the best,
David
Hi David,
thanks for your thorough reply, now I think I understand better how this works. When I use create_files_for_hogs.py and use N0.tsv to define present and single copy genes in all the species, I get HOGs that effectively have only one gene in all the species. I wanted to define HOGs of genes that are present and single-copy in all the species, but I was using the fasta files in Orthogroups/. Related to this:
Are you saying that you are seeing cases of genes listed in "Phylogenetically_Misplaced_Genes/" that are present and single copy in all species in the N0.tsv file? If so, could you send an example (gene tree and the name of the affected gene), as this isn't something I'd expect.
As I tried to explain in the first post, when I use OGs that are single-copy and present in all the species according to N0.tsv, I do see a few of those OGs in the Orthogroups/ directory which have more than one copy per species (which didn't make sense to me, that's why I started this thread). In these particular cases, i.e. single copy according to N0, not single copy according to the OG files inside Orthogroups/, genes are present in Phylogenetically_Misplaced_Genes/. As I understand it now, this procedure is wrong, because when defining orthogroups which are single-copy and present in all species using the Nx.tsv files, I have to use the HOGs generated by create_files_for_hogs.py and not the OGs in the Orthogroups/ directory. Is this right?
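To see where the two views disagree, one option is to list the genes that sit in an initial OG but not in the HOG derived from it. The following is only a sketch under assumptions: it assumes Orthogroups.tsv has an "Orthogroup" column followed by one column per species, that N0.tsv has HOG, OG and Gene Tree Parent Clade columns followed by species columns, and that cells hold comma-separated gene names. The function names and demo data are invented for illustration.

```python
import csv
import io

# Assumed metadata columns of N0.tsv; everything else is treated as a species column.
N0_META = ("HOG", "OG", "Gene Tree Parent Clade")


def split_genes(cell):
    """Turn a comma-separated cell into a set of gene names."""
    return {g.strip() for g in (cell or "").split(",") if g.strip()}


def genes_outside_hog(n0_handle, og_handle):
    """Map each HOG ID to the genes of its parent OG that are not in the HOG."""
    og_genes = {}
    for row in csv.DictReader(og_handle, delimiter="\t"):
        genes = set()
        for col, cell in row.items():
            if col is not None and col != "Orthogroup":
                genes |= split_genes(cell)
        og_genes[row["Orthogroup"]] = genes

    extra = {}
    for row in csv.DictReader(n0_handle, delimiter="\t"):
        hog = set()
        for col, cell in row.items():
            if col is not None and col not in N0_META:
                hog |= split_genes(cell)
        diff = og_genes.get(row["OG"], set()) - hog
        if diff:
            extra[row["HOG"]] = sorted(diff)
    return extra


# Hypothetical data: the initial OG holds an extra sp_A gene that the HOG lacks.
og_demo = "Orthogroup\tsp_A\tsp_B\nOG0000000\tA_1, A_9\tB_1\n"
n0_demo = (
    "HOG\tOG\tGene Tree Parent Clade\tsp_A\tsp_B\n"
    "N0.HOG0000000\tOG0000000\tn0\tA_1\tB_1\n"
)
print(genes_outside_hog(io.StringIO(n0_demo), io.StringIO(og_demo)))
```

Note that when one OG is split into several HOGs, genes belonging to a sibling HOG are also reported as extra here, so the output is a starting point for inspection rather than a list of misplaced genes.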
I found a couple of errors in create_files_for_hogs.py.
In line 72 it failed because my runs took out a gene, so one line from Log.txt started with #:
i_species = [int(l.split(":")[0]) for l in species_ids_lines if l != "" and not l.startswith('#')]
Also, in line 238, args.orthofinder_results_dir doesn't exist (Namespace error), args.orthofinder_results does:
orthofinder_results_dir = args.orthofinder_results
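Both fixes can be checked in isolation with a small, self-contained sketch. It is not the code from create_files_for_hogs.py: the input lines are made-up entries of the form "index: filename", and the positional argument name is an assumption taken from the attribute name quoted above.

```python
import argparse


def parse_species_indices(lines):
    """Return integer species indices, skipping blank and '#'-prefixed lines."""
    return [
        int(l.split(":")[0])
        for l in lines
        if l.strip() != "" and not l.startswith("#")
    ]


# Hypothetical ID lines; the second entry is commented out with '#'.
lines = ["0: sp_A.fa", "#1: sp_B.fa", "2: sp_C.fa", ""]
indices = parse_species_indices(lines)
print(indices)

# argparse stores a positional argument under the name it was declared with,
# so reading a differently named attribute raises AttributeError.
parser = argparse.ArgumentParser()
parser.add_argument("orthofinder_results")  # assumed name, for illustration only
args = parser.parse_args(["Results_demo"])
print(hasattr(args, "orthofinder_results_dir"))
print(args.orthofinder_results)
```

Without the startswith('#') filter, int("#1") would raise a ValueError on the commented-out line, which matches the kind of failure described above.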
Hello,
Using N0.tsv, I'm trying to define the set of single copy orthogroups present in all species. In my case, using 198 bacterial genomes, I get 135 single copy orthogroups. When I check the sequences of these 135 orthogroups in Orthogroup_Sequences, I get a few OGs that have more than one gene per species:
My first question: Is it expected to find single-copy orthogenes from N0.tsv that have more than one gene per species in the OGxxxxx.fa files?
When I check the folder Single_Copy_Orthologue_Sequences, I see 129 sequences, which are exactly the 135 OGs obtained from N0.tsv minus these six OGs with more than one sequence. My second question is: Is Single_Copy_Orthologue_Sequences the recommended way to define single copy orthogenes present in all the species? (As I understand it, this folder comes from the old way to define OGs.)
When I searched for information about the above genes in Putative_Xenologs/ and Phylogenetically_Misplaced_Genes/, in some cases I find them in both, in others in only one of them. If using the Nx.tsv files to define sets of single copy OGs, what do I have to check to avoid potential problems related to the one described here?
Not directly related, but while searching for information about this, I saw a couple of posts talking about Putative_Horizontal_Gene_Transfer.txt, but none of my runs have such files.
I'm running OrthoFinder v2.5.4 (the last git version, which addresses #602).
All the above is valid for two runs:
Regards,
Jose Luis