davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Failed to execute script orthofinder #664

Open shrhops opened 2 years ago

shrhops commented 2 years ago

Hello, I have 5 .pep files that I'd like to run through OrthoFinder. One of them is a reference file I got from NCBI, while the other 4 I generated using TransDecoder (LongOrfs and Predict). The same approach worked for another dataset with the same reference but 4 different .pep files.

I first ran it as normal, getting this error:

Traceback (most recent call last):
  File "orthofinder.py", line 7, in <module>
  File "scripts_of/__main__.py", line 1778, in main
  File "scripts_of/__main__.py", line 1558, in GetOrthologues
  File "scripts_of/orthologues.py", line 1039, in OrthologuesWorkflow
  File "scripts_of/stride.py", line 509, in GetRoot
  File "scripts_of/tree.py", line 221, in __init__
  File "scripts_of/newick.py", line 216, in read_newick
scripts_of.newick.NewickError: Unexisting tree file or Malformed newick tree structure.
[20317] Failed to execute script orthofinder

I then tried to run it with -M msa instead, as recommended here, but that gave me this error:

2022-01-27 18:17:21 : Starting MSA/Trees
Species tree: Using 0 orthogroups with minimum of 50.0% of species having single-copy genes in any orthogroup

Inferring multiple sequence alignments for species tree
-------------------------------------------------------
All MSAs for the concatenated multiple sequence alignment were empty.
Please correct the error and re-run.
ERROR: An error occurred, ***please review the error messages*** they may contain useful information about the problem.

This is the initial report after writing orthogroups to file: OrthoFinder assigned 45450 genes (94.1% of total) to 6093 orthogroups. Fifty percent of all genes were in orthogroups with 11 or more genes (G50 was 11) and were contained in the largest 1011 orthogroups (O50 was 1011). There were 0 orthogroups with all species present and 0 of these consisted entirely of single-copy genes.

I assume the 0 orthogroups here is the issue. Could it have also been the problem with the first run, without MSA? If not, how can I fix this?

Thanks in advance!

davidemms commented 2 years ago

Hi

Yes, it looks like that is the problem. For some reason, not a single orthogroup has at least 50% of your species present (that is, 3 of your 5 species). Are these input proteomes complete? Even for partially incomplete transcriptomes there should be significant overlap, such that some genes are present in all 5 of your species. Do you know why that doesn't seem to be the case for your input files?

Best wishes,
David
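
One way to check the species-presence numbers discussed above is to tally a per-orthogroup gene count table directly. The following is a minimal sketch, assuming a tab-separated table laid out like OrthoFinder's Orthogroups.GeneCount.tsv (orthogroup ID first, one column per species, a final Total column). The function name, the example table, and the single-copy criterion are illustrative assumptions based on the wording of the log message; OrthoFinder's own selection logic may differ.

```python
import csv
import io


def species_presence_counts(tsv_text, min_fraction=0.5):
    """Tally orthogroups by species presence (hypothetical helper).

    Assumes the first column is the orthogroup ID and every other column
    except "Total" holds the gene count for one species.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    species_cols = [i for i, name in enumerate(header) if i > 0 and name != "Total"]
    n_species = len(species_cols)
    if n_species == 0:
        raise ValueError("no species columns found")

    all_present = 0          # every species has at least one gene
    enough_single_copy = 0   # >= min_fraction of species have exactly one gene
    for row in reader:
        if not row:
            continue
        counts = [int(row[i]) for i in species_cols]
        if all(c > 0 for c in counts):
            all_present += 1
        n_single = sum(1 for c in counts if c == 1)
        if n_single / n_species >= min_fraction:
            enough_single_copy += 1
    return {
        "all_species_present": all_present,
        "enough_single_copy": enough_single_copy,
    }


# Made-up counts for four species, for illustration only.
EXAMPLE_TABLE = (
    "Orthogroup\tA\tB\tC\tD\tTotal\n"
    "OG0\t1\t1\t1\t1\t4\n"
    "OG1\t2\t0\t0\t0\t2\n"
    "OG2\t1\t1\t0\t3\t5\n"
    "OG3\t0\t0\t5\t1\t6\n"
)

if __name__ == "__main__":
    print(species_presence_counts(EXAMPLE_TABLE))
```

For a real run, the table text would be read from the results directory and passed in place of EXAMPLE_TABLE. If both tallies are 0, as in the report above, the species tree step has nothing to work with.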

shrhops commented 2 years ago

Hi David, thanks for your response. The input files are actually one reference proteome from NCBI plus my sample files, which contain only the sequences of differentially expressed (DE) genes. Two of my input files had only a few DE genes in the first place (< 10), so it might make sense that they don't show up in the other sets.
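
Input sizes like these can be checked before running anything by counting FASTA header lines per file. This is a minimal sketch; the helper names, the threshold of 10, and the file names in the example are hypothetical and not part of OrthoFinder.

```python
def count_fasta_records(lines):
    """Count sequences by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def flag_small_inputs(record_counts, min_records=10):
    """Return the names whose record count is below min_records, sorted."""
    return sorted(name for name, n in record_counts.items() if n < min_records)


# Made-up FASTA content and counts, for illustration only.
EXAMPLE_LINES = [">gene1 example", "MKTAYIAK", "QRQISFVK", ">gene2", "MLLPPA"]
EXAMPLE_COUNTS = {"reference.pep": 20000, "sample1.pep": 7, "sample2.pep": 350}

if __name__ == "__main__":
    print(count_fasta_records(EXAMPLE_LINES))
    print(flag_small_inputs(EXAMPLE_COUNTS))
```

With real files, each count would come from something like `count_fasta_records(open(path))`, where `path` points at one .pep file.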

In theory, I could add a dummy gene/sequence to all the files, and that would make it work, right? If I'm understanding it correctly, there's no way to run it on this dataset as is?

davidemms commented 2 years ago

Can you include the sequences from the non-DE genes from that species? That will allow OrthoFinder to work out the distribution of gene divergences for that species.
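
Combining the DE and non-DE sequences for a species into one input file could look like the sketch below. It assumes the two sets exist as separate protein FASTA files and that the first whitespace-delimited token of each header is the sequence ID; the function names and the example sequences are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_records(*record_lists):
    """Concatenate record lists, keeping the first occurrence of each ID."""
    seen = set()
    merged = []
    for records in record_lists:
        for header, seq in records:
            # Assumption: the ID is the first token of the header line.
            seq_id = header.split(maxsplit=1)[0] if header.strip() else ""
            if seq_id not in seen:
                seen.add(seq_id)
                merged.append((header, seq))
    return merged


def to_fasta(records):
    """Serialize records back to FASTA text, one line per sequence."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Made-up sequences, for illustration only.
EXAMPLE_DE = ">g1 desc\nMKT\nAA\n>g2\nMLL\n"
EXAMPLE_NON_DE = ">g2\nMLL\n>g3\nMPP\n"

if __name__ == "__main__":
    merged = merge_records(parse_fasta(EXAMPLE_DE), parse_fasta(EXAMPLE_NON_DE))
    print(to_fasta(merged), end="")
```

The merged text would then be written to a single file per species and placed in the OrthoFinder input directory in place of the DE-only file.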