davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

MSA failure (no known reason) #779

Open drabe004 opened 1 year ago

drabe004 commented 1 year ago

Hi,

I've been trying to run a couple of datasets (one small and one large) using -M msa + RAxML.

I'm puzzled as to what is happening. I am running the two simultaneously, but on different partitions and in different directories (though both are being run from the same /Software directory).

Both runs complete, but with empty MSA directories and an empty species tree.

For both datasets (which work fine without -M msa) I get these warnings:

WARNING: list index out of range

WARNING: An MSA failed for an unknown reason: $PATH/Results_Jan20_2/WorkingDirectory/Alignments_ids/OG0002211.fa

WARNING: program called by OrthoFinder produced output to stderr

Command: mcl /panfs/jay/groups/26/mcgaughs/drabe004/Fernando_fish/Prots/primary_transcripts/OrthoFinder/Results_Jan20_2/WorkingDirectory/OrthoFinder_graph.txt -I 1.5 -o /panfs/jay/groups/26/mcgaughs/drabe004/Fernando_fish/Prots/primary_transcripts/OrthoFinder/Results_Jan20_2/WorkingDirectory/clusters_OrthoFinder_I1.5.txt -te 100 -V all

stdout

b''

stderr

b'[mcl] cut <1> instances of overlap\n'

2023-01-20 13:10:06 : Ran MCL

Writing orthogroups to file

OrthoFinder assigned 604307 genes (98.5% of total) to 24160 orthogroups. Fifty percent of all genes were in orthogroups with 33 or more genes (G50 was 33) and were contained in the largest 5600 orthogroups (O50 was 5600). There were 9101 orthogroups with all species present and 832 of these consisted entirely of single-copy genes.

2023-01-20 13:27:08 : Done orthogroups
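To see which orthogroups are affected, the "An MSA failed for an unknown reason" warnings quoted above can be collected from the saved run output. The following is a minimal sketch, assuming the log has been saved as text; the function name `failed_orthogroups` is hypothetical, and the pattern is based only on the warning wording shown in this post.

```python
import re
from pathlib import PurePosixPath

# Matches the warning text quoted in this issue and captures the alignment path.
MSA_FAIL = re.compile(r"An MSA failed for an unknown reason:\s*(\S+)")

def failed_orthogroups(log_text):
    """Return orthogroup IDs (file stems) named in MSA-failure warnings.

    IDs are returned in order of first appearance, without duplicates.
    """
    seen = []
    for match in MSA_FAIL.finditer(log_text):
        og = PurePosixPath(match.group(1)).stem  # e.g. ".../OG0002211.fa" -> "OG0002211"
        if og not in seen:
            seen.append(og)
    return seen
```

Reading the saved output with `open(path).read()` and passing the text to `failed_orthogroups` gives a list that can be compared against the files present in the alignment directory.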

davidemms commented 1 year ago

Hi

In general, RAxML is too heavyweight for a genome-wide analysis like this. However, you can investigate by running RAxML yourself on a small alignment in your results directory and on the largest one, and seeing whether any particular issues come up:

raxmlHPC-AVX -m PROTGAMMALG -p 12345 -s $PATH/Results_Jan20_2/WorkingDirectory/Alignments_ids/OG0002211.fa -n OG0002211 -w $PATH/Results_Jan20_2/WorkingDirectory/Alignments_ids/

and

raxmlHPC-AVX -m PROTGAMMALG -p 12345 -s $PATH/Results_Jan20_2/WorkingDirectory/Alignments_ids/OG0000000.fa -n OG0000000 -w $PATH/Results_Jan20_2/WorkingDirectory/Alignments_ids/

You can also edit the options for how OrthoFinder calls RAxML in the scripts_of/config.json file - that's where these commands come from.

Best wishes, David

drabe004 commented 1 year ago

Hi David,

This error also happens when I use just -M msa (with no tree-building method specified, so I assume it's trying to use FastTree). It does work if I use the defaults (DendroBLAST + STAG), but obviously I don't have any multiple sequence alignments at the end of that... which I need.

Any idea why that is?

Is -M msa failing because this is computationally too much as well? (Would it be faster/more efficient to use MAFFT on the output orthogroup lists?)

davidemms commented 1 year ago

Do you know if it fails at the MSA or tree inference stage? You can look in the directories Results_Jan20_2/WorkingDirectory/Alignments_ids/ and Tree_ids/, there should be a file for each orthogroup which has 4 or more genes (they are ordered by size).

If one or the other stage is failing, then running MAFFT or FastTree directly might reveal the problem.
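One way to check which stage fails is to count how many output files each directory holds and how many of them are empty. This is a minimal sketch: the function names are hypothetical, and the subdirectory names are taken from the comment above, so adjust them if the names on disk differ.

```python
from pathlib import Path

def summarize_outputs(directory):
    """Return (total_files, empty_files) for regular files directly inside `directory`."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    empty = [p for p in files if p.stat().st_size == 0]
    return len(files), len(empty)

def report(working_dir, subdirs=("Alignments_ids", "Tree_ids")):
    """Build one summary line per stage directory under the WorkingDirectory."""
    lines = []
    for name in subdirs:
        stage_dir = Path(working_dir) / name
        if not stage_dir.is_dir():
            lines.append(f"{name}: missing")
            continue
        total, empty = summarize_outputs(stage_dir)
        lines.append(f"{name}: {total} files, {empty} empty")
    return lines
```

Calling `print("\n".join(report("Results_Jan20_2/WorkingDirectory")))` would show at a glance whether the alignments are already empty (an MSA-stage problem) or only the trees are missing.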

drabe004 commented 1 year ago

Hmm, OK. The alignment IDs directory has some sequence files in it, but only ~130 tree IDs are empty. I'll try running it explicitly with MAFFT + FastTree and see what happens!

drabe004 commented 1 year ago

OK, I've now run this with -M msa -T fasttree -A mafft. It fails, and the sequence IDs folder is full (>33k files). The alignment IDs folder also has files in it (~2500), but all of those files are empty.

drabe004 commented 1 year ago

Just revisiting this now to try and get the -M msa option working for our data.

I've run MAFFT and FastTree on one of the Sequence_ids files and both worked perfectly fine.

Looking at the error file, it seems OrthoFinder is having some issue at the stage where it calls scripts to write the MSAs; the trees then fail because the MSAs are empty.
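As a stopgap while the -M msa step is failing, the manual MAFFT run described above can be repeated over every sequence file. The sketch below is not how OrthoFinder itself calls MAFFT: it assumes `mafft` is on the PATH, that the sequence files end in `.fa`, and that `mafft --auto` is an acceptable setting. The function name `align_directory` and its parameters are hypothetical.

```python
import subprocess
from pathlib import Path

def align_directory(seq_dir, out_dir, pattern="*.fa", run=subprocess.run):
    """Run `mafft --auto` on each matching file in `seq_dir`.

    Returns the names of inputs whose alignment failed or came out empty.
    `run` is injectable so the loop can be exercised without MAFFT installed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for fasta in sorted(Path(seq_dir).glob(pattern)):
        target = out_dir / fasta.name
        with open(target, "w") as handle:
            # MAFFT writes the alignment to stdout and progress messages to stderr.
            result = run(["mafft", "--auto", str(fasta)],
                         stdout=handle, stderr=subprocess.PIPE)
        # Check after the file is closed so its size on disk is final.
        if result.returncode != 0 or target.stat().st_size == 0:
            failed.append(fasta.name)
    return failed
```

The returned list makes it easy to tell whether the empty alignments come from particular input files or from the way the aligner is being invoked.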

drabe004 commented 1 year ago

Hi!

I'm just revisiting this and trying to get the -M msa option running for OrthoFinder.

Attaching the out file here in case you're able to take a look.

I can certainly go and run MAFFT and FastTree separately (those work fine on the seqID files), but I'm hoping to use OrthoFinder for quite a few large datasets, and it would be great if there were a way to easily remedy this!

Let me know if you have a moment to take a look!

Best,

~Danielle


-- Danielle H Drabeck, PhD, Postdoctoral Fellow

Department of Ecology, Evolution, and Behavior, University of Minnesota

Pronouns: She/Her/Hers