Mafft did not compute any alignments error

Simarpreet-Kaur-Bhurji commented 4 months ago

Hello, I have run FastOMA on about 2200 genomes. It all went fine until the hog_big step but it throws a "Mafft did not compute any alignments error" at the hog_rest step. I have attached the error file here for your reference. Can you please help me with this? command.txt

sinamajidian commented 4 months ago

Hi Simarpreet, thanks for reaching out and sharing the log.

It would be great to have the FASTA file of this rootHOG, clust148843. In the same folder where you got this .command.log file (the Nextflow work folder), there is a file called .command.sh. It contains the command line of the FastOMA subpackage, fastoma-infer-subhogs --input-rhog-folder. In that rhog folder, you can find the FASTA file of clust148843. Based on the log, it should be small, with 23 proteins. rootHOG names that start with clust are outputs of linclust; those based on OMAmer start with HOG (if I'm not mistaken). We also do some rootHOG merging.
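To make the lookup above concrete, here is a minimal Python sketch that reads the text of .command.sh, pulls out the value passed to --input-rhog-folder, and lists files in that folder whose name contains the rootHOG name. The helper names are hypothetical and not part of FastOMA, and the assumption that the FASTA file name contains the string clust148843 is a guess about the naming scheme.

```python
import shlex
from pathlib import Path


def rhog_folder_from_command(script_text, flag="--input-rhog-folder"):
    """Return the value passed to `flag` in a shell script, or None if absent."""
    # Join backslash-continued lines, then tokenize like a shell would.
    # comments=True skips the shebang and any other '#' comments.
    joined = script_text.replace("\\\n", " ")
    tokens = shlex.split(joined, comments=True)
    for i, token in enumerate(tokens):
        if token == flag and i + 1 < len(tokens):
            return tokens[i + 1]
        if token.startswith(flag + "="):
            return token.split("=", 1)[1]
    return None


def find_rhog_fasta(rhog_folder, rhog_name):
    """List files in the folder whose name contains the rootHOG name (assumed naming)."""
    return sorted(Path(rhog_folder).glob(f"*{rhog_name}*"))
```

Reading the script with `Path(".command.sh").read_text()` and passing the result to `rhog_folder_from_command` should give the folder to search. Note that the path may be relative to the Nextflow work folder, and that a script with unusual quoting can make `shlex.split` raise a `ValueError`.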

It's hard to guess the root cause of the issue since the mafft error is only Segmentation fault (core dumped). So, I want to run mafft on this FASTA file to see whether it was a one-time mafft issue or whether there is a specific character that mafft doesn't like. If mafft works well, I would run the same command line separately (outside of Nextflow) on a folder containing only this rootHOG's FASTA file.

(Please note that running things inside the work folder might change the hash values of files, which can prevent Nextflow from resuming correctly. So it's better to debug outside the Nextflow work folder. Also note that some of the Nextflow files are symlinks.)
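A rough sketch of that debugging workflow in Python: copy the rootHOG FASTA file out of the work folder (resolving the symlink so the original is never touched) and run mafft on the copy. The function names and the scratch folder are hypothetical, and `--auto` is only an assumed choice of mafft mode; the options FastOMA itself passes to mafft are not shown in this thread.

```python
import shutil
import subprocess
from pathlib import Path


def stage_outside_work_dir(fasta_path, scratch_dir):
    """Copy a FASTA file (following symlinks) into a scratch folder."""
    source = Path(fasta_path).resolve(strict=True)  # follow Nextflow symlinks
    scratch = Path(scratch_dir)
    scratch.mkdir(parents=True, exist_ok=True)
    target = scratch / Path(fasta_path).name
    shutil.copyfile(source, target)  # a real copy, not another link
    return target


def run_mafft(fasta_path, mafft_bin="mafft"):
    """Run `mafft --auto` (an assumed mode) and return the CompletedProcess."""
    return subprocess.run(
        [mafft_bin, "--auto", str(fasta_path)],
        capture_output=True,
        text=True,
        check=False,  # inspect returncode and stderr instead of raising
    )
```

After staging, calling `run_mafft` on the copied file and looking at `returncode` and `stderr` should show whether the crash reproduces on this one file alone.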

Simarpreet-Kaur-Bhurji commented 4 months ago

Hi Sina, Thank you for your swift response. I found the relevant FASTA file based on your instructions, and it seems that all the sequences in this file consist entirely of X's, which explains the issue. Is there any step in FastOMA that would filter these sequences out? We thought it a bit strange that a HOG was created with all X's. I'm wondering what your thoughts are.

sinamajidian commented 4 months ago

Clustering all Xs together is probably due to linclust being agnostic to the alphabet, but I can double-check.
The first step of FastOMA is check_input. For proteomes, we check that the protein IDs are unique and that the proteome has at least 2 proteins. Could you please tell us more about this case? For example, does one of the input proteomes look like this?

>prot1
MEQKNVRNFCIIAHVDHGKSTLADRLLEYTGAISE
>prot2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> prot3
RPSDLIKLTVLINKKPVDALSFIVHADRAQKFARRVA

It's a bit surprising to have a protein that is all Xs. (I could imagine such a case in a multiple sequence alignment, MSA, after trimming.) Anyway, we can extend FastOMA's check and at least inform the user of such cases, especially if this is the output of a widely used pipeline.
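As an illustration of what such a check could look like, here is a small standalone Python sketch that flags proteins whose sequence consists only of X residues. It is not FastOMA's check_input code; the function names are hypothetical, and ignoring a trailing `*` stop symbol is an assumption.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def all_x_proteins(text):
    """Return IDs of proteins whose sequence consists only of X residues."""
    flagged = []
    for header, seq in read_fasta(text):
        residues = seq.upper().rstrip("*")  # assumption: ignore a trailing stop symbol
        if residues and set(residues) == {"X"}:
            flagged.append(header.split()[0] if header else header)
    return flagged


example = """>prot1
MEQKNVRNFCIIAHVDHGKSTLADRLLEYTGAISE
>prot2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> prot3
RPSDLIKLTVLINKKPVDALSFIVHADRAQKFARRVA
"""
print(all_x_proteins(example))  # ['prot2']
```

A check along these lines could warn the user before clustering, rather than letting the sequences reach mafft.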

Simarpreet-Kaur-Bhurji commented 4 months ago

Hi Sina, There was an issue obtaining 2 of the input proteome files, so they looked more like the following:

>prot1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>prot2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> prot3
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I have removed these files for now and rerun FastOMA; it seems to be going well so far. I agree that flagging such surprising cases would be beneficial. Thank you for your input.

DessimozLab / FastOMA

Mafft did not compute any alignments error #28