biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License
128 stars 33 forks source link

Fragmentary entries not accepted by RAxML #36

Closed TlaskalV closed 4 years ago

TlaskalV commented 4 years ago

Hello Francesco, thank you for great documentation for PhyloPhlAn. I looked for solution of my issue but I was not able to find it so I started new one. When analyzing couple of MAGs, I can see gaps in subsampled alignments:

>3300_18
YVVDLGDLTSWPDDRIG
>3376_74
-----------------
>33012_7
FAVDLGDLSSWPGDRIS

Later, RAxML outputs files (e.g. RAxML_info.p0111.tre) in the gene_tree2 folder with error for each MAG with gaps:

ERROR: Sequence 3376_74 consists entirely of undetermined values which will be treated as missing data

My command is below:

phylophlan -i /mnt/DATA01/user/mags/ \
-d phylophlan \
--diversity high \
-f /mnt/DATA01/user/mags/supertree_aa.cfg \
--proteome_extension .faa \
--fast \
--nproc 20 \
--maas /mnt/DATA01/user/mags/phylophlan.tsv \
--submod_folder /mnt/DATA01/user/mags/ \
--configs_folder /mnt/DATA01/user/mags/

I guess remove_fragmentary_entries works correctly, marker is not completely missing in non-subsampled alignment:

>3300_18
EKN-ENFDWNAFENY-DQPEQITEAYDKTLSNVAVGEVVEGTVTAITKREVLVNIYSEGV
IPVSEFRYNP---DLKVDKIEVYVESAEDKNQLALHKKARQLKSDRVNEALEKDEIIKGY
IKCRTKMIVDVGIEALGQIDVKPIRDYDIYVDKTMEFKVVKINQEFRNVVVHKALIEAEL
EAQKQVIMSKLEKQILETKNITSYVVDLGVDLIITDLSWGRVNHPEEIVSLDQKINVILD
FDDQKKRIAGLQLTPHEALDPNLKVGDKVKGRVVVMADYAVEIAPVEIVEMSSQHLRSAQ
EFMKVGDEVEAVILTLDREERKMSGIKQLTPDENIETKYPVTKCTAKVRNFNFVVEIEEG
IDGLIISLSTKKVKHPGEFTQVADIDVVVEIDKENRRLSLHKQLEENWNGFEAQFPVESI
HEGTITEMTDKAVVALGNIEGFCPARQLVEDG-----TTPKVGDKLNFKVIEFSKATKRI
TLLRTYDDARREA-AAAATKTKASEKTTLGDI--------
>3376_74
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--------------E------------IVLGEIVDITDFAVRIGPTDLLQV---------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
----------------------------------------
>33012_7
---ENDL--LSLEA------TM---LEN-------GQLVEGTVVRVDKDEVLIDIYSEGV
IPPRELSN-PTEV-VSLERIEAVVLQREDKERLVLKKRAQYERASKIEEVKEADGVVTGS
VIEVVKLIVDIGLRGLALVELRRVRDLQPYIGTNVEAKIIELDKNRNNVVLRRAWLEENQ
KEQREEFLHDLRPEVRKVSSVVNFAVDLGMDLIVSELSWKHVDHPGSIVTVGDEIDQVLE
IDMSRERISSLATQQDQEFATAHQVGELVYGRVTKLVPFAVQVGEIEVIEMSAHHVELPE
QVVTPAEELWVKIIDLDLQRRRISSIKQ-------------------A---------AEG
-----VA---------AE-----E-----HFGE-----------ADD-------------
-EGNIG----------G--DDSE-------------------------------------
----------------------------------------

Please, do you have any suggestion how to proceed with RAxML? Thank you in advance!

fasnicar commented 4 years ago

Hi, many thanks for reporting this. I'll update the code to check for all empty sequences even after subsampled alignments.

A quick fix can be the removal of that genome 3376_74 from the MSA and then run RAxML. Another option would be to discard that marker/gene tree from your analysis.

Many thanks, Francesco

TlaskalV commented 4 years ago

I think I would have to remove more markers and more genomes. RAxML gave error for tens of MAGs in MSA. I will try some command line approach to omit them. With MAGs (despite having high completeness) this might happen easily.

fasnicar commented 4 years ago

If you have many MSAs with the same problem you can import the phylophlan.py and use the is_msa_empty(msa, path=None) function, which returns True if there is at least an empty sequence in the msa file.

TlaskalV commented 4 years ago

I am able to get list of markers present after subsampling in each genome (i.e. subsampled MSAs which do not have gap-only entries), I am using PhyloPhlAn db with 400 markers. May I ask if there is a way how to reduce RAxML refining to only this subset of markers? It always runs for the whole set. The solution might be to build custom reduced database. Is there an easier approach? Many thanks!

fasnicar commented 4 years ago

If you want to make a custom db with a reduced set of markers you can make a copy of the database phylophlan folder and remove from the phylophlan.faa fasta file the markers you don't want to use. Then you just need to delete the indexed db file (phylophlan.dmnd if you used diamond) and re-launch PhyloPhlAn.