AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

No alignments to any genomes #39

Closed SilasV123 closed 3 years ago

SilasV123 commented 3 years ago

Hi,

I'm sure I am doing something silly but I can't work out what it is. I am trying to generate a tree for a set of MAGs and genomes using the following command:

GToTree -f /net/fs-1/projects01/SoilCyc/gtotree/HQ_MAGS_filelist.txt \
        -H Universal_Hug_et_al \
        -D -L Species,Strain \
        -j 20 \
        -o /net/fs-1/projects01/SoilCyc/gtotree/HQ_MAGS_output

In the output I get the following issue:

16 gene(s) either had no hits in any genome, or only multiple hits per genome... Just so ya know!!

followed by:

** REASON FOR TERMINATION **
After filtering out genes that had either 0 hits in any genome OR only multiple hits, no genes remained. This typically shouldn't happen unless maybe there were very few genes being targeted, or very few genomes. You can consider running GToTree in "best-hit" mode (by providing the '-B' flag with no arguments), which will retain genes from genomes with multiple hits - but keep in mind that is less conservative. This is also a good time to make sure nothing weird is going on.

If someone could help me spot where I am going wrong, it would be great.

Regards, Silas

AstrobioMike commented 3 years ago

Hey there, Silas :)

Sorry you’ve hit some trouble! It seems like that might happen if the bins aren’t high-quality, but I’m guessing they are supposed to be, based on your HQ_MAGs label. Were these filtered by CheckM or something similar to make sure they are good ones? And how many MAGs are included?

Also, can you attach the SCG_counts.tsv file from the output directory here, so we can see whether none or too many of these target genes are being found?

And lastly, if you would be ok with sharing maybe just two of these MAG fasta files privately with me, I can look at things directly. An email you can send them to is mikeleebmsisorg if so :) -Mike

SilasV123 commented 3 years ago

Hi Mike,

Yes, these are HQ MAGs assessed with CheckM (completeness >50% and contamination <20%). There are also some isolate genomes mixed in which we know are complete and good quality (single contig).

I'm not sure how to attach files here but looking through SCG_hit_counts.tsv it just shows MAG names in the first column and the ribosomal protein names in the first row with nothing anywhere else, no matches. I will send the file in an email with a couple MAGs to you at the email you gave me, maybe along with the log file.
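For anyone wanting to check a hit-count table like the one described above without opening it by hand, here is a minimal Python sketch. It assumes only the layout Silas describes (genome names in the first column, gene names in the first row, tab-separated) and treats blank cells as zero. The function name, the example gene names, and the example rows are all hypothetical, and this is not how GToTree itself does its filtering. It simply lists the genes that never show exactly one hit in any genome.

```python
import csv
import io


def summarize_hit_counts(tsv_text):
    """Return (genes_without_single_hit, total_hits) for a genome-by-gene table.

    Assumed layout (hypothetical, based on the description in this thread):
    first row = gene names, first column = genome names, cells = hit counts.
    Blank cells are counted as zero hits.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    genes = rows[0][1:]
    single_copy_seen = {gene: False for gene in genes}
    total_hits = 0

    for row in rows[1:]:
        if not row:
            continue  # skip completely empty lines
        for gene, cell in zip(genes, row[1:]):
            count = int(cell) if cell.strip() else 0
            total_hits += count
            if count == 1:
                single_copy_seen[gene] = True

    # Genes with zero hits everywhere, or only multiple hits per genome.
    dropped = [gene for gene in genes if not single_copy_seen[gene]]
    return dropped, total_hits


# Made-up example table: geneA is single-copy, geneB is absent, geneC is multi-copy.
example = (
    "assembly_id\tgeneA\tgeneB\tgeneC\n"
    "MAG_1\t1\t0\t2\n"
    "MAG_2\t1\t\t3\n"
)
dropped, total = summarize_hit_counts(example)
print("genes with no single-copy hit:", dropped)
print("total hits in table:", total)
```

If the total comes back as zero for every gene, as in the table described here, the problem is more likely in the search step or the environment than in the quality of the genomes.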

One thing to mention is that when I ran the test script it wasn't able to download the files needed for that run. Not sure if this failure to get files from online could be causing issues. I should also mention I have this running on a university cluster not on a PC.

Thanks for getting onto this so quickly for me.

Silas

SilasV123 commented 3 years ago

I forgot to mention there are 294 MAGs (and genomes) included.

Silas

AstrobioMike commented 3 years ago

Just updating while closing this issue. There seemed to be some problem with how the system was loading libraries when run through slurm, as the main GToTree program was unable to load biopython (even though it was a conda install and it was there). I wasn't able to replicate it unfortunately. Silas was able to run things on a local computer so we both moved on with our lives :)
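For anyone landing here with a similar symptom, one way to test whether a Python package such as biopython loads in a given environment is to try the import and report which interpreter is in use. The sketch below is an assumption about how one might diagnose this, not something that was run in this thread. The helper name `check_module` is hypothetical, and `Bio` is the import name biopython installs under.

```python
import importlib
import sys


def check_module(name):
    """Try to import a module and report the result plus the active interpreter."""
    try:
        module = importlib.import_module(name)
    except Exception as err:  # ImportError, or a failure raised while loading
        return {
            "module": name,
            "importable": False,
            "detail": repr(err),
            "interpreter": sys.executable,
        }
    return {
        "module": name,
        "importable": True,
        "detail": getattr(module, "__file__", None),
        "interpreter": sys.executable,
    }


# "Bio" is the import name for biopython.
print(check_module("Bio"))
```

Running the same script once in an interactive shell and once inside a Slurm job, then comparing the `interpreter` and `detail` fields, should show whether the job is picking up a different Python or failing to load the package.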