Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

Sequence XX doesn't overlap any reference sequences #23

Closed by michoug 6 years ago

michoug commented 6 years ago

Hi, while analysing different MAGs with potential plasmids, I get this error:

[2018-05-30 12:27:01] INFO: Placing 355 bacterial genomes into reference tree with pplacer (be patient).
Uncaught exception: Failure("Sequence N2F3_MBin.34 doesn't overlap any reference sequences.")
Fatal error: exception Failure("Sequence N2F3_MBin.34 doesn't overlap any reference sequences.")
Uncaught exception: Sys_error("All_out/pplacer/pplacer.bac120.json: No such file or directory")
Fatal error: exception Sys_error("All_out/pplacer/pplacer.bac120.json: No such file or directory")
GTDB-Tk has stopped before finishing

Would it be possible to skip the sequence without shutting down the software? Cheers, Greg

donovan-h-parks commented 6 years ago

Hey Greg. The error message and subsequent termination are caused by pplacer. I don't have any direct control over this.

michoug commented 6 years ago

Hi Donovan, thanks for your answer. I resolved the issue after noticing that there were only gaps in the alignment for that sequence, and then manually removing the files. I suppose this is because none of the 120 genes were found. Would it be possible to check this before the pplacer step and remove the sequence from the rest of the analysis? Greg
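The check described above (spotting alignment rows that contain only gaps before they reach pplacer) can be sketched in a few lines of Python. This is an illustration only, not part of GTDB-Tk: the helper names `read_fasta` and `all_gap_ids` are hypothetical, the alignment is assumed to be in FASTA format, and `-` and `.` are assumed to be the gap characters. The ID `other_bin` in the example is made up.

```python
from typing import Dict, Iterable, List

# Assumption: gaps in the alignment are written as '-' or '.'.
GAP_CHARS = set("-.")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_id: sequence}."""
    parts: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            current = header[0] if header else ""
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def all_gap_ids(seqs: Dict[str, str]) -> List[str]:
    """Return IDs whose aligned sequence has no residues at all."""
    return [
        name
        for name, seq in seqs.items()
        if all(char in GAP_CHARS for char in seq)
    ]


# Small in-memory example; a real alignment would be read from a file instead.
example = [
    ">N2F3_MBin.34",
    "----------",
    ">other_bin",
    "MK--LV--AA",
]
print(all_gap_ids(read_fasta(example)))
```

Any ID reported here could be removed from the alignment by hand before the placement step, which is the manual workaround described in the comment above.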

donovan-h-parks commented 6 years ago

Hey Greg. I think it must be something more subtle than that. CheckM will default to using a "universal" set of markers if a genome contains fewer than 10 unique marker genes (see the --unique flag).

michoug commented 6 years ago

Hey Donovan, I was not aware that GTDB-Tk was using CheckM! I didn't find the --unique flag in the GTDB-Tk script. Looking at the gtdbtk_bac120_markers_summary file, there are some cases where no unique_genes or number_multiple_genes are found:

gtdbtk_bac120_markers_summary.txt

donovan-h-parks commented 6 years ago

Hey Greg. Sorry, too many different programs these days. GTDB-Tk has the flag "--min_perc_aa". If you set this to something greater than zero, does it filter these out for you?

michoug commented 6 years ago

Yes, it does. Here is one of the lines from the log:

[2018-06-01 11:26:43] INFO: 18 user genomes have amino acids in <1.0% of columns in filtered MSA.
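To make the log line above concrete, here is a minimal Python sketch of the kind of filter it describes: a genome is dropped when amino acids occupy less than a given percentage of the alignment columns. This is not GTDB-Tk's actual implementation; the function names `percent_aa` and `split_by_min_perc_aa` are hypothetical, gaps are assumed to be `-` or `.`, and the genome names in the example are made up.

```python
from typing import Dict, Tuple


def percent_aa(seq: str, gap_chars: str = "-.") -> float:
    """Percentage of alignment columns in which this sequence has a residue."""
    if not seq:
        return 0.0
    filled = sum(1 for char in seq if char not in gap_chars)
    return 100.0 * filled / len(seq)


def split_by_min_perc_aa(
    msa: Dict[str, str], min_perc_aa: float
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split an alignment into (kept, filtered) at the given threshold.

    A genome is filtered when its percentage is strictly below min_perc_aa,
    matching the "<1.0% of columns" wording in the log line.
    """
    kept: Dict[str, str] = {}
    filtered: Dict[str, str] = {}
    for name, seq in msa.items():
        target = kept if percent_aa(seq) >= min_perc_aa else filtered
        target[name] = seq
    return kept, filtered


# Hypothetical 200-column alignment with three genomes.
msa = {
    "empty_bin": "-" * 200,            # 0.0% of columns filled
    "sparse_bin": "M" + "-" * 199,     # 0.5% of columns filled
    "ok_bin": "MK" + "-" * 198,        # 1.0% of columns filled
}
kept, filtered = split_by_min_perc_aa(msa, min_perc_aa=1.0)
print(sorted(kept), sorted(filtered))
```

With a threshold of zero nothing is filtered, which matches the behaviour reported earlier in the thread, where an all-gap sequence reached pplacer and caused the failure.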

donovan-h-parks commented 6 years ago

Great. I think we will actually change the default to 50% in the next release.