Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

Not all genomes classified #356

Closed MrOlm closed 2 years ago

MrOlm commented 2 years ago

Hello,

I ran GTDB-Tk v1.7.0 on 246 genomes using the command:

gtdbtk classify_wf --genome_dir ${LOCAL}/Genomes/ --out_dir ${LOCAL}/GTDB_OUT --extension $binExtension --cpus $coreNum,

but only 238 ended up in the final file gtdbtk.bac120.summary.tsv. Below is the log file (gtdbtk.log). In it you can see that 246 are being used for most of the time the program is running, but it switches to 238 after pplacer.

What happened to those other 8 genomes? The file gtdbtk.failed_genomes.tsv is empty, so I don't think they failed. Is it possible for pplacer to fail for a genome? And if so, is fastANI classification still attempted?

Thanks in advance for your help, Matt

[2021-10-19 16:48:50] INFO: GTDB-Tk v1.7.0
[2021-10-19 16:48:50] INFO: gtdbtk classify_wf --genome_dir /mnt/Genomes/ --out_dir /mnt/GTDB_OUT --extension fa --cpus 16
[2021-10-19 16:48:50] INFO: Using GTDB-Tk reference data version r202: /mnt/GTDB_DB
[2021-10-19 16:48:50] INFO: Identifying markers in 246 genomes with 16 threads.
[2021-10-19 16:48:50] TASK: Running Prodigal V2.6.3 to identify genes.
[2021-10-19 16:52:48] INFO: Completed 246 genomes in 3.97 minutes (61.95 genomes/minute).
[2021-10-19 16:52:48] TASK: Identifying TIGRFAM protein families.
[2021-10-19 16:54:22] INFO: Completed 246 genomes in 1.57 minutes (157.14 genomes/minute).
[2021-10-19 16:54:22] TASK: Identifying Pfam protein families.
[2021-10-19 16:54:33] INFO: Completed 246 genomes in 10.61 seconds (23.18 genomes/second).
[2021-10-19 16:54:33] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2021-10-19 16:54:33] TASK: Summarising identified marker genes.
[2021-10-19 16:54:39] INFO: Completed 246 genomes in 6.13 seconds (40.12 genomes/second).
[2021-10-19 16:54:39] INFO: Done.
[2021-10-19 16:54:39] INFO: Aligning markers in 246 genomes with 16 CPUs.
[2021-10-19 16:54:39] INFO: Processing 246 genomes identified as bacterial.
[2021-10-19 16:54:44] INFO: Read concatenated alignment for 45,555 GTDB genomes.
[2021-10-19 16:54:44] TASK: Generating concatenated alignment for each marker.
[2021-10-19 16:54:46] INFO: Completed 246 genomes in 0.44 seconds (559.53 genomes/second).
[2021-10-19 16:54:46] TASK: Aligning 120 identified markers using hmmalign 3.1b2 (February 2015).
[2021-10-19 16:55:04] INFO: Completed 120 markers in 17.03 seconds (7.05 markers/second).
[2021-10-19 16:55:04] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2021-10-19 16:56:30] INFO: Completed 45,801 sequences in 1.43 minutes (32,113.68 sequences/minute).
[2021-10-19 16:56:30] INFO: Masked bacterial alignment from 41,084 to 5,037 AAs.
[2021-10-19 16:56:30] INFO: 0 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2021-10-19 16:56:30] INFO: Creating concatenated alignment for 45,801 bacterial GTDB and user genomes.
[2021-10-19 16:56:30] INFO: Creating concatenated alignment for 246 bacterial user genomes.
[2021-10-19 16:56:31] INFO: Done.
[2021-10-19 16:56:31] TASK: Placing 246 bacterial genomes into reference tree with pplacer using 16 CPUs (be patient).
[2021-10-19 16:56:31] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2021-10-19 18:10:07] INFO: Calculating RED values based on reference tree.
[2021-10-19 18:10:20] TASK: Traversing tree to determine classification method.
[2021-10-19 18:10:20] INFO: Completed 238 genomes in 0.33 seconds (710.70 genomes/second).
[2021-10-19 18:10:37] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2021-10-19 18:20:50] INFO: Completed 16,014 comparisons in 10.21 minutes (1,568.58 comparisons/minute).
[2021-10-19 18:20:52] INFO: 220 genome(s) have been classified using FastANI and pplacer.
[2021-10-19 18:20:52] INFO: Done.

donovan-h-parks commented 2 years ago

Hi Matt,

We recently ran into this situation ourselves. It appears pplacer can randomly drop genomes, as you suspected. As best we can tell, there is no reason for this (i.e. the genomes are actually fine and can be placed by pplacer). If you re-run the missing genomes, my guess is that they will go through pplacer and GTDB-Tk without issue. I fully appreciate this is very annoying. Even worse, these dropped genomes are currently silently lost by GTDB-Tk, since we didn't expect pplacer to drop genomes. We are looking to at least build in a warning when this happens, but obviously we'd like to better understand why this occurs and hopefully find a solution/workaround.

The dataset where we previously observed this issue is proprietary and extremely large, so we haven't been able to explore the issue in detail.

Any chance you can send us your 246 genomes so we can investigate further?

Thanks, Donovan

MrOlm commented 2 years ago

OK great- thanks for the quick response and explanation Donovan. Happy to send along the genomes- sending a followup email with download link now.

Best, Matt

pchaumeil commented 2 years ago

Hi Matt, I have run the 250 genomes from your shared file and they all received a final GTDB-Tk taxonomy. It seems this pplacer bug is random and does not occur every time. Could you please rerun your command once more to check whether you get the same result (i.e., the same genomes missing)?

Thanks, Pierre

MrOlm commented 2 years ago

Hi Pierre,

I can confirm that by re-running I was able to get taxonomic calls for all genomes. What a weird stochastic bug!

Best, Matt

donovan-h-parks commented 2 years ago

Thanks Matt. This is a real nightmare on our end. If you run into the issue again, please let us know. It would be good to get a sense of how often this is happening. We will work to at least put in a clear warning/error message that indicates genomes have been "randomly" skipped.

mhyleung commented 2 years ago

Dear all

I have encountered a similar problem: I started the analysis with 49 genomes, but only 38 were included in the concatenated alignment. I notice that the genomes that were NOT included tend to be smaller than the others, but some small genomes did make it through the entire process, so genome size alone does not appear to be the issue. Is there a cutoff that a genome must meet in the alignment step, below which it is removed from the subsequent GTDB-Tk steps?

In my case, repeating the run does not solve the issue.

The line in question is "[2021-12-01 01:55:49] INFO: Creating concatenated alignment for 38 bacterial user genomes." in the log below:

[2021-12-01 01:50:14] INFO: Using GTDB-Tk reference data version r202: /mypath/databases/gtdbtk_R202/release202
[2021-12-01 01:50:14] INFO: Identifying markers in 49 genomes with 10 threads.
[2021-12-01 01:50:14] TASK: Running Prodigal V2.6.3 to identify genes.
[2021-12-01 01:52:57] INFO: Completed 49 genomes in 2.71 minutes (18.10 genomes/minute).
[2021-12-01 01:52:57] TASK: Identifying TIGRFAM protein families.
[2021-12-01 01:53:28] INFO: Completed 49 genomes in 31.52 seconds (1.55 genomes/second).
[2021-12-01 01:53:28] TASK: Identifying Pfam protein families.
[2021-12-01 01:53:33] INFO: Completed 49 genomes in 4.18 seconds (11.71 genomes/second).
[2021-12-01 01:53:33] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2021-12-01 01:53:33] TASK: Summarising identified marker genes.
[2021-12-01 01:53:34] INFO: Completed 49 genomes in 1.34 seconds (36.63 genomes/second).
[2021-12-01 01:53:34] INFO: Done.
[2021-12-01 01:53:35] INFO: Aligning markers in 49 genomes with 10 CPUs.
[2021-12-01 01:53:35] INFO: Processing 48 genomes identified as bacterial.
[2021-12-01 01:53:56] INFO: Read concatenated alignment for 45,555 GTDB genomes.
[2021-12-01 01:53:56] TASK: Generating concatenated alignment for each marker.
[2021-12-01 01:53:57] INFO: Completed 48 genomes in 0.33 seconds (144.23 genomes/second).
[2021-12-01 01:53:58] TASK: Aligning 120 identified markers using hmmalign 3.1b2 (February 2015).
[2021-12-01 01:54:00] INFO: Completed 120 markers in 0.71 seconds (170.01 markers/second).
[2021-12-01 01:54:00] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2021-12-01 01:55:49] INFO: Completed 45,603 sequences in 1.82 minutes (25,124.51 sequences/minute).
[2021-12-01 01:55:49] INFO: Masked bacterial alignment from 41,084 to 5,037 AAs.
[2021-12-01 01:55:49] INFO: 10 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2021-12-01 01:55:49] INFO: Creating concatenated alignment for 45,593 bacterial GTDB and user genomes.
[2021-12-01 01:55:49] INFO: Creating concatenated alignment for 38 bacterial user genomes.
[2021-12-01 01:55:49] INFO: Processing 1 genomes identified as archaeal.
[2021-12-01 01:55:50] INFO: Read concatenated alignment for 2,339 GTDB genomes.
[2021-12-01 01:55:50] TASK: Generating concatenated alignment for each marker.
[2021-12-01 01:55:51] INFO: Completed 1 genome in 0.01 seconds (147.21 genomes/second).
[2021-12-01 01:55:51] TASK: Aligning 49 identified markers using hmmalign 3.1b2 (February 2015).
[2021-12-01 01:55:52] INFO: Completed 49 markers in 0.27 seconds (184.10 markers/second).
[2021-12-01 01:55:52] TASK: Masking columns of archaeal multiple sequence alignment using canonical mask.
[2021-12-01 01:55:58] INFO: Completed 2,340 sequences in 5.76 seconds (406.16 sequences/second).
[2021-12-01 01:55:58] INFO: Masked archaeal alignment from 32,754 to 5,124 AAs.
[2021-12-01 01:55:58] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2021-12-01 01:55:58] INFO: Creating concatenated alignment for 2,340 archaeal GTDB and user genomes.
[2021-12-01 01:55:58] INFO: Creating concatenated alignment for 1 archaeal user genomes.
[2021-12-01 01:55:59] INFO: Done.
[2021-12-01 01:55:59] TASK: Placing 1 archaeal genomes into reference tree with pplacer using 10 CPUs (be patient).
[2021-12-01 01:55:59] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2021-12-01 01:56:34] INFO: Calculating RED values based on reference tree.
[2021-12-01 01:56:34] TASK: Traversing tree to determine classification method.
[2021-12-01 01:56:34] INFO: Completed 1 genome in 0.00 seconds (9,664.29 genomes/second).
[2021-12-01 01:56:34] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2021-12-01 01:56:35] INFO: Completed 2 comparisons in 0.20 seconds (10.00 comparisons/second).
[2021-12-01 01:56:35] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2021-12-01 01:56:35] TASK: Placing 38 bacterial genomes into reference tree with pplacer using 10 CPUs (be patient).
[2021-12-01 01:56:35] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2021-12-01 02:51:35] INFO: Calculating RED values based on reference tree.
[2021-12-01 02:51:43] TASK: Traversing tree to determine classification method.
[2021-12-01 02:51:44] INFO: Completed 38 genomes in 0.01 seconds (2,994.97 genomes/second).
[2021-12-01 02:51:46] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2021-12-01 02:52:34] INFO: Completed 646 comparisons in 48.14 seconds (13.42 comparisons/second).
[2021-12-01 02:52:34] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2021-12-01 02:52:34] INFO: Done.

Thank you very much!

Marcus
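
One way to see where genomes leave the pipeline is to pull the per-stage counts out of gtdbtk.log and do the subtraction. The sketch below is not part of GTDB-Tk; the function names and regular expressions are assumptions written against the log wording shown above, and they may need adjusting for other versions. For this log, 48 bacterial genomes minus the 10 reported under the "<10.0% of columns" line leaves exactly 38, so the alignment filter appears to account for every missing bacterial genome.

```python
import re

# A few lines copied from the log above; in practice, read gtdbtk.log instead.
LOG = """\
[2021-12-01 01:53:35] INFO: Processing 48 genomes identified as bacterial.
[2021-12-01 01:55:49] INFO: 10 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2021-12-01 01:55:49] INFO: Creating concatenated alignment for 38 bacterial user genomes.
[2021-12-01 01:55:49] INFO: Processing 1 genomes identified as archaeal.
[2021-12-01 01:55:58] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2021-12-01 01:55:58] INFO: Creating concatenated alignment for 1 archaeal user genomes.
"""

# Hypothetical patterns: group 1 is the count, group 2 is the domain.
PATTERNS = {
    "identified": re.compile(r"Processing ([\d,]+) genomes identified as (\w+)"),
    "filtered": re.compile(r"([\d,]+) (\w+) user genomes have amino acids in <"),
    "aligned": re.compile(r"alignment for ([\d,]+) (\w+) user genomes"),
}


def genome_counts(log_text):
    """Return {domain: {stage: count}} for the stages matched in the log."""
    counts = {}
    for line in log_text.splitlines():
        for stage, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                number = int(match.group(1).replace(",", ""))
                counts.setdefault(match.group(2), {})[stage] = number
    return counts


def unexplained(counts):
    """Genomes per domain that are neither filtered nor aligned."""
    result = {}
    for domain, stages in counts.items():
        if all(key in stages for key in ("identified", "filtered", "aligned")):
            result[domain] = (
                stages["identified"] - stages["filtered"] - stages["aligned"]
            )
    return result


if __name__ == "__main__":
    print(unexplained(genome_counts(LOG)))
```

A non-zero value for a domain would mean genomes disappeared between marker identification and alignment without being reported by the filter line.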

jmtsuji commented 2 years ago

I've encountered the same problem using GTDB-Tk version 1.5.1 -- when classifying 66 metagenome-assembled genomes (62 bacterial, 4 archaeal), several of the bacterial genomes fail. I re-ran GTDB-Tk twice, and each time, some of the genomes failed (although the failing genomes were not consistent). Two of the genomes failed to classify in all three of the full GTDB-Tk runs, so I then ran GTDB-Tk just on those two genomes, and they were classified without issue.

I agree that at least a warning message that some genomes failed would be helpful. It might also be possible to subset the failed genomes and run them through pplacer again, with reasonable chance of success.

Unfortunately, the dataset I'm working with is unpublished, and I'm not able to share it at this time (as much as I'd like to!). Thanks for all your work on this helpful tool.
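
Until a warning exists, the subsetting idea above can be approximated by comparing the input directory against the summary files. This is a minimal sketch, not a GTDB-Tk feature: it assumes that genome IDs are the input file names with the `--extension` suffix removed, that the summary files match `gtdbtk.*.summary.tsv` somewhere under the output directory, and that their ID column is named `user_genome`. The function names are hypothetical.

```python
import csv
from pathlib import Path


def genome_ids(genome_dir, extension):
    """IDs of input genomes: file names with the given extension stripped."""
    suffix = "." + extension.lstrip(".")
    return {
        path.name[: -len(suffix)]
        for path in Path(genome_dir).iterdir()
        if path.is_file() and path.name.endswith(suffix)
    }


def classified_ids(out_dir):
    """IDs listed in any summary file under out_dir (column name assumed)."""
    ids = set()
    for summary in Path(out_dir).rglob("gtdbtk.*.summary.tsv"):
        with open(summary, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                ids.add(row["user_genome"])
    return ids


def missing_genomes(genome_dir, out_dir, extension):
    """Sorted IDs present in the input directory but absent from all summaries."""
    return sorted(genome_ids(genome_dir, extension) - classified_ids(out_dir))
```

The returned IDs could then be copied or symlinked into a fresh directory and run through `gtdbtk classify_wf` on their own, which is what resolved the problem in the cases described above.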

lfenske-93 commented 2 years ago

Hi 😄

I'm not quite sure if it's the same problem, but something similar is occurring for me in v2.0.0 as well. Initially 1000 isolates were submitted, 6 of which may be excluded due to their low AA content, but then only 989 genomes are processed. *warnings.log as well as *failed.genomes.tsv are empty. I have repeated the whole thing once so far, but with no change.

Any idea what the reason could be? The log output follows below.

[2022-04-12 10:57:43] INFO: GTDB-Tk v2.0.0
[2022-04-12 10:57:43] INFO: gtdbtk classify_wf --genome_dir /Assemblies/batch_000/ --out_dir gtdb/ --extension gz --cpus 4
[2022-04-12 10:57:43] INFO: Using GTDB-Tk reference data version r207: /gtdb/release207/
[2022-04-12 10:57:44] INFO: Identifying markers in 1,000 genomes with 4 threads.
[2022-04-12 10:57:44] TASK: Running Prodigal V2.6.3 to identify genes.
[2022-04-12 12:16:30] INFO: Completed 1,000 genomes in 78.76 minutes (12.70 genomes/minute).
[2022-04-12 12:16:30] TASK: Identifying TIGRFAM protein families.
[2022-04-12 12:48:29] INFO: Completed 1,000 genomes in 31.98 minutes (31.27 genomes/minute).
[2022-04-12 12:48:29] TASK: Identifying Pfam protein families.
[2022-04-12 12:50:03] INFO: Completed 1,000 genomes in 1.58 minutes (633.88 genomes/minute).
[2022-04-12 12:50:03] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2022-04-12 12:50:03] TASK: Summarising identified marker genes.
[2022-04-12 12:50:52] INFO: Completed 1,000 genomes in 48.32 seconds (20.69 genomes/second).
[2022-04-12 12:50:52] INFO: Done.
[2022-04-12 12:50:59] INFO: Aligning markers in 1,000 genomes with 4 CPUs.
[2022-04-12 12:51:00] INFO: Processing 1,000 genomes identified as bacterial.
[2022-04-12 12:51:15] INFO: Read concatenated alignment for 62,291 GTDB genomes.
[2022-04-12 12:51:15] TASK: Generating concatenated alignment for each marker.
[2022-04-12 12:51:22] INFO: Completed 1,000 genomes in 6.45 seconds (154.92 genomes/second).
[2022-04-12 12:51:22] TASK: Aligning 120 identified markers using hmmalign 3.1b2 (February 2015).
[2022-04-12 12:56:12] INFO: Completed 120 markers in 4.83 minutes (24.86 markers/minute).
[2022-04-12 12:56:13] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2022-04-12 12:57:47] INFO: Completed 63,286 sequences in 1.57 minutes (40,267.26 sequences/minute).
[2022-04-12 12:57:47] INFO: Masked bacterial alignment from 41,084 to 5,036 AAs.
[2022-04-12 12:57:47] INFO: 6 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2022-04-12 12:57:47] INFO: Creating concatenated alignment for 63,280 bacterial GTDB and user genomes.
[2022-04-12 12:58:07] INFO: Creating concatenated alignment for 989 bacterial user genomes.
[2022-04-12 12:58:08] INFO: Done.
[2022-04-12 12:58:09] TASK: Placing 989 bacterial genomes into backbone reference tree with pplacer using 4 CPUs (be patient).
[2022-04-12 12:58:09] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2022-04-12 13:18:31] INFO: Calculating RED values based on reference tree.
[2022-04-12 13:18:35] INFO: 989 out of 989 have an order assignments. Those genomes will be reclassified.
[2022-04-12 13:18:35] TASK: Placing 547 bacterial genomes into order-level reference tree 6 (1/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 13:29:57] INFO: Calculating RED values based on reference tree.
[2022-04-12 13:29:58] TASK: Traversing tree to determine classification method.
[2022-04-12 13:29:58] INFO: Completed 547 genomes in 0.38 seconds (1,421.39 genomes/second).
[2022-04-12 13:30:07] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 15:18:03] INFO: Completed 18,588 comparisons in 107.91 minutes (172.26 comparisons/minute).
[2022-04-12 15:18:04] INFO: 544 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 15:18:04] TASK: Placing 159 bacterial genomes into order-level reference tree 11 (2/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 15:23:45] INFO: Calculating RED values based on reference tree.
[2022-04-12 15:23:46] TASK: Traversing tree to determine classification method.
[2022-04-12 15:23:47] INFO: Completed 159 genomes in 0.69 seconds (231.97 genomes/second).
[2022-04-12 15:23:53] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 16:14:33] INFO: Completed 21,080 comparisons in 50.65 minutes (416.21 comparisons/minute).
[2022-04-12 16:14:35] INFO: 157 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 16:14:35] TASK: Placing 68 bacterial genomes into order-level reference tree 1 (3/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 16:18:19] INFO: Calculating RED values based on reference tree.
[2022-04-12 16:18:20] TASK: Traversing tree to determine classification method.
[2022-04-12 16:18:20] INFO: Completed 68 genomes in 0.03 seconds (2,636.95 genomes/second).
[2022-04-12 16:18:21] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 16:24:38] INFO: Completed 2,968 comparisons in 6.27 minutes (473.01 comparisons/minute).
[2022-04-12 16:24:38] INFO: 68 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 16:24:38] TASK: Placing 53 bacterial genomes into order-level reference tree 9 (4/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 16:27:55] INFO: Calculating RED values based on reference tree.
[2022-04-12 16:27:56] TASK: Traversing tree to determine classification method.
[2022-04-12 16:27:56] INFO: Completed 53 genomes in 0.33 seconds (158.82 genomes/second).
[2022-04-12 16:28:02] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 17:30:31] INFO: Completed 10,142 comparisons in 62.46 minutes (162.38 comparisons/minute).
[2022-04-12 17:30:32] INFO: 51 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 17:30:32] TASK: Placing 44 bacterial genomes into order-level reference tree 15 (5/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 17:31:43] INFO: Calculating RED values based on reference tree.
[2022-04-12 17:31:43] TASK: Traversing tree to determine classification method.
[2022-04-12 17:31:43] INFO: Completed 44 genomes in 0.01 seconds (3,718.58 genomes/second).
[2022-04-12 17:31:44] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 17:34:48] INFO: Completed 1,494 comparisons in 3.05 minutes (489.45 comparisons/minute).
[2022-04-12 17:34:48] INFO: 44 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 17:34:48] TASK: Placing 32 bacterial genomes into order-level reference tree 2 (6/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 17:37:27] INFO: Calculating RED values based on reference tree.
[2022-04-12 17:37:27] TASK: Traversing tree to determine classification method.
[2022-04-12 17:37:27] INFO: Completed 32 genomes in 0.02 seconds (1,742.63 genomes/second).
[2022-04-12 17:37:30] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 17:59:52] INFO: Completed 2,554 comparisons in 22.36 minutes (114.23 comparisons/minute).
[2022-04-12 17:59:52] INFO: 31 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 17:59:52] TASK: Placing 25 bacterial genomes into order-level reference tree 18 (7/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 18:03:08] INFO: Calculating RED values based on reference tree.
[2022-04-12 18:03:09] TASK: Traversing tree to determine classification method.
[2022-04-12 18:03:09] INFO: Completed 25 genomes in 0.01 seconds (4,211.32 genomes/second).
[2022-04-12 18:03:11] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 18:06:20] INFO: Completed 740 comparisons in 3.15 minutes (234.69 comparisons/minute).
[2022-04-12 18:06:20] INFO: 24 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 18:06:20] TASK: Placing 19 bacterial genomes into order-level reference tree 5 (8/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 18:09:36] INFO: Calculating RED values based on reference tree.
[2022-04-12 18:09:37] TASK: Traversing tree to determine classification method.
[2022-04-12 18:09:37] INFO: Completed 19 genomes in 0.05 seconds (361.44 genomes/second).
[2022-04-12 18:09:43] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 18:18:57] INFO: Completed 1,892 comparisons in 9.23 minutes (205.02 comparisons/minute).
[2022-04-12 18:18:57] INFO: 18 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 18:18:57] TASK: Placing 17 bacterial genomes into order-level reference tree 8 (9/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 18:21:31] INFO: Calculating RED values based on reference tree.
[2022-04-12 18:21:32] TASK: Traversing tree to determine classification method.
[2022-04-12 18:21:32] INFO: Completed 17 genomes in 0.05 seconds (327.16 genomes/second).
[2022-04-12 18:21:37] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 19:01:04] INFO: Completed 3,336 comparisons in 39.45 minutes (84.56 comparisons/minute).
[2022-04-12 19:01:04] INFO: 17 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 19:01:05] TASK: Placing 14 bacterial genomes into order-level reference tree 7 (10/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 19:02:45] INFO: Calculating RED values based on reference tree.
[2022-04-12 19:02:46] TASK: Traversing tree to determine classification method.
[2022-04-12 19:02:46] INFO: Completed 14 genomes in 0.01 seconds (1,125.79 genomes/second).
[2022-04-12 19:02:48] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 19:06:18] INFO: Completed 880 comparisons in 3.49 minutes (251.92 comparisons/minute).
[2022-04-12 19:06:18] INFO: 12 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 19:06:18] TASK: Placing 4 bacterial genomes into order-level reference tree 14 (11/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 19:08:49] INFO: Calculating RED values based on reference tree.
[2022-04-12 19:08:50] TASK: Traversing tree to determine classification method.
[2022-04-12 19:08:50] INFO: Completed 4 genomes in 0.00 seconds (19,463.13 genomes/second).
[2022-04-12 19:08:50] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 19:08:53] INFO: Completed 14 comparisons in 2.86 seconds (4.90 comparisons/second).
[2022-04-12 19:08:53] INFO: 4 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 19:08:53] TASK: Placing 3 bacterial genomes into order-level reference tree 16 (12/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 19:09:55] INFO: Calculating RED values based on reference tree.
[2022-04-12 19:09:55] TASK: Traversing tree to determine classification method.
[2022-04-12 19:09:55] INFO: Completed 3 genomes in 0.00 seconds (9,293.14 genomes/second).
[2022-04-12 19:09:55] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 19:10:06] INFO: Completed 30 comparisons in 10.72 seconds (2.80 comparisons/second).
[2022-04-12 19:10:06] INFO: 3 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 19:10:06] TASK: Placing 2 bacterial genomes into order-level reference tree 12 (13/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 19:11:37] INFO: Calculating RED values based on reference tree.
[2022-04-12 19:11:38] TASK: Traversing tree to determine classification method.
[2022-04-12 19:11:38] INFO: Completed 2 genomes in 0.00 seconds (7,025.63 genomes/second).
[2022-04-12 19:11:38] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 19:11:42] INFO: Completed 28 comparisons in 4.21 seconds (6.64 comparisons/second).
[2022-04-12 19:11:42] INFO: 2 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 19:11:42] TASK: Placing 1 bacterial genomes into order-level reference tree 20 (14/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 19:14:21] INFO: Calculating RED values based on reference tree.
[2022-04-12 19:14:22] TASK: Traversing tree to determine classification method.
[2022-04-12 19:14:22] INFO: Completed 1 genome in 0.00 seconds (2,734.23 genomes/second).
[2022-04-12 19:14:22] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 19:14:31] INFO: Completed 34 comparisons in 8.87 seconds (3.83 comparisons/second).
[2022-04-12 19:14:31] INFO: 1 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 19:14:31] TASK: Placing 1 bacterial genomes into order-level reference tree 10 (15/15) with pplacer using 4 CPUs (be patient).
[2022-04-12 19:17:18] INFO: Calculating RED values based on reference tree.
[2022-04-12 19:17:19] TASK: Traversing tree to determine classification method.
[2022-04-12 19:17:19] INFO: Completed 1 genome in 0.00 seconds (8,192.00 genomes/second).
[2022-04-12 19:17:19] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-12 19:17:20] INFO: Completed 4 comparisons in 1.14 seconds (3.50 comparisons/second).
[2022-04-12 19:17:20] INFO: 1 genome(s) have been classified using FastANI and pplacer.
[2022-04-12 19:17:20] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2022-04-12 19:17:20] INFO: Done.
[2022-04-12 19:17:21] INFO: Removing intermediate files.
[2022-04-12 19:17:38] INFO: Intermediate files removed.
[2022-04-12 19:17:38] INFO: Done.

pchaumeil commented 2 years ago

Hi Linda, would it be possible to share your genomes with us? Maybe not the whole set, but a small subset of genomes including the 5 disappearing ones.

Thanks, Pierre

lfenske-93 commented 2 years ago

Hi Pierre,

In the meantime, I took another look at the missing genomes and tried to have them analyzed individually. It is the same error every time: WARNING: Identified 0 single copy bacterial hits. Which of course explains why they don't show up in the table later.

Nevertheless, it would of course be helpful if this warning were also displayed when analyzing a large number of genomes at once. Unfortunately, the genomes are not listed in the failed.genomes.tsv, which makes the search for them somewhat tedious. 😀

hjruscheweyh commented 2 years ago

Hi all

Seeing a similar behaviour (don't worry, just reporting, not taking over the issue :)).

I ran a set of 11k mags through gtdbtk v2.0.0. One genome was missing from the output which creates a bit of a nightmare when grepping for results. I restarted gtdbtk on the single genome and got:

[2022-04-18 10:11:08] INFO: Aligning markers in 1 genomes with 32 CPUs.
[2022-04-18 10:11:08] INFO: Processing 1 genomes identified as bacterial.
[2022-04-18 10:11:21] INFO: Read concatenated alignment for 62,291 GTDB genomes.
[2022-04-18 10:11:21] TASK: Generating concatenated alignment for each marker.
[2022-04-18 10:11:26] INFO: Completed 1 genome in 0.01 seconds (77.28 genomes/second).
[2022-04-18 10:11:26] TASK: Aligning 7 identified markers using hmmalign 3.1b2 (February 2015).
[2022-04-18 10:11:32] INFO: Completed 7 markers in 0.43 seconds (16.23 markers/second).
[2022-04-18 10:11:32] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2022-04-18 10:13:47] INFO: Completed 62,292 sequences in 2.25 minutes (27,688.77 sequences/minute).
[2022-04-18 10:13:48] INFO: Masked bacterial alignment from 41,084 to 5,036 AAs.
[2022-04-18 10:13:48] INFO: 1 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2022-04-18 10:13:48] INFO: Creating concatenated alignment for 62,291 bacterial GTDB and user genomes.
[2022-04-18 10:14:15] INFO: All bacterial user genomes have been filtered out.

The warnings file and the failed.genomes file were empty.

Couldn't you just report this genome as Bacteria or Unclassified in the output file?

Best, Hans
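
For readers wondering what the "<10.0% of columns in filtered MSA" line means in practice, the check can be sketched as below. This is an illustration only and not GTDB-Tk's implementation: the function names are hypothetical, and treating `-` and `.` as the only gap characters is an assumption. The point is that the filtered genomes are returned rather than discarded, so they could be written out as unclassified instead of vanishing.

```python
def percent_aa(aligned_seq, gap_chars="-."):
    """Percentage of alignment columns that hold an amino acid, not a gap."""
    if not aligned_seq:
        return 0.0
    filled = sum(1 for char in aligned_seq if char not in gap_chars)
    return 100.0 * filled / len(aligned_seq)


def split_by_coverage(msa, min_perc_aa=10.0):
    """Split {genome_id: aligned_seq} into (kept, filtered) by column coverage.

    Genomes strictly below the threshold are filtered, mirroring the
    "<10.0%" wording in the log.
    """
    kept, filtered = {}, {}
    for genome_id, seq in msa.items():
        target = kept if percent_aa(seq) >= min_perc_aa else filtered
        target[genome_id] = seq
    return kept, filtered


if __name__ == "__main__":
    # Toy alignment with made-up IDs: 40% coverage versus 5% coverage.
    toy = {"genome_a": "ACDE------", "genome_b": "A" + "-" * 19}
    kept, filtered = split_by_coverage(toy)
    for genome_id in filtered:
        print(f"{genome_id}\tfiltered: insufficient amino acids in MSA")
```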

pchaumeil commented 2 years ago

In 2.1, all genomes are reported in the summary files. Also, a warning is raised if pplacer skips genomes.

SJohnsonMayo commented 2 years ago

I have 2.1 installed and am getting this same empty file / missing sample -- is it fixed only in 2.1.1?

pchaumeil commented 2 years ago

Hello, are your genomes missing during the pplacer step? Or are they missing after, and reported in the summary files? Could you please attach your gtdbtk.log file? Also, would it be possible to share the genomes you are trying to run?

From 2.1, missing genomes should be reported either by a warning in gtdbtk.log or as 'unclassified' in the summary files, so something is going wrong here.

Thanks