Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

NO results_with Classify_wf #539

Closed ramnageena11 closed 10 months ago

ramnageena11 commented 1 year ago

Hi,

  1. When I used the contigs file, no results were obtained.
  2. But when I made a subset and used it for the analysis, the following result was obtained.
  3. There is no data in the warnings folder.

Please suggest. Thanks, Ram

gtdbtk classify_wf --genome_dir /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/s2mags_contigs_16S/ --out_dir /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/gtdb-tk-S2mags_classify --keep_intermediates --cpu 32 -x fasta --mash_db /home/majorram/anaconda3/envs/gtdb-tk-1/share/gtdbtk-2.3.2/db/mash_db.msh
[2023-07-27 10:25:52] INFO: GTDB-Tk v2.3.2
[2023-07-27 10:25:52] INFO: gtdbtk classify_wf --genome_dir /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/s2mags_contigs_16S/ --out_dir /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/gtdb-tk-S2mags_classify --keep_intermediates --cpu 32 -x fasta --mash_db /home/majorram/anaconda3/envs/gtdb-tk-1/share/gtdbtk-2.3.2/db/mash_db.msh
[2023-07-27 10:25:52] INFO: Using GTDB-Tk reference data version r214: /home/majorram/anaconda3/envs/gtdb-tk-1/share/gtdbtk-2.3.2/db
[2023-07-27 10:25:53] INFO: Loading reference genomes.
[2023-07-27 10:25:53] INFO: Using Mash version 2.3
[2023-07-27 10:25:53] INFO: Creating Mash sketch file: /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/gtdb-tk-S2mags_classify/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-07-27 10:26:01] INFO: Completed 49 genomes in 7.85 seconds (6.24 genomes/second).
[2023-07-27 10:26:01] INFO: Loading data from existing Mash sketch file: /home/majorram/anaconda3/envs/gtdb-tk-1/share/gtdbtk-2.3.2/db/mash_db.msh
[2023-07-27 10:26:06] INFO: Calculating Mash distances.
[2023-07-27 10:26:17] INFO: Calculating ANI with FastANI v1.32.
[2023-07-27 10:26:28] INFO: Completed 196 comparisons in 10.95 seconds (17.90 comparisons/second).
[2023-07-27 10:26:33] INFO: Summary of results saved to: /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/gtdb-tk-S2mags_classify/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv
[2023-07-27 10:26:33] INFO: 2 genome(s) have been classified using the ANI pre-screening step.
[2023-07-27 10:26:33] INFO: Done.
[2023-07-27 10:26:33] INFO: Identifying markers in 47 genomes with 32 threads.
[2023-07-27 10:26:33] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-07-27 10:26:45] INFO: Completed 47 genomes in 12.07 seconds (3.89 genomes/second).
[2023-07-27 10:26:45] TASK: Identifying TIGRFAM protein families.
[2023-07-27 10:26:55] INFO: Completed 47 genomes in 9.36 seconds (5.02 genomes/second).
[2023-07-27 10:26:55] TASK: Identifying Pfam protein families.
[2023-07-27 10:26:56] INFO: Completed 47 genomes in 0.67 seconds (70.11 genomes/second).
[2023-07-27 10:26:56] INFO: Annotations done using HMMER 3.3.2 (Nov 2020).
[2023-07-27 10:26:56] TASK: Summarising identified marker genes.
[2023-07-27 10:26:56] INFO: Completed 47 genomes in 0.16 seconds (294.50 genomes/second).
[2023-07-27 10:26:56] INFO: Done.
[2023-07-27 10:26:56] INFO: Aligning markers in 47 genomes with 32 CPUs.
[2023-07-27 10:26:56] INFO: Processing 47 genomes identified as bacterial.
[2023-07-27 10:27:03] INFO: Read concatenated alignment for 80,789 GTDB genomes.
[2023-07-27 10:27:03] TASK: Generating concatenated alignment for each marker.
[2023-07-27 10:27:06] INFO: Completed 47 genomes in 0.04 seconds (1,168.61 genomes/second).
[2023-07-27 10:27:06] TASK: Aligning 116 identified markers using hmmalign 3.3.2 (Nov 2020).
[2023-07-27 10:27:11] INFO: Completed 116 markers in 1.58 seconds (73.23 markers/second).
[2023-07-27 10:27:11] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2023-07-27 10:28:48] INFO: Completed 80,816 sequences in 1.61 minutes (50,227.78 sequences/minute).
[2023-07-27 10:28:48] INFO: Masked bacterial alignment from 41,084 to 5,035 AAs.
[2023-07-27 10:28:48] INFO: 22 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2023-07-27 10:28:48] INFO: Creating concatenated alignment for 80,794 bacterial GTDB and user genomes.
[2023-07-27 10:29:19] INFO: Creating concatenated alignment for 5 bacterial user genomes.
[2023-07-27 10:29:20] INFO: Done.
[2023-07-27 10:29:20] TASK: Placing 5 bacterial genomes into backbone reference tree with pplacer using 32 CPUs (be patient).
[2023-07-27 10:29:20] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/gtdb-tk-S2mags_classify/classify/intermediate_results/gtdbtk.bac1
==> Step 2 of 9: Pre-masking sequences.
[2023-07-27 10:31:19] INFO: Calculating RED values based on reference tree.
[2023-07-27 10:31:20] INFO: 5 out of 5 have an class assignments. Those genomes will be reclassified.
[2023-07-27 10:31:20] TASK: Placing 3 bacterial genomes into class-level reference tree 1 (1/3) with pplacer using 32 CPUs (be patient).
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/gtdb-tk-S2mags_classify/classify/intermediate_results/pplacer/tre
==> Step 2 of 9: Pre-masking sequences.
[2023-07-27 10:37:13] INFO: Calculating RED values based on reference tree.
[2023-07-27 10:37:23] TASK: Traversing tree to determine classification method.
[2023-07-27 10:37:23] INFO: Completed 3 genomes in 0.00 seconds (4,503.55 genomes/second).
[2023-07-27 10:37:23] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-07-27 10:37:26] INFO: Completed 120 comparisons in 2.99 seconds (40.14 comparisons/second).
[2023-07-27 10:37:27] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-07-27 10:37:27] TASK: Placing 1 bacterial genomes into class-level reference tree 3 (2/3) with pplacer using 32 CPUs (be patient).
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/gtdb-tk-S2mags_classify/classify/intermediate_results/pplacer/tre
==> Step 2 of 9: Pre-masking sequences.
[2023-07-27 10:38:54] INFO: Calculating RED values based on reference tree.
[2023-07-27 10:38:58] TASK: Traversing tree to determine classification method.
[2023-07-27 10:39:04] INFO: Completed 1 genome in 0.00 seconds (7,667.83 genomes/second).
[2023-07-27 10:39:05] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-07-27 10:39:05] INFO: Completed 2 comparisons in 0.63 seconds (3.16 comparisons/second).
[2023-07-27 10:39:05] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-07-27 10:39:06] TASK: Placing 1 bacterial genomes into class-level reference tree 5 (3/3) with pplacer using 32 CPUs (be patient).
==> Running pplacer v1.1.alpha19-0-g807f6f3 analysis on /media/majorram/Analysis_Data/singhrn/medaka_final2/s2mags_genomes/gtdb-tk-S2mags_classify/classify/intermediate_results/pplacer/tre
==> Step 2 of 9: Pre-masking sequences.
[2023-07-27 10:40:07] INFO: Calculating RED values based on reference tree.
[2023-07-27 10:40:09] TASK: Traversing tree to determine classification method.
[2023-07-27 10:40:09] INFO: Completed 1 genome in 0.00 seconds (3,782.06 genomes/second).
[2023-07-27 10:40:09] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2023-07-27 10:40:10] INFO: Completed 20 comparisons in 0.79 seconds (25.39 comparisons/second).
[2023-07-27 10:40:10] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-07-27 10:40:17] WARNING: 43 of 27 genomes have a warning (see summary file).
[2023-07-27 10:40:17] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2023-07-27 10:40:17] INFO: Done.
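
In a log like the one above, the messages that report how many genomes pass each stage are easy to miss. As a rough sketch only, the filter below pulls out those messages so the drop-off is visible at a glance. The function name `gtdbtk_log_dropouts` and the log file argument are assumptions for illustration, and the patterns are matched against the message wording seen in this log; other GTDB-Tk versions may word them differently.

```shell
# Hypothetical helper: print the count-bearing messages from a saved GTDB-Tk log.
# Usage: gtdbtk_log_dropouts LOG_FILE
gtdbtk_log_dropouts() {
    # -o prints each match on its own line, so this also works when the
    # whole log has been pasted as one long line.
    grep -oE \
        -e '[0-9]+ genome\(s\) have been classified using [^.]*' \
        -e '[0-9]+ [a-z]+ user genomes have amino acids in <[0-9.]+% of columns in filtered MSA' \
        -e 'Placing [0-9]+ [a-z]+ genomes into backbone reference tree' \
        "$1"
}
```

For example, `gtdbtk_log_dropouts run.log` (where `run.log` is whatever file the console output was saved to) lists how many genomes were classified by the ANI screen, how many were flagged at the MSA filter, and how many reached pplacer.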

pchaumeil commented 1 year ago

Hello Ram, I am not sure what is happening here. Do you have a final summary file generated? Could you please provide the data you are trying to analyse? How complete/contaminated are these genomes?

Thanks, Pierre

ramnageena11 commented 1 year ago

Hi pchaumeil, I tried to send the files by email but it failed, and the files will not attach here either.

How should I share them? Thanks, Ram

pchaumeil commented 1 year ago

Do you have access to a cloud storage service (Dropbox, OneDrive, Google Drive...)? You can upload your data there and send me the link to download it.

ramnageena11 commented 1 year ago

Hi, Yes, I do have Google Drive and will upload and share.

Thanks


-- Ram Nageena Singh, Ph.D (Microbiology) Lab No. 3 Molecular Microbiology Lab Division of Microbiology ICAR-Indian Agricultural Research Institute Pusa, New Delhi-110012, Delhi, India

ramnageena11 commented 1 year ago

Hi, please find the link with the files and folders: https://drive.google.com/drive/folders/18o8vaa3VzF2ZAynFuw4E_jUP4p8Y91Jg?usp=drive_link

  1. gTDB-tk-1_S2_classify: result folder generated for s2_metagenomeAssembly.fasta
  2. s2_metagenomeAssembly.fasta : Assembled metagenome contigs
  3. CheckM_summary_S2metagenome_table: Completeness and contamination analysis of metagenomes
  4. s2mags_contigs.fasta : MAGs contigs, a subset of s2metagenomeAssembly.fasta

I have a total of 4 metagenome assemblies to analyze. Let me know if you need more information.

Thanks and regards, Ram

pchaumeil commented 1 year ago

Hello Ram, Looking at the high contamination in the CheckM output file and at the GTDB-Tk output files, it seems you are trying to analyse a complete assembly of the entire sample that hasn't been binned into MAGs. This file will not give you any results in GTDB-Tk because all markers are duplicated. I would recommend running a tool like MetaBAT2 or MaxBin2 to bin your data before running GTDB-Tk and CheckM.

ramnageena11 commented 1 year ago

Hi, Thanks, I'll look into it. Regarding MetaBAT2 or MaxBin2: this assembly was generated from Nanopore long reads, and I could not fit it to the Illumina-oriented parameters to bin the sample. Regarding duplicate markers: do you think it will work if I run a single genome file? I checked a single genome in CheckM and it found no marker genes, maybe due to the short length of the MAGs.

Thanks and regards,

Ram

Ram Singh, Postdoctoral Research Scientist

Science Communication Fellow (SD Discovery Center)

Karen M. Swindler Department of Chemical & Biological Engineering

South Dakota Mines

501 E. Saint Joseph St., Rapid City, SD 57701

605.394.1730 | @.***


donovan-h-parks commented 1 year ago

Hi Ram,

CheckM expects prokaryotic genomes (MAGs, SAGs, or isolates) as input. It will estimate the quality of these genomes using single-copy marker genes. Extremely poor quality genomes can result in no marker genes being identified.

Cheers, Donovan
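
To make the single-copy marker idea concrete, here is a deliberately simplified sketch. It is not CheckM's actual algorithm, which works with lineage-specific, collocated marker sets; it only shows why markers that are expected once per genome signal mixed input when they appear several times, and why a genome with no markers found cannot be assessed. The function name `marker_summary` and the two-column input file (marker ID, copies found) are hypothetical.

```shell
# Hypothetical helper: toy completeness/duplication summary from marker copy counts.
# Usage: marker_summary COUNTS_TSV
# COUNTS_TSV is an assumed two-column file: marker_id <TAB> copies_found
marker_summary() {
    awk -F '\t' '
        { total++ }
        $2 >= 1 { present++ }          # marker seen at least once
        $2 > 1  { extra += $2 - 1 }    # copies beyond the expected single copy
        END {
            if (total == 0) { print "no markers listed"; exit }
            printf "completeness=%.1f%% duplication=%.1f%%\n",
                   100 * present / total, 100 * extra / total
        }
    ' "$1"
}
```

With four markers found 1, 3, 0 and 2 times, this reports 75.0% completeness and 75.0% duplication. An unbinned assembly containing many organisms would push the duplication figure far higher.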

ramnageena11 commented 1 year ago

Hi Donovan, No marker genes: does that also depend on genome length?

I was not able to bin the metagenome because the assembly was generated from Nanopore sequencing data and no software is available for binning Nanopore data.

I have extracted 16S rRNA genes from the metagenome contigs (using barrnap) and used them for BLAST, but the same contigs were not identified by GTDB-Tk even though the 16S rRNA gene is present. What could be the reason?

Thanks Ram


donovan-h-parks commented 1 year ago

Hi Ram,

Identifying genes and thus marker genes depends on the contigs provided to CheckM, but not directly on genome length. Both CheckM and GTDB-Tk assume the input is a genome. Providing individual contigs (unless they represent an appreciable portion of a genome) or full assemblies of a metagenomic sample is NOT recommended.

GTDB-Tk does not consider 16S rRNA sequences. I encourage you to read the GTDB-Tk manuscript and documentation to better understand how it operates.

Cheers, Donovan
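
Since both tools assume each input file is one genome, a quick look at what each FASTA file contains can help before running them. The sketch below is an illustration only: the function name `fasta_stats` and its arguments are assumptions, and it simply counts header lines and sequence characters per file. A file holding a single short contig, or one far larger than the few megabases typical of a prokaryotic genome, is probably not a single genome.

```shell
# Hypothetical helper: contig count and total length for each FASTA file in a directory.
# Usage: fasta_stats DIR EXT
# Prints: file_name <TAB> contigs <TAB> total_bp
fasta_stats() {
    for f in "$1"/*."$2"; do
        [ -e "$f" ] || continue        # skip when the glob matches nothing
        awk -v name="$(basename "$f")" '
            /^>/ { contigs++; next }                     # header line starts a new contig
            { gsub(/[ \t\r]/, ""); bp += length($0) }    # sequence line, whitespace stripped
            END { printf "%s\t%d\t%d\n", name, contigs, bp }
        ' "$f"
    done
}
```

For example, `fasta_stats /path/to/genome_dir fasta` (the path is a placeholder) prints one line per file, which can be compared against the directory and extension passed to `--genome_dir` and `-x`.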

ramnageena11 commented 1 year ago

Hi Donovan,

Yes, my point was the same: if the provided contig does not have marker genes, it will not be identified/classified. Since I am limited in binning the Nanopore assembly, what do you suggest? As I mentioned, I extracted 16S rRNA genes as a strategy and used an alternative method to classify the MAGs.

Thanks Ram

