AnantharamanLab / METABOLIC

A scalable high-throughput metabolic and biogeochemical functional trait profiler

Empty Sankey Diagram & Lack of Taxonomy annotations by GTDB module #108

Closed Adrian-Zet closed 1 year ago

Adrian-Zet commented 1 year ago

Dear Metabolic Team,

I recently installed your pipeline and used it to map metatranscriptomic sequences against metagenome-assembled genomes (which I have already annotated and for which I have an inferred taxonomy). It all works very well, except that the Sankey diagram and the resulting taxonomy (both in the GTDB_summary and in other related files) are missing or empty.

I received only this error message every time I ran the script:

[2022-09-26 16:27:26] INFO: GTDB-Tk v1.6.0
[2022-09-26 16:27:26] INFO: gtdbtk classify_wf --cpus 80 -x fasta --genome_dir /home/paul/igridan/Metagenomics/01.Trimming_Assembly_Binning/05.ALL_MAGs/00.Definitive_MAGs/FF25/genomes-to-fasta/ --out_dir metabolic-FF2.5/intermediate_files/gtdbtk_Genome_files
[2022-09-26 16:27:26] INFO: Using GTDB-Tk reference data version r207: /home/paul/adrz/metabolic/db/release207/
[2022-09-26 16:27:26] INFO: Identifying markers in 93 genomes with 80 threads.
[2022-09-26 16:27:26] TASK: Running Prodigal V2.6.3 to identify genes.
[2022-09-26 16:28:26] INFO: Completed 93 genomes in 1.00 minutes (92.71 genomes/minute).
[2022-09-26 16:28:26] TASK: Identifying TIGRFAM protein families.
[2022-09-26 16:28:40] INFO: Completed 93 genomes in 13.88 seconds (6.70 genomes/second).
[2022-09-26 16:28:40] TASK: Identifying Pfam protein families.
[2022-09-26 16:28:41] INFO: Completed 93 genomes in 0.98 seconds (94.96 genomes/second).
[2022-09-26 16:28:41] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2022-09-26 16:28:41] TASK: Summarising identified marker genes.
[2022-09-26 16:28:43] INFO: Completed 93 genomes in 1.62 seconds (57.40 genomes/second).
[2022-09-26 16:28:43] INFO: Done.
[2022-09-26 16:28:43] ERROR: Uncontrolled exit resulting from an unexpected error.

================================================================================
EXCEPTION: FileNotFoundError
MESSAGE: [Errno 2] No such file or directory: '/home/paul/adrz/metabolic/db/release207/markers/pfam/individual_hmms/PF01868.17.hmm'

Traceback (most recent call last):
  File "/home/paul/anaconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/main.py", line 95, in main
    gt_parser.parse_options(args)
  File "/home/paul/anaconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/main.py", line 735, in parse_options
    self.align(options)
  File "/home/paul/anaconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/main.py", line 291, in align
    markers.align(options.identify_dir,
  File "/home/paul/anaconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/markers.py", line 455, in align
    ar122_marker_info_file = MarkerInfoFileAR122(out_dir, prefix)
  File "/home/paul/anaconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/io/marker_info.py", line 74, in __init__
    super().__init__(path, AR122_MARKERS)
  File "/home/paul/anaconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/io/marker_info.py", line 33, in __init__
    self.markers = self._parse_markers(markers)
  File "/home/paul/anaconda3/envs/METABOLIC_v4.0/lib/python3.8/site-packages/gtdbtk/io/marker_info.py", line 43, in _parse_markers
    with open(marker_path) as fh:
FileNotFoundError: [Errno 2] No such file or directory: '/home/paul/adrz/metabolic/db/release207/markers/pfam/individual_hmms/PF01868.17.hmm'

I already checked other issues, and I made sure that none of the inputs have spaces (" ") in their names. Please find attached a text file with the input of the MAGs: input.txt

The command I used to run the script was:

perl ~/adrz/metabolic/run/METABOLIC/METABOLIC-C.pl -in-gn ~/igridan/Metagenomics/01.Trimming_Assembly_Binning/05.ALL_MAGs/00.Definitive_MAGs/FF25/genomes-to-fasta/ -r reads.txt -rt metaT -st illumina -t 80 -o metabolic-FF2.5

Everything else works as intended. I need either a fix for the GTDB issue or a way to provide the already inferred taxonomy to the pipeline and skip that step altogether.


ChaoLab commented 1 year ago

Hi, can you test GTDB-Tk within the conda environment of METABOLIC? I noticed that you are using v1.6.0 with a release 207 database. That combination seems to conflict with the compatibility requirements listed here: https://ecogenomics.github.io/GTDBTk/installing/index.html (see the table at the bottom).
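For anyone who wants to check for this kind of mismatch from a log, here is a minimal Python sketch. It pulls the GTDB-Tk version and the reference data release out of log lines like the ones posted above and compares them against a small lookup table. The table, `MIN_VERSION_FOR_RELEASE`, and both function names are assumptions made for illustration; the only entry is the one this thread supports (r207 with 2.0.0), and the GTDB-Tk installation page remains the authoritative source.

```python
import re

# Assumption: minimum GTDB-Tk version per reference data release.
# Only the r207 entry is backed by this thread; check the GTDB-Tk
# installation page for the full, authoritative table.
MIN_VERSION_FOR_RELEASE = {"r207": (2, 0, 0)}


def parse_gtdbtk_log(log_text):
    """Return (version_tuple, release) found in a GTDB-Tk log; None if absent."""
    version_match = re.search(r"GTDB-Tk v(\d+)\.(\d+)\.(\d+)", log_text)
    release_match = re.search(r"reference data version (r\d+)", log_text)
    version = (
        tuple(int(part) for part in version_match.groups())
        if version_match
        else None
    )
    release = release_match.group(1) if release_match else None
    return version, release


def is_compatible(version, release):
    """True or False if the release is in the table, None if it is unknown."""
    minimum = MIN_VERSION_FOR_RELEASE.get(release)
    if version is None or minimum is None:
        return None
    return version >= minimum


log = (
    "[2022-09-26 16:27:26] INFO: GTDB-Tk v1.6.0\n"
    "[2022-09-26 16:27:26] INFO: Using GTDB-Tk reference data version r207: "
    "/home/paul/adrz/metabolic/db/release207/\n"
)
version, release = parse_gtdbtk_log(log)
print(version, release, is_compatible(version, release))
```

Returning `None` for releases missing from the table keeps the sketch from claiming compatibility it cannot verify.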

Adrian-Zet commented 1 year ago

Hello,

The version is indeed v1.6.0 with database release 207. I followed the conda installation instructions for METABOLIC. I also checked the YAML config file I used for the conda installation, and for GTDB-Tk it lists this: "- gtdbtk=1.6.0=pyhdfd78af_0"

Edit: The GTDB-Tk from the METABOLIC environment is indeed unusable. I am now trying to update it to version 2.0.0 (compatible with database release 207). However, I think I will need to create a new conda environment from a modified YAML that installs GTDB-Tk 2.0.0 from the start, since updating in place might corrupt the current environment.

Should I try to change the gtdbtk installed or use a different database?

EDIT: Dear Metabolic Team, after changing the YAML of the conda installation to install GTDB-Tk version 2.0.0 with database release 207, I finally managed to kill all the bugs in the pipeline. There is still a warning regarding a cat function, but it seems to be minor. Thanks for all the help. I am leaving this here in case other users experience the same issue.
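The YAML change described above can be sketched in Python as a small text rewrite. This is only an illustration: `repin_package` is a hypothetical helper, the two-line YAML string stands in for the real environment file, and the build string is dropped because the matching build for 2.0.0 is not given in this thread.

```python
import re


def repin_package(yaml_text, package, new_version):
    """Replace the '- package=...' pin in a conda environment YAML string."""
    pattern = re.compile(
        r"^(?P<indent>[ \t]*-[ \t]*)" + re.escape(package) + r"=\S+[ \t]*$",
        flags=re.MULTILINE,
    )
    # A function replacement avoids backslash handling in the new text.
    return pattern.sub(
        lambda m: f"{m.group('indent')}{package}={new_version}", yaml_text
    )


# Stand-in for the environment file; only the gtdbtk line is from the thread.
env = "dependencies:\n  - gtdbtk=1.6.0=pyhdfd78af_0\n"
print(repin_package(env, "gtdbtk", "2.0.0"))
```

The rewritten text would then be saved as a new YAML file and used to create a fresh conda environment, rather than updating the existing one in place.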