Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

failed_genomes.tsv No such file or directory #533

Closed: surh closed this issue 1 year ago

surh commented 1 year ago

I'm trying to set up GTDB-Tk on our cluster. It halts when it looks for the failed_genomes.tsv file, which doesn't exist. Details are below.

The error I get is

EXCEPTION: FileNotFoundError
  MESSAGE: [Errno 2] No such file or directory: 'genomes_taxonomy/identify/gtdbtk.failed_genomes.tsv'

Environment

I'm using a conda environment created explicitly (via mamba) with:

mamba create -n gtdbtk_2.2.3 -c conda-forge -c bioconda gtdbtk=2.2.3

Here is my environment list of packages

$ mamba list
# packages in environment at /mnt/atgc-d2/sur/modules/pkgs/mamba/main/envs/gtdbtk_2.2.3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
boost                     1.70.0           py38h9de70de_1    conda-forge
boost-cpp                 1.70.0               h7b93d67_3    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
capnproto                 0.10.2               h6239696_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
dendropy                  4.5.2              pyh3252c3a_0    bioconda
fastani                   1.32                 he1c1bb9_0    bioconda
fasttree                  2.1.11               hec16e2b_1    bioconda
gsl                       2.7                  he838d99_0    conda-forge
gtdbtk                    2.2.3              pyhdfd78af_1    bioconda
hmmer                     3.3.2                h87f3376_2    bioconda
icu                       67.1                 he1b5a44_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
mash                      2.3                  hd3113c8_4    bioconda
ncurses                   6.3                  h27087fc_1    conda-forge
numpy                     1.24.2           py38h10c12cc_0    conda-forge
openssl                   1.1.1t               h0b41bf4_0    conda-forge
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
pplacer                   1.1.alpha19          h9ee0642_2    bioconda
prodigal                  2.6.3                hec16e2b_4    bioconda
pydantic                  1.10.5           py38h1de0b5d_0    conda-forge
python                    3.8.15          h257c98d_0_cpython    conda-forge
python_abi                3.8                      3_cp38    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
setuptools                67.4.0             pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tqdm                      4.64.1             pyhd8ed1ab_0    conda-forge
typing-extensions         4.4.0                hd8ed1ab_0    conda-forge
typing_extensions         4.4.0              pyha770c72_0    conda-forge
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               h166bdaf_4    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge

Server information

Server is CentOS Linux release 7.9.2009 (Core), and I requested 100GB of RAM to run this process.

Debugging information

Here is log output I get

Thu Jul 13 18:33:33 CDT 2023
===== Beginning pipeline =====
[2023-07-13 18:33:38] INFO: GTDB-Tk v2.2.3
[2023-07-13 18:33:38] INFO: gtdbtk classify_wf --genome_dir indir/ --out_dir genomes_taxonomy --mash_db mash_db/ --pplacer_cpus 8
[2023-07-13 18:33:38] INFO: Using GTDB-Tk reference data version r207: /mnt/atgc-d2/sur/modules/pkgs/mamba/main/envs/gtdbtk_2.2.3/share/gtdbtk-2.2.3/db
[2023-07-13 18:33:39] INFO: Loading reference genomes.
[2023-07-13 18:33:40] INFO: Using Mash version 2.3
[2023-07-13 18:33:40] INFO: Loading data from existing Mash sketch file: genomes_taxonomy/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-07-13 18:33:40] INFO: Loading data from existing Mash sketch file: mash_db/gtdb_ref_sketch.msh
[2023-07-13 18:33:50] INFO: Calculating Mash distances.
[2023-07-13 18:34:09] INFO: Calculating ANI with FastANI v1.32.
[2023-07-13 18:36:44] INFO: Completed 26 comparisons in 2.58 minutes (10.06 comparisons/minute).
[2023-07-13 18:36:45] INFO: Summary of results saved to: genomes_taxonomy/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv
[2023-07-13 18:36:45] INFO: 2 genome(s) have been classified using the ANI pre-screening step.
[2023-07-13 18:36:45] INFO: Done.
[2023-07-13 18:36:45] INFO: Identifying markers in 0 genomes with 1 threads.
[2023-07-13 18:36:45] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-07-13 18:36:46] ERROR: Uncontrolled exit resulting from an unexpected error.

================================================================================
EXCEPTION: FileNotFoundError
  MESSAGE: [Errno 2] No such file or directory: 'genomes_taxonomy/identify/gtdbtk.failed_genomes.tsv'
________________________________________________________________________________

Traceback (most recent call last):
  File "/mnt/atgc-d2/sur/modules/pkgs/mamba/main/envs/gtdbtk_2.2.3/lib/python3.8/site-packages/gtdbtk/__main__.py", line 99, in main
    gt_parser.parse_options(args)
  File "/mnt/atgc-d2/sur/modules/pkgs/mamba/main/envs/gtdbtk_2.2.3/lib/python3.8/site-packages/gtdbtk/main.py", line 1108, in parse_options
    self.identify(options,classified_genomes)
  File "/mnt/atgc-d2/sur/modules/pkgs/mamba/main/envs/gtdbtk_2.2.3/lib/python3.8/site-packages/gtdbtk/main.py", line 316, in identify
    reports = markers.identify(genomes,
  File "/mnt/atgc-d2/sur/modules/pkgs/mamba/main/envs/gtdbtk_2.2.3/lib/python3.8/site-packages/gtdbtk/markers.py", line 205, in identify
    genome_dictionary = prodigal.run(genomes, tln_tables)
  File "/mnt/atgc-d2/sur/modules/pkgs/mamba/main/envs/gtdbtk_2.2.3/lib/python3.8/site-packages/gtdbtk/external/prodigal.py", line 231, in run
    fails = open(self.failed_genomes_file,'w')
FileNotFoundError: [Errno 2] No such file or directory: 'genomes_taxonomy/identify/gtdbtk.failed_genomes.tsv'
================================================================================
===== Pipeline done =====
Thu Jul 13 18:36:46 CDT 2023
[sur@chromatin today3]$ 
pchaumeil commented 1 year ago

Hello, it seems both genomes have already been classified by the ANI pre-screening step, so there are no genomes left for the rest of the pipeline to process. The log line "[2023-07-13 18:36:45] INFO: Identifying markers in 0 genomes with 1 threads." shows this, and that is what causes GTDB-Tk to stop.

We will implement a better and more explicit exit of the program in the next release.
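
For readers following the traceback: the crash comes from open(self.failed_genomes_file,'w'), and Python raises FileNotFoundError for a write-mode open only when the parent directory is missing. A plausible reading, which is an assumption and not something confirmed in this thread, is that genomes_taxonomy/identify/ is never created when zero genomes reach the identify step. The sketch below reproduces that failure mode in a temporary directory; open_failed_genomes_file is a hypothetical helper, not GTDB-Tk code, and it does not represent the fix the maintainers shipped.

```python
import os
import tempfile


def open_failed_genomes_file(path, create_parent=False):
    """Open `path` for writing, optionally creating its parent directory first.

    Hypothetical helper for illustration only; it is not part of GTDB-Tk.
    """
    if create_parent:
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
    return open(path, "w")


with tempfile.TemporaryDirectory() as out_dir:
    # Same file name as in the error message; the 'identify' directory is absent.
    target = os.path.join(out_dir, "identify", "gtdbtk.failed_genomes.tsv")

    try:
        open_failed_genomes_file(target).close()
        reproduced = False
    except FileNotFoundError:
        # Write mode does not create missing parent directories.
        reproduced = True

    # Creating the parent directory first lets the same open() call succeed.
    open_failed_genomes_file(target, create_parent=True).close()
    created = os.path.exists(target)

print("reproduced:", reproduced, "created:", created)
```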

surh commented 1 year ago

Oh, I completely missed that. Thank you!

I also see that this will cause me problems: I was planning to run this as a step of a Nextflow pipeline, and if it exits with an error the pipeline will stop as well. Is there any way around this while we wait for the next release?

I can imagine that --skip_ani_screen should force it to go through all the steps, though it would be unnecessarily slower.
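
One possible stopgap, sketched here under assumptions rather than as a recommended or official approach, is to have a wrapper inspect the captured GTDB-Tk output and treat this specific failure as non-fatal. The function name is_ani_only_failure and the three patterns are hypothetical; the patterns are copied from the v2.2.3 log lines pasted above and may not match other versions.

```python
import re

# Patterns taken from the v2.2.3 log in this issue (assumed stable only for that version).
ANI_CLASSIFIED = re.compile(
    r"INFO: \d+ genome\(s\) have been classified using the ANI pre-screening step\."
)
ZERO_GENOMES = re.compile(r"INFO: Identifying markers in 0 genomes\b")
MISSING_FILE = re.compile(r"FileNotFoundError: .*gtdbtk\.failed_genomes\.tsv")


def is_ani_only_failure(log_text):
    """Return True only if all three signatures of this issue appear in the log."""
    return all(
        pattern.search(log_text)
        for pattern in (ANI_CLASSIFIED, ZERO_GENOMES, MISSING_FILE)
    )


sample_log = """\
[2023-07-13 18:36:45] INFO: 2 genome(s) have been classified using the ANI pre-screening step.
[2023-07-13 18:36:45] INFO: Identifying markers in 0 genomes with 1 threads.
[2023-07-13 18:36:46] ERROR: Uncontrolled exit resulting from an unexpected error.
FileNotFoundError: [Errno 2] No such file or directory: 'genomes_taxonomy/identify/gtdbtk.failed_genomes.tsv'
"""

unrelated_log = """\
[2023-07-13 18:36:45] INFO: Identifying markers in 5 genomes with 1 threads.
[2023-07-13 18:36:46] ERROR: Uncontrolled exit resulting from an unexpected error.
"""

benign = is_ani_only_failure(sample_log)
other = is_ani_only_failure(unrelated_log)
print(benign, other)
```

A wrapper could exit with status 0 when this check passes and propagate the original exit code otherwise, so that unrelated errors still stop the pipeline. In that case the classifications are the ones reported in genomes_taxonomy/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv, as the log above indicates.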

pchaumeil commented 1 year ago

Hello, I actually already fixed this problem in the latest release of GTDB-Tk (v2.3.2 in the run below) :)

>> gtdbtk classify_wf --genome_dir genomes/ --out_dir genomes_taxonomy --mash_db mash_db/ --pplacer_cpus 8 --cpus 30 
[2023-07-18 08:52:46] INFO: GTDB-Tk v2.3.2
[2023-07-18 08:52:46] INFO: gtdbtk classify_wf --genome_dir genomes/ --out_dir genomes_taxonomy --mash_db mash_db/ --pplacer_cpus 8 --cpus 30
[2023-07-18 08:52:46] INFO: Using GTDB-Tk reference data version r214: /srv/db/gtdbtk/official/release214
[2023-07-18 08:52:46] INFO: Loading reference genomes.
[2023-07-18 08:52:47] INFO: Using Mash version 2.3
[2023-07-18 08:52:48] INFO: Creating Mash sketch file: genomes_taxonomy/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-07-18 08:52:48] INFO: Completed 2 genomes in 0.14 seconds (13.83 genomes/second).
[2023-07-18 08:52:48] INFO: Creating Mash sketch file: mash_db/gtdb_ref_sketch.msh
[2023-07-18 09:06:04] INFO: Completed 85,205 genomes in 13.28 minutes (6,418.24 genomes/minute).
[2023-07-18 09:06:04] INFO: Calculating Mash distances.
[2023-07-18 09:06:10] INFO: Calculating ANI with FastANI v1.32.
[2023-07-18 09:06:13] INFO: Completed 40 comparisons in 3.29 seconds (12.16 comparisons/second).
[2023-07-18 09:06:14] INFO: Summary of results saved to: genomes_taxonomy/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv
[2023-07-18 09:06:14] INFO: 2 genome(s) have been classified using the ANI pre-screening step.
[2023-07-18 09:06:14] INFO: Done.
[2023-07-18 09:06:14] INFO: All genomes have been classified by the ANI screening step, Identify and Align steps will be skipped.
[2023-07-18 09:06:15] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2023-07-18 09:06:15] INFO: Done.
[2023-07-18 09:06:15] INFO: Removing intermediate files.
[2023-07-18 09:06:15] INFO: Intermediate files removed.
[2023-07-18 09:06:15] INFO: Done.