Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

Error occurred using gtdbtk de_novo_wf for bins #559

Closed jojyjohn28 closed 10 months ago

jojyjohn28 commented 10 months ago

Dear GTDB-Tk team,

I am using gtdbtk de_novo_wf to analyze a set of bin files in order to obtain a tree file including only the genomes from the sample, with the following command:

gtdbtk de_novo_wf --genome_dir /zfs/camplab/Jojy/darpa_working/drep-output_directory/dereplicated_genomes --out_dir /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/de_novo_new --extension fa --bacteria --gtdbtk_classification_file /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/gtdbtk.bac120.summary.tsv --cpus 40 --outgroup_taxon p__Chloroflexota --skip_gtdb_refs --custom_taxonomy_file /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/CUSTOM_TAXONOMY_FILE

These genomes have already been analyzed with classify_wf, and the taxonomy information was obtained. I therefore used gtdbtk.bac120.summary.tsv as the --gtdbtk_classification_file, and I made a custom taxonomy file from the same summary. Both are attached.

During the run I got the following error:

    [2023-11-20 12:55:35] INFO: Read custom taxonomy for 45 genomes.
    [2023-11-20 12:55:35] INFO: Reassigned taxonomy for 45 GTDB representative genomes.
    [2023-11-20 12:55:35] ERROR: GTDB-Tk classification and custom taxonomy files must not specify taxonomies for the same genomes.
    [2023-11-20 12:55:35] ERROR: These files have 45 genomes in common.
    [2023-11-20 12:55:35] ERROR: Example duplicate genome: bin.18
    [2023-11-20 12:55:35] ERROR: Duplicated taxonomy information.
    [2023-11-20 12:55:35] ERROR: Controlled exit resulting from an unrecoverable error or warning.

My aim is to generate a tree with my bins from different samples. These are all dereplicated bins, and when I closely checked the taxonomy files for duplication, some of them are duplicates but the genus/species is different.

If I remove all duplicated bins from the custom taxonomy file, will it affect the results, or will it exclude all the removed bins from the tree?

Can you help me with this problem?

Thanks in advance,
Jojy

Attachments: Custom_taxonomy_file.csv, gtdbtk.bac120.summary(3).csv

I am running the tool on an HPC system.

## Environment

conda list --revisions

2023-04-14 14:06:31 (rev 0)
    +_libgcc_mutex-0.1 (conda-forge/linux-64)
    +_openmp_mutex-4.5 (conda-forge/linux-64)
    +bzip2-1.0.8 (conda-forge/linux-64)
    +ca-certificates-2022.12.7 (conda-forge/linux-64)
    +ld_impl_linux-64-2.40 (conda-forge/linux-64)
    +libexpat-2.5.0 (conda-forge/linux-64)
    +libffi-3.4.2 (conda-forge/linux-64)
    +libgcc-ng-12.2.0 (conda-forge/linux-64)
    +libgomp-12.2.0 (conda-forge/linux-64)
    +libnsl-2.0.0 (conda-forge/linux-64)
    +libsqlite-3.40.0 (conda-forge/linux-64)
    +libuuid-2.38.1 (conda-forge/linux-64)
    +libzlib-1.2.13 (conda-forge/linux-64)
    +ncurses-6.3 (conda-forge/linux-64)
    +openssl-3.1.0 (conda-forge/linux-64)
    +pip-23.0.1 (conda-forge/noarch)
    +python-3.11.3 (conda-forge/linux-64)
    +readline-8.2 (conda-forge/linux-64)
    +setuptools-67.6.1 (conda-forge/noarch)
    +tk-8.6.12 (conda-forge/linux-64)
    +tzdata-2023c (conda-forge/noarch)
    +wheel-0.40.0 (conda-forge/noarch)
    +xz-5.2.6 (conda-forge/linux-64)

2023-04-14 14:07:23 (rev 1)
    +sqlite-3.40.0 (conda-forge/linux-64)

Server information

`gtdbtk.log` has been included

gtdbtk.log

My aim is to create a tree like the one below (source: Bandla et al., 2020).

image

pchaumeil commented 10 months ago

Hello,

GTDB-Tk does not accept duplicate genomes across the taxonomy files. If the bin IDs are present in both the custom taxonomy file and the classify_wf summary file, the software does not know which taxonomy string to select.

Removing the duplicated bins from the custom taxonomy file will not remove them from the tree as they are still in the classify_wf summary file. You will still have them in the final tree.
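To see which bins are listed in both files before rerunning, a short check along these lines may help. This is only a sketch under assumptions: it assumes both files are tab-separated with the genome ID in the first column, that the summary file has a header row and the custom taxonomy file does not, and it uses small in-memory stand-ins (with shortened, made-up taxonomy strings) instead of the real files. The helper name `read_ids` is hypothetical and not part of GTDB-Tk.

```python
import csv
import io


def read_ids(handle, has_header):
    """Return the set of genome IDs found in the first tab-separated column."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    return {row[0] for row in reader if row}


# Hypothetical stand-ins for the two files; with real data, pass open(path) instead.
summary_tsv = io.StringIO(
    "user_genome\tclassification\n"
    "bin.18\td__Bacteria;p__Chloroflexota\n"
    "bin.21\td__Bacteria;p__Pseudomonadota\n"
)
custom_tsv = io.StringIO(
    "bin.18\td__Bacteria;p__Chloroflexota\n"
    "bin.99\td__Bacteria;p__Bacillota\n"
)

summary_ids = read_ids(summary_tsv, has_header=True)
custom_ids = read_ids(custom_tsv, has_header=False)

# Genome IDs present in both files are the ones GTDB-Tk would reject.
shared = sorted(summary_ids & custom_ids)
print(f"{len(shared)} genome(s) in both files: {shared}")
```

Any ID reported here would need to be kept in only one of the two files.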

jojyjohn28 commented 10 months ago

Thank you so much for the reply.

I removed the classify_wf summary file and used only the custom taxonomy file, but now I end up with another error, shown below.

    ================================================================================
    EXCEPTION: TypeError
    MESSAGE: Population must be a sequence. For dicts or sets, use sorted(d).

    Traceback (most recent call last):
      File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/main.py", line 101, in main
        gt_parser.parse_options(args)
      File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/main.py", line 1051, in parse_options
        self.root(options)
      File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/main.py", line 776, in root
        reports = reroot.root_with_outgroup(options.input_tree,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/reroot_tree.py", line 83, in root_with_outgroup
        rnd_ingroup = random.sample(ingroup_leaves, 1)[0]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/random.py", line 439, in sample
        raise TypeError("Population must be a sequence. "
    TypeError: Population must be a sequence. For dicts or sets, use sorted(d).

I am really new to GTDB-Tk. Can you please help me with this? I have already updated the aligner too. I am working on an HPC system.

Thank you in advance,

pchaumeil commented 10 months ago

Hello, Do you have the same issue when running GTDB-Tk with Python 3.8 instead of 3.11? I suspect there is a change in the random library in Python 3.11 which causes this problem. GTDB-Tk hasn't been tested with the latest Python releases but we will modify the code to deal with these recent changes.
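For context, `random.sample()` stopped accepting sets in Python 3.11 (sampling from a set had been deprecated since 3.9), which matches the `TypeError` in the traceback above. A minimal sketch of one way to sidestep this is to turn the collection into a list before picking an element. The helper name `pick_one` and the sample leaf labels are hypothetical, and this is not necessarily how GTDB-Tk itself handles it.

```python
import random


def pick_one(population):
    """Pick one random element from any iterable, including a set."""
    # random.sample() requires a sequence on Python 3.11+, so build a list first.
    candidates = list(population)
    if not candidates:
        raise ValueError("cannot sample from an empty population")
    return random.choice(candidates)


ingroup_leaves = {"a", "b", "c"}  # hypothetical leaf labels
rnd_ingroup = pick_one(ingroup_leaves)
print(rnd_ingroup)
```

Using `list()` rather than `sorted()` avoids requiring that the elements be orderable, at the cost of a selection that is not reproducible across runs even with a fixed seed, since set iteration order can vary.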

hunglin59638 commented 10 months ago

Yes, it is caused by the Python version.

Python 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> ingroup_leaves = {"a", "b"}
>>> random.sample(ingroup_leaves, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/XXX/mambaforge/envs/XXX/lib/python3.11/random.py", line 436, in sample
    raise TypeError("Population must be a sequence.  "
TypeError: Population must be a sequence.  For dicts or sets, use sorted(d).

Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> ingroup_leaves = {"a", "b"}
>>> random.sample(ingroup_leaves, 1)
<stdin>:1: DeprecationWarning: Sampling from a set deprecated
since Python 3.9 and will be removed in a subsequent version.
['a']

jojyjohn28 commented 10 months ago

I think the problem is associated with the Python version, as both of you mentioned. I could not change the Python version because I am working on an HPC system. Instead, I created another environment, installed the latest version of GTDB-Tk and the aligner (MSA), and ran the same gtdbtk de_novo_wf command without the classification summary. Everything worked fine. At the end I converted the resulting tree to iTOL format.

Yay! I was able to make a tree very similar to the one I showed in my first question.

Thank you very much for this wonderful tool and for the timely response and help.

Thank you once again, team.