Error occurred using gtdbtk de_novo_wf for bins

jojyjohn28 commented 10 months ago

Dear GTDB-Tk team,

I am using gtdbtk de_novo_wf to analyze a set of bin files, in order to obtain a tree file that includes only the genomes from the sample, with the following command:

gtdbtk de_novo_wf --genome_dir /zfs/camplab/Jojy/darpa_working/drep-output_directory/dereplicated_genomes --out_dir /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/de_novo_new --extension fa --bacteria --gtdbtk_classification_file /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/gtdbtk.bac120.summary.tsv --cpus 40 --outgroup_taxon p__Chloroflexota --skip_gtdb_refs --custom_taxonomy_file /zfs/camplab/Jojy/darpa_working/gtdbtk_oct2023/CUSTOM_TAXONOMY_FILE

These genomes have already been analyzed with classify_wf, and the taxonomy information has been obtained. So I used gtdbtk.bac120.summary.tsv as the --gtdbtk_classification_file, and I made a custom taxonomy file from the same summary. Both are attached.

During the run I got the following error:

[2023-11-20 12:55:35] INFO: Read custom taxonomy for 45 genomes.
[2023-11-20 12:55:35] INFO: Reassigned taxonomy for 45 GTDB representative genomes.
[2023-11-20 12:55:35] ERROR: GTDB-Tk classification and custom taxonomy files must not specify taxonomies for the same genomes.
[2023-11-20 12:55:35] ERROR: These files have 45 genomes in common.
[2023-11-20 12:55:35] ERROR: Example duplicate genome: bin.18
[2023-11-20 12:55:35] ERROR: Duplicated taxonomy information.
[2023-11-20 12:55:35] ERROR: Controlled exit resulting from an unrecoverable error or warning.

My aim is to generate a tree with my bins from different samples. These are all de-replicated bins, and when I closely checked the taxonomy files for duplication, some of them are duplicates but the genus/species is different.

If I remove all duplicated bins from the custom taxonomy file, will it interfere with the results, or will it exclude all the removed bins from the tree?

Can you help me with this problem?

Thanks in advance,
Jojy

Custom_taxonomy_file.csv
gtdbtk.bac120.summary(3).csv

I am running the tool on HPC.

## Environment

[ ] Using a conda environment (include the output of conda list && conda list --revisions)
[ ] conda list
packages in environment at /zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311:

| Name | Version | Build | Channel |
| --- | --- | --- | --- |
| _libgcc_mutex | 0.1 | conda_forge | conda-forge |
| _openmp_mutex | 4.5 | 2_gnu | conda-forge |
| biopython | 1.81 | pypi_0 | pypi |
| bitarray | 2.7.3 | pypi_0 | pypi |
| bzip2 | 1.0.8 | h7f98852_4 | conda-forge |
| ca-certificates | 2022.12.7 | ha878542_0 | conda-forge |
| certifi | 2023.5.7 | pypi_0 | pypi |
| charset-normalizer | 3.1.0 | pypi_0 | pypi |
| checkm-genome | 1.2.2 | pypi_0 | pypi |
| contourpy | 1.0.7 | pypi_0 | pypi |
| cycler | 0.11.0 | pypi_0 | pypi |
| cython | 0.29.34 | pypi_0 | pypi |
| dendropy | 4.5.2 | pypi_0 | pypi |
| drep | 3.4.3 | pypi_0 | pypi |
| ete3 | 3.1.2 | pypi_0 | pypi |
| fonttools | 4.39.3 | pypi_0 | pypi |
| gtdbtk | 2.2.6 | pypi_0 | pypi |
| idna | 3.4 | pypi_0 | pypi |
| iniconfig | 2.0.0 | pypi_0 | pypi |
| joblib | 1.2.0 | pypi_0 | pypi |
| kiwisolver | 1.4.4 | pypi_0 | pypi |
| ld_impl_linux-64 | 2.40 | h41732ed_0 | conda-forge |
| libexpat | 2.5.0 | hcb278e6_1 | conda-forge |
| libffi | 3.4.2 | h7f98852_5 | conda-forge |
| libgcc-ng | 12.2.0 | h65d4601_19 | conda-forge |
| libgomp | 12.2.0 | h65d4601_19 | conda-forge |
| libnsl | 2.0.0 | h7f98852_0 | conda-forge |
| libsqlite | 3.40.0 | h753d276_0 | conda-forge |
| libuuid | 2.38.1 | h0b41bf4_0 | conda-forge |
| libzlib | 1.2.13 | h166bdaf_4 | conda-forge |
| mash | 1.14 | pypi_0 | pypi |
| matplotlib | 3.7.1 | pypi_0 | pypi |
| meme | 2.0.0 | pypi_0 | pypi |
| metachip | 1.10.12 | pypi_0 | pypi |
| munkres | 1.1.4 | pypi_0 | pypi |
| ncurses | 6.3 | h27087fc_1 | conda-forge |
| networkx | 2.8.8 | pypi_0 | pypi |
| numpy | 1.24.2 | pypi_0 | pypi |
| openssl | 3.1.0 | h0b41bf4_0 | conda-forge |
| packaging | 23.1 | pypi_0 | pypi |
| pandas | 2.0.0 | pypi_0 | pypi |
| pillow | 9.5.0 | pypi_0 | pypi |
| pip | 23.0.1 | pyhd8ed1ab_0 | conda-forge |
| pluggy | 1.0.0 | pypi_0 | pypi |
| pong | 1.5 | pypi_0 | pypi |
| pydantic | 1.10.7 | pypi_0 | pypi |
| pyfasta | 0.5.2 | pypi_0 | pypi |
| pyparsing | 3.0.9 | pypi_0 | pypi |
| pysam | 0.21.0 | pypi_0 | pypi |
| pytest | 7.3.0 | pypi_0 | pypi |
| python | 3.11.3 | h2755cc3_0_cpython | conda-forge |
| python-dateutil | 2.8.2 | pypi_0 | pypi |
| python-graphviz | 0.20.1 | pypi_0 | pypi |
| pytz | 2023.3 | pypi_0 | pypi |
| readline | 8.2 | h8228510_1 | conda-forge |
| reportlab | 3.6.12 | pypi_0 | pypi |
| requests | 2.30.0 | pypi_0 | pypi |
| scikit-learn | 1.2.2 | pypi_0 | pypi |
| scipy | 1.10.1 | pypi_0 | pypi |
| seaborn | 0.12.2 | pypi_0 | pypi |
| setuptools | 67.6.1 | pyhd8ed1ab_0 | conda-forge |
| six | 1.16.0 | pypi_0 | pypi |
| split-fasta | 1.0.0 | pypi_0 | pypi |
| sqlite | 3.40.0 | h4ff8645_0 | conda-forge |
| threadpoolctl | 3.1.0 | pypi_0 | pypi |
| tk | 8.6.12 | h27826a3_0 | conda-forge |
| tornado | 6.3.3 | pypi_0 | pypi |
| tqdm | 4.65.0 | pypi_0 | pypi |
| typing-extensions | 4.5.0 | pypi_0 | pypi |
| tzdata | 2023.3 | pypi_0 | pypi |
| urllib3 | 2.0.2 | pypi_0 | pypi |
| wheel | 0.40.0 | pyhd8ed1ab_0 | conda-forge |
| xz | 5.2.6 | h166bdaf_0 | conda-forge |

conda list --revisions

2023-04-14 14:06:31 (rev 0)
+_libgcc_mutex-0.1 (conda-forge/linux-64)
+_openmp_mutex-4.5 (conda-forge/linux-64)
+bzip2-1.0.8 (conda-forge/linux-64)
+ca-certificates-2022.12.7 (conda-forge/linux-64)
+ld_impl_linux-64-2.40 (conda-forge/linux-64)
+libexpat-2.5.0 (conda-forge/linux-64)
+libffi-3.4.2 (conda-forge/linux-64)
+libgcc-ng-12.2.0 (conda-forge/linux-64)
+libgomp-12.2.0 (conda-forge/linux-64)
+libnsl-2.0.0 (conda-forge/linux-64)
+libsqlite-3.40.0 (conda-forge/linux-64)
+libuuid-2.38.1 (conda-forge/linux-64)
+libzlib-1.2.13 (conda-forge/linux-64)
+ncurses-6.3 (conda-forge/linux-64)
+openssl-3.1.0 (conda-forge/linux-64)
+pip-23.0.1 (conda-forge/noarch)
+python-3.11.3 (conda-forge/linux-64)
+readline-8.2 (conda-forge/linux-64)
+setuptools-67.6.1 (conda-forge/noarch)
+tk-8.6.12 (conda-forge/linux-64)
+tzdata-2023c (conda-forge/noarch)
+wheel-0.40.0 (conda-forge/noarch)
+xz-5.2.6 (conda-forge/linux-64)

2023-04-14 14:07:23 (rev 1)
+sqlite-3.40.0 (conda-forge/linux-64)

Server information

CPU: `grep -m 1 "^model name" /proc/cpuinfo` reports `model name : AMD EPYC 7543 32-Core Processor`; `grep -c "^processor" /proc/cpuinfo` reports `8`.

RAM (`grep "^MemTotal" /proc/meminfo`): I used 96 threads and 1.5 TB of memory for running this job.

`gtdbtk.log` has been included: gtdbtk.log

My aim is to create a tree like the one below (source: Bandla et al., 2020).

pchaumeil commented 10 months ago

Hello,

GTDB-Tk does not accept duplicates in the taxonomy files. If the bin IDs are present in both the custom taxonomy file and the classify_wf summary file, the software does not know which taxonomy string to select.

Removing the duplicated bins from the custom taxonomy file will not remove them from the tree as they are still in the classify_wf summary file. You will still have them in the final tree.
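One way to see which bins are listed in both files before running de_novo_wf is a short script like the following. This is only a sketch under assumptions: it assumes the classify_wf summary is tab-separated with a `user_genome` column, and that the custom taxonomy file has two tab-separated columns (genome ID, taxonomy string). The helper names and the in-memory sample data are hypothetical and only illustrate the check.

```python
import csv
import io


def read_summary_ids(handle):
    """Genome IDs from a classify_wf summary; assumes a 'user_genome' column."""
    return {row["user_genome"] for row in csv.DictReader(handle, delimiter="\t")}


def read_custom_taxonomy(handle):
    """Map genome ID to taxonomy string; assumes two tab-separated columns."""
    return {
        row[0]: row[1]
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 2
    }


def shared_genomes(summary_ids, custom_taxonomy):
    """Sorted genome IDs that appear in both inputs."""
    return sorted(summary_ids & set(custom_taxonomy))


# Hypothetical in-memory stand-ins for the two files (illustrative taxonomy strings).
summary = io.StringIO(
    "user_genome\tclassification\n"
    "bin.18\td__Bacteria;p__Chloroflexota\n"
    "bin.7\td__Bacteria;p__Pseudomonadota\n"
)
custom = io.StringIO("bin.18\td__Bacteria;p__Chloroflexota\n")

duplicates = shared_genomes(read_summary_ids(summary), read_custom_taxonomy(custom))
print(duplicates)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")`. Any ID the script prints needs to be kept in only one of the two files.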

jojyjohn28 commented 10 months ago

Thank you so much for the reply.

I removed the classify_wf summary file and used only the custom taxonomy file, but now I end up with another error, shown below.

================================================================================
EXCEPTION: TypeError
MESSAGE: Population must be a sequence. For dicts or sets, use sorted(d).

Traceback (most recent call last):
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/main.py", line 101, in main
    gt_parser.parse_options(args)
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/main.py", line 1051, in parse_options
    self.root(options)
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/main.py", line 776, in root
    reports = reroot.root_with_outgroup(options.input_tree,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/site-packages/gtdbtk/reroot_tree.py", line 83, in root_with_outgroup
    rnd_ingroup = random.sample(ingroup_leaves, 1)[0]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/zfs/gcl/software/gbf/anaconda3/2021.11/envs/py311/lib/python3.11/random.py", line 439, in sample
    raise TypeError("Population must be a sequence. "
TypeError: Population must be a sequence. For dicts or sets, use sorted(d).

I am really new to GTDB-Tk. Can you please help me with this? I have already updated the aligner too. I am working on HPC.

Thank you in advance,

pchaumeil commented 10 months ago

Hello,

Do you have the same issue when running GTDB-Tk with Python 3.8 instead of 3.11? I suspect there is a change in the random library in Python 3.11 that causes this problem. GTDB-Tk hasn't been tested with the latest Python releases, but we will modify the code to deal with these recent changes.

hunglin59638 commented 10 months ago

Yes, it is caused by the Python version.

Python 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> ingroup_leaves = {"a", "b"}
>>> random.sample(ingroup_leaves, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/XXX/mambaforge/envs/XXX/lib/python3.11/random.py", line 436, in sample
    raise TypeError("Population must be a sequence.  "
TypeError: Population must be a sequence.  For dicts or sets, use sorted(d).

Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> ingroup_leaves = {"a", "b"}
>>> random.sample(ingroup_leaves, 1)
<stdin>:1: DeprecationWarning: Sampling from a set deprecated
since Python 3.9 and will be removed in a subsequent version.
['a']
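The usual way around this on Python 3.11 and later is to turn the set into a sequence before sampling, as the error message itself suggests. The following is a minimal sketch of that idea, not necessarily the change made in GTDB-Tk; `pick_random_leaf` is a hypothetical helper name.

```python
import random


def pick_random_leaf(leaves):
    """Pick one element from a set by sampling from a sorted list of it."""
    # sorted() returns a list, which random.sample accepts on Python 3.11+.
    return random.sample(sorted(leaves), 1)[0]


ingroup_leaves = {"a", "b"}
chosen = pick_random_leaf(ingroup_leaves)
print(chosen in ingroup_leaves)
```

`sorted()` needs elements that can be compared with each other; for objects that cannot be ordered, `list(leaves)` gives a sequence as well, at the cost of a non-deterministic order.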

jojyjohn28 commented 10 months ago

I think the problem is associated with the Python version, as both of you mentioned. I could not change the Python version because I am working on HPC. Instead, I created another environment, installed the latest version of GTDB-Tk and the aligner (MSA), and ran the same gtdbtk de_novo_wf command without the classification summary. Everything worked fine. At the end I converted the resulting tree to iTOL format.

Yay!!! I was able to make a tree very similar to the one I cited in my first question.

Thank you very much for this wonderful tool and timely response and help...

Thank you once again team..

Ecogenomics / GTDBTk

Error occurred using gtdbtk de_novo_wf for bins #559
