KoslickiLab / YACHT

A mathematically characterized hypothesis test for organism presence/absence in a metagenome
MIT License

Error working with GTDB database #103

Status: Closed (OliverBryan closed 6 months ago)

OliverBryan commented 6 months ago

Using https://frl.publisso.de/data/frl:6425521/marine/short_read/marmgCAMI2_sample_0_reads.tar.gz as the sample data, both yacht train and yacht run fail with a similar error when used with the GTDB database obtained through yacht download.

First I sketched the sample:

yacht sketch sample --infile ./sample/anonymous_reads.fq --kmer 31 --scaled 1000 --outfile sample.sig.zip

I then tried the pretrained GTDB database, which I downloaded with:

yacht download pretrained_ref_db --database gtdb --db_version rs214 --k 31 --ani_thresh 0.9995 --outfolder ./

After unzipping it and running yacht run, I got the following error:

~/YACHT/testing$ yacht run --json ./gtdb-rs214-reps.k31_0.9995_pretrained/gtdb-rs214-reps.k31_0.9995_config.json --sample_file sample.sig.zip --significance 0.99 --num_threads 32 --min_coverage_list 1 0.6 0.2 0.1 --out ./result.xlsx
2024-02-13 11:53:02 - INFO - Loading the manifest file generated from the training data.
2024-02-13 11:53:02 - INFO - Loading sample signature and its signature info.
2024-02-13 11:53:14 - INFO - Computing hypothesis recovery.
2024-02-13 11:53:14 - INFO - Removing existing temporary directory: /home/oliverbryan/YACHT/testing/sample_sample_intermediate_files
2024-02-13 11:53:14 - INFO - Unzipping the sample signature zip file
2024-02-13 11:53:15 - INFO - Running sourmash multisearch with command: sourmash scripts multisearch /home/oliverbryan/YACHT/testing/sample_sample_intermediate_files/sample_sig_file.txt /home/oliverbryan/YACHT/testing/sample_sample_intermediate_files/organism_sig_file.txt -s 1000 -k 31 -c 32 -t 0 -o /home/oliverbryan/YACHT/testing/sample_sample_intermediate_files/sample_multisearch_result.csv

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.8.6; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 1000 / moltype: DNA / threshold: 0.0
warning: only 12 threads available, using 12
searching all sketches in '/home/oliverbryan/YACHT/testing/sample_sample_intermediate_files/sample_sig_file.txt' against '/home/oliverbryan/YACHT/testing/sample_sample_intermediate_files/organism_sig_file.txt' using 12 threads
Reading list of query paths from: '/home/oliverbryan/YACHT/testing/sample_sample_intermediate_files/sample_sig_file.txt'
Loaded 1 query signature(s)
Reading list of search paths from: '/home/oliverbryan/YACHT/testing/sample_sample_intermediate_files/organism_sig_file.txt'
Loaded 85205 search signature(s)
Processed 0 comparisons
DONE. Processed 85205 comparisons
...multisearch is done! results in '/home/oliverbryan/YACHT/testing/sample_sample_intermediate_files/sample_multisearch_result.csv'
  0%|          | 0/54450 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/oliverbryan/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/__init__.py", line 67, in main
    args.func(args)
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/run_YACHT.py", line 120, in main
    manifest_list = hr.hypothesis_recovery(manifest, sample_info_set, path_to_genome_temp_dir, min_coverage_list, scale, ksize, significance, ani_thresh, num_threads)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/hypothesis_recovery_src.py", line 274, in hypothesis_recovery
    exclusive_hashes_info, manifest = get_exclusive_hashes(manifest, nontrivial_organism_names, sample_sig, ksize, path_to_genome_temp_dir)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/hypothesis_recovery_src.py", line 109, in get_exclusive_hashes
    sig = load_signature_with_ksize(os.path.join(path_to_genome_temp_dir, 'signatures', md5sum+SIG_SUFFIX), ksize)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/utils.py", line 43, in load_signature_with_ksize
    if math.isnan(sketches[0].minhash.mean_abundance):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: must be real number, not NoneType
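
The last frame shows math.isnan being handed None rather than a number. Below is a minimal sketch of that failure and one possible guard. It assumes, without confirmation from the YACHT source, that mean_abundance comes back as None for sketches that carry no abundance information. The helper names are hypothetical and are not part of YACHT or sourmash.

```python
import math


def abundance_is_missing(mean_abundance):
    """Hypothetical helper: True when the value is None or NaN."""
    # Test for None first, because math.isnan(None) raises TypeError.
    return mean_abundance is None or math.isnan(mean_abundance)


def isnan_raises_type_error(value):
    """Return True if calling math.isnan on the value raises TypeError."""
    try:
        math.isnan(value)
    except TypeError:
        return True
    return False
```

Under this assumption, replacing the bare math.isnan check with a None-aware test like abundance_is_missing would avoid the TypeError. Whether a missing abundance should then be treated the same way as NaN is a separate decision for the maintainers.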

In an attempt to fix this, I tried to train the GTDB database myself. I downloaded the reference database with:

yacht download default_ref_db --database gtdb --db_version rs214 --gtdb_type reps --k 31 --outfolder ./ref

and then ran:

yacht train --ref_file ./ref/gtdb-rs214-reps.k31.zip --ksize 31 --num_threads 32 --ani_thresh 0.95 --prefix 'gtdb_ani_thresh_0.95' --outdir ./

which produced the following error:

2024-02-13 12:50:17 - INFO - Checking reference database file
2024-02-13 12:50:17 - INFO - Creating a temporary directory
2024-02-13 12:50:17 - INFO - Unzipping the sourmash signature file to the temporary directory
2024-02-13 12:50:27 - INFO - Extracting signature information
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/utils.py", line 82, in get_info_from_single_sig
    sig = load_signature_with_ksize(sig_file, ksize)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/utils.py", line 43, in load_signature_with_ksize
    if math.isnan(sketches[0].minhash.mean_abundance):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: must be real number, not NoneType
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/oliverbryan/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/__init__.py", line 67, in main
    args.func(args)
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/make_training_data_from_sketches.py", line 66, in main
    sig_info_dict = utils.collect_signature_info(num_threads, ksize, path_to_temp_dir)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/yacht/utils.py", line 95, in collect_signature_info
    signatures = p.starmap(get_info_from_single_sig, [(os.path.join(path_to_temp_dir, 'signatures', file), ksize) for file in os.listdir(os.path.join(path_to_temp_dir, 'signatures'))])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
TypeError: must be real number, not NoneType

dkoslicki commented 6 months ago

@mfl15 do you mind helping @jsrdrgz out with this? From a Slack conversation, it looks like this might be a bit nuanced to fix (the previous fix for a bug introduced a new one), and it definitely points to a missing integration test somewhere.

mfl15 commented 6 months ago

@dkoslicki sure, no problems