KoslickiLab / YACHT

A mathematically characterized hypothesis test for organism presence/absence in a metagenome
MIT License

bug YAC-98 accepts .sig file now #92

Closed bioinfwithjudith closed 6 months ago

bioinfwithjudith commented 7 months ago

I was able to reproduce this error and fixed the condition on the accepted file type, so both .sig.zip and .sig files are now accepted.

However, I now get the following error: the code still tries to extract the reference file, so it expects a zip file.

Is this extraction step necessary in YACHT? I'm unsure how to proceed from here.

2024-01-13 13:10:26 - INFO - Checking reference database file
2024-01-13 13:10:26 - INFO - Creating a temporary directory
2024-01-13 13:10:26 - INFO - Unzipping the sourmash signature file to the temporary directory
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
    args.func(args)
  File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 61, in main
    with zipfile.ZipFile(ref_file, 'r') as sourmash_db:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/zipfile/__init__.py", line 1338, in __init__
    self._RealGetContents()
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/zipfile/__init__.py", line 1405, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
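
The traceback above comes from `zipfile.ZipFile` being opened on a plain `.sig` file. One possible way to decide whether to unzip is to inspect the file content first. This is only a minimal sketch: `classify_reference` is a hypothetical helper, not a function in YACHT, and the set of accepted formats is an assumption based on this thread.

```python
import zipfile
from pathlib import Path


def classify_reference(ref_file: str) -> str:
    """Return "zip" for a zip archive and "sig" for a plain signature file.

    Hypothetical helper (not part of YACHT). It checks the file content
    with zipfile.is_zipfile instead of trusting the extension alone.
    """
    path = Path(ref_file)
    if not path.is_file():
        raise FileNotFoundError(f"Reference file not found: {ref_file}")
    if zipfile.is_zipfile(path):
        return "zip"
    if path.name.endswith(".sig"):
        return "sig"
    raise ValueError(f"Unsupported reference file: {ref_file}")
```

With a check like this, the `zipfile.ZipFile(ref_file, 'r')` call would only run when the result is `"zip"`, and a plain `.sig` file would take a separate code path.
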
codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (15e943a) 80.30% compared to head (0d02661) 75.75%.

:exclamation: Current head 0d02661 differs from pull request most recent head 920e242. Consider uploading reports for the commit 920e242 to get more accurate results

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #92      +/-   ##
==========================================
- Coverage   80.30%   75.75%   -4.56%
==========================================
  Files          21       15       -6
  Lines        1488     1332     -156
==========================================
- Hits         1195     1009     -186
- Misses        293      323      +30
```


dkoslicki commented 7 months ago

Quick question @jsrdrgz, did you test this on .sig, .lca, .sbt, etc. to see if it behaves as intended? I recall @ShaopengLiu1 mentioning that LCA isn't a suitable database format for YACHT.

sonarcloud[bot] commented 7 months ago

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code


bioinfwithjudith commented 7 months ago

@dkoslicki

Apologies, I should have tested each one. 😅

Testing the different file formats, I ran into the following issues:

With a .sig file, the signatures directory is never created:

(yacht_env) hey there, jzr5814! YACHT:$ yacht train --ref_file /data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98.sig --ksize 31 --num_threads 32 --ani_thresh 0.95 --prefix 'bug_YAC-98_sig_file' --outdir ./
2024-01-23 10:53:34 - INFO - Checking reference database file
.sig
2024-01-23 10:53:34 - INFO - Creating a temporary directory
2024-01-23 10:53:34 - INFO - Extracting signature information
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
    args.func(args)
  File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 68, in main
    sig_info_dict = utils.collect_signature_info(num_threads, ksize, path_to_temp_dir)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/utils.py", line 92, in collect_signature_info
    signatures = p.starmap(get_info_from_single_sig, [(os.path.join(path_to_temp_dir, 'signatures', file), ksize) for file in os.listdir(os.path.join(path_to_temp_dir, 'signatures'))])
                                                                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/jzr5814/repositories/YACHT/bug_YAC-98_sig_file_intermediate_files/signatures'
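
The `FileNotFoundError` above shows that `collect_signature_info` lists a `signatures` folder inside the temporary directory, which only exists after a zip archive has been extracted. A minimal sketch of one way to give a plain `.sig` file the same layout follows. `stage_reference` is a hypothetical helper, not YACHT's actual code, and it assumes that zip inputs already contain a `signatures/` folder; whether later steps need anything else from the archive is not covered here.

```python
import shutil
import zipfile
from pathlib import Path


def stage_reference(ref_file: str, path_to_temp_dir: str) -> Path:
    """Ensure <path_to_temp_dir>/signatures exists and holds the input.

    Hypothetical helper (not part of YACHT). Zip archives are extracted
    as-is; a plain .sig file is copied into the signatures/ folder.
    """
    temp_dir = Path(path_to_temp_dir)
    sig_dir = temp_dir / "signatures"
    if zipfile.is_zipfile(ref_file):
        # Assumption: the archive already stores its sketches under signatures/.
        with zipfile.ZipFile(ref_file, "r") as sourmash_db:
            sourmash_db.extractall(temp_dir)
        sig_dir.mkdir(parents=True, exist_ok=True)
    else:
        sig_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(ref_file, sig_dir / Path(ref_file).name)
    return sig_dir
```

After a call such as `stage_reference(ref_file, path_to_temp_dir)`, listing `signatures/` would no longer fail for a `.sig` input.
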

With a .sig.zip file, sourmash multisearch fails to run:

(yacht_env) hey there, jzr5814! YACHT:$ yacht train --ref_file /data/jzr5814/repositories/YACHT/tests/testdata/sample.sig.zip --ksize 31 --num_threads 32 --ani_thresh 0.95 --prefix 'bug_YAC-98_sample_sig_zip_file' --outdir /data/jzr5814/repositories/YACHT/tests/testdata/
2024-01-23 10:56:23 - INFO - Checking reference database file
.zip
2024-01-23 10:56:23 - INFO - Creating a temporary directory
2024-01-23 10:56:23 - INFO - Unzipping the sourmash signature file to the temporary directory
2024-01-23 10:56:23 - INFO - Extracting signature information
100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11554.56it/s]
2024-01-23 10:56:24 - INFO - Checking if all signatures have the same scaled
2024-01-23 10:56:24 - INFO - Finding the closely related genomes with ANI > ani_thresh from the reference database
2024-01-23 10:56:24 - INFO - Running sourmash multisearch with command: sourmash scripts multisearch /data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98_sample_sig_zip_file_intermediate_files/training_sig_files.txt /data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98_sample_sig_zip_file_intermediate_files/training_sig_files.txt -k 31 -s 1000 -c 32 -t 0.2039068257457904 -o /data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98_sample_sig_zip_file_intermediate_files/training_multisearch_result.csv
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/__main__.py", line 10, in main
    args = sourmash.cli.parse_args(arglist)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/__init__.py", line 160, in parse_args
    return get_parser().parse_args(arglist)
           ^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/__init__.py", line 141, in get_parser
    getattr(sys.modules[__name__], op).subparser(sub)
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/scripts/__init__.py", line 48, in subparser
    _extension_dict.update(sourmash.plugins.add_cli_scripts(s))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/plugins.py", line 164, in add_cli_scripts
    subparser = parser.add_parser(script_cls.command,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/argparse.py", line 1214, in add_parser
    raise ArgumentError(self, _('conflicting subparser: %s') % name)
argparse.ArgumentError: argument subcmd: conflicting subparser: manysearch
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
    args.func(args)
  File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 78, in main
    multisearch_result = utils.run_multisearch(num_threads, ani_thresh, ksize, scale, path_to_temp_dir)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/utils.py", line 119, in run_multisearch
    raise ValueError(f"Error running sourmash multisearch with command: {cmd}")
ValueError: Error running sourmash multisearch with command: sourmash scripts multisearch /data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98_sample_sig_zip_file_intermediate_files/training_sig_files.txt /data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98_sample_sig_zip_file_intermediate_files/training_sig_files.txt -k 31 -s 1000 -c 32 -t 0.2039068257457904 -o /data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98_sample_sig_zip_file_intermediate_files/training_multisearch_result.csv

I also tried to sketch a fresh .sig.zip file, but that fails as well:

(yacht_env) hey there, jzr5814! testdata:$ sourmash sketch dna -f -p k=31,scaled=1000,abund -o bug_YAC-98.sig.zip bug_YAC-98.fastn 
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/__main__.py", line 10, in main
    args = sourmash.cli.parse_args(arglist)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/__init__.py", line 160, in parse_args
    return get_parser().parse_args(arglist)
           ^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/__init__.py", line 141, in get_parser
    getattr(sys.modules[__name__], op).subparser(sub)
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/scripts/__init__.py", line 48, in subparser
    _extension_dict.update(sourmash.plugins.add_cli_scripts(s))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/plugins.py", line 164, in add_cli_scripts
    subparser = parser.add_parser(script_cls.command,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/argparse.py", line 1214, in add_parser
    raise ArgumentError(self, _('conflicting subparser: %s') % name)
argparse.ArgumentError: argument subcmd: conflicting subparser: manysearch
dkoslicki commented 7 months ago

Oof, well, something is definitely wrong then, since we know from the tests that .sig and .sig.zip files should work. First, make sure those files worked with the code before your changes. There's no point in testing them on your branch if they aren't already confirmed to work on the stable, working branch.

ShaopengLiu1 commented 7 months ago

@jsrdrgz FYI, the sourmash Python API supports all data formats (including LCA). However, in YACHT we specifically ask for single-sketch input. Check this command (used to load the sample signature): https://github.com/KoslickiLab/YACHT/blob/6b8b7c889acd4de91843e8906f684b181f0864c2/srcs/utils.py#L14

Therefore, David suggests using only .sig or .sig.zip for sample signatures.
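
To illustrate the single-sketch requirement, here is a minimal sketch that counts the sketches stored in a plain signature file using only the standard library. It is not the loader YACHT uses. `count_sketches` is a hypothetical helper, and it assumes the file is a JSON list of records, each with a `signatures` list whose items carry a `ksize` field (optionally gzip-compressed).

```python
import gzip
import json


def count_sketches(sig_path, ksize=None):
    """Count sketches in a JSON signature file, optionally for one k-mer size.

    Hypothetical helper; the JSON layout described above is an assumption.
    """
    with open(sig_path, "rb") as handle:
        raw = handle.read()
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
        raw = gzip.decompress(raw)
    records = json.loads(raw)
    return sum(
        1
        for record in records
        for sketch in record.get("signatures", [])
        if ksize is None or sketch.get("ksize") == ksize
    )
```

A sample file would then pass a single-sketch check when `count_sketches(path, ksize=31) == 1`.
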

mfl15 commented 6 months ago

Hello @jsrdrgz, could you please tell us the status of this task? It looks like the pipeline is failing.