KoslickiLab / YACHT

A mathematically characterized hypothesis test for organism presence/absence in a metagenome
MIT License
28 stars 7 forks source link

Training doesn't like sig files #67

Open dkoslicki opened 7 months ago

dkoslicki commented 7 months ago

Traceback (most recent call last): File "/scratch/temp/Yacht_issue_12/YACHT/make_training_data_from_sketches.py", line 44, in raise ValueError(f"Reference database file {ref_file} is not a zip file. Please a Sourmash signature database file with Zipfile format.") ValueError: Reference database file /scratch/temp/Yacht_issue_12/97_Silva_111_rep_set_euk.fasta.sig is not a zip file. Please a Sourmash signature database file with Zipfile format.

bioinfwithjudith commented 6 months ago

@dkoslicki

The title here is a bit confusing to the error. YACHT does not like zip files or does not like sig files?

I was able to reproduce this error and fixed the condition on the type of file that is being accepted, so now a .sig.zip and .sig files can be accepted.

However, now I get the following error where the file will be extracted so a zipfile is expected.

Is this necessary in YACHT? Unsure how to move on from here.

2024-01-13 13:10:26 - INFO - Checking reference database file
2024-01-13 13:10:26 - INFO - Creating a temporary directory
2024-01-13 13:10:26 - INFO - Unzipping the sourmash signature file to the temporary directory
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
    args.func(args)
  File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 61, in main
    with zipfile.ZipFile(ref_file, 'r') as sourmash_db:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/zipfile/__init__.py", line 1338, in __init__
    self._RealGetContents()
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/zipfile/__init__.py", line 1405, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
dkoslicki commented 6 months ago

TODO:

  1. Obtain a .sig file, and see if the code will run on it (commenting out any hard coded "check for .zip ending" logic)
  2. If the code will run on a *.sig file, instead have a branching logic that when given a file:
    • If not a correct file extension (check with Shaopeng what the expected file extensions are), throw and error
    • If zip file, then unzip and proceed with rest of code
    • if non-zip file, then just proceed with the rest of the code
bioinfwithjudith commented 5 months ago

I used the demo files to test.. I am running into two issues:

(1) signature directory is not being created when running a sig file (2) multisearch issue when running yacht train on sig.sip file.

Could someone check if this works on their end? I'll just start from a blank slate if it is not an issue for others.

Issue 1

sketch ref genomes for a sig file

sourmash sketch dna ./demo/ref_genomes/* -p k=31,scaled=1000,abund -o ./demo/ref.sig

Running yacht train on a sig file

yacht train --ref_file ./demo/ref.sig --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'demo_ref_sig_test' --outdir ./demo/ --force

Returns an error where the signature directory is not being created.

prefix 'bug_YAC-98_edit_sig' --outdir ./tests/testdata/ --force
2024-02-01 12:04:10 - INFO - Checking reference database file
.sig
2024-02-01 12:04:10 - INFO - Creating a temporary directory
2024-02-01 12:04:10 - INFO - Extracting signature information
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
    args.func(args)
  File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 68, in main
    sig_info_dict = utils.collect_signature_info(num_threads, ksize, path_to_temp_dir)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/utils.py", line 92, in collect_signature_info
    signatures = p.starmap(get_info_from_single_sig, [(os.path.join(path_to_temp_dir, 'signatures', file), ksize) for file in os.listdir(os.path.join(path_to_temp_dir, 'signatures'))])
                                                                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98_edit_sig_intermediate_files/signatures'

In YACHT, when a sig.zip file is being trained, the signatures are extracted into a directory called path_to_temp_dir /signatures/ (line 95 utils.py)

Currently, YACHT is unable to extract from a sig file, but it still looks for the signature directory anyway using utils.collect_signature_info (line 66 make_training_from_sketches.py)

Does a .sig file need to extract its signatures? If not, should there be a condition for when it is a .sig file for utils.collect_signature_info?

I'm unsure what type of condition that would be because if the signature file has a single signature, do we tell yacht to NOT collect the signatures from path_to_temp_dir

Issue 2

I tested the yacht train on a sig.zip file

YACHT:$ yacht train --ref_file ./demo/ref.sig.zip --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'demo_ref_test' --outdir ./demo/ --force

Works fine for the ref.sig.zip until it reaches Running sourmash multisearch

2024-02-01 12:13:03 - INFO - Checking reference database file
.zip
2024-02-01 12:13:03 - INFO - Creating a temporary directory
2024-02-01 12:13:03 - INFO - Unzipping the sourmash signature file to the temporary directory
2024-02-01 12:13:03 - INFO - Extracting signature information
100%|████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 128397.06it/s]
2024-02-01 12:13:04 - INFO - Checking if all signatures have the same scaled
2024-02-01 12:13:04 - INFO - Finding the closely related genomes with ANI > ani_thresh from the reference database
2024-02-01 12:13:04 - INFO - Running sourmash multisearch with command: sourmash scripts multisearch /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_sig_files.txt /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_sig_files.txt -k 31 -s 1000 -c 64 -t 0.2039068257457904 -o /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_multisearch_result.csv
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/__main__.py", line 10, in main
    args = sourmash.cli.parse_args(arglist)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/__init__.py", line 160, in parse_args
    return get_parser().parse_args(arglist)
           ^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/__init__.py", line 141, in get_parser
    getattr(sys.modules[__name__], op).subparser(sub)
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/scripts/__init__.py", line 48, in subparser
    _extension_dict.update(sourmash.plugins.add_cli_scripts(s))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/plugins.py", line 164, in add_cli_scripts
    subparser = parser.add_parser(script_cls.command,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/argparse.py", line 1214, in add_parser
    raise ArgumentError(self, _('conflicting subparser: %s') % name)
argparse.ArgumentError: argument subcmd: conflicting subparser: manysearch
Traceback (most recent call last):
  File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
    args.func(args)
  File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 78, in main
    multisearch_result = utils.run_multisearch(num_threads, ani_thresh, ksize, scale, path_to_temp_dir)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/jzr5814/repositories/YACHT/yacht/utils.py", line 119, in run_multisearch
    raise ValueError(f"Error running sourmash multisearch with command: {cmd}")
ValueError: Error running sourmash multisearch with command: sourmash scripts multisearch /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_sig_files.txt /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_sig_files.txt -k 31 -s 1000 -c 64 -t 0.2039068257457904 -o /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_multisearch_result.csv

It looks like utils. run_multisearch is not printing the cmd that it is trying to run

mfl15 commented 5 months ago

refactor collect_signatures_info to work on these other file ending types: .sig, .lca, .sqldb, etc.

chunyuma commented 5 months ago

Hi @dkoslicki,

This seems not a minor issue. To make YACHT handle different formats, many yacht and test code need to be modified. If this isn't urgent, I will address it when I am available later.

For this to-do list, @dkoslicki do you want a .sig file contain a single genome or multiple genomes, or both are accepted? I am not sure if multisearch can handle different all formats (e.g., .sig, .sig.zip, .lca, .sbt, *.sqldb) that @ShaopengLiu1 investigated previously. But we can investigate it.

Currently, the accepted .zip needs to have two files signatures and SOURMASH-MANIFEST.csv. So, @jsrdrgz, you can't just simply zip the .sig files, it will throw an error as you showed.

dkoslicki commented 5 months ago

@chunyuma let's take the practical approach: if multisearch (and other YACHT components) already plays nicely with other formats (.sig.zib, .sbt, etc.) then we should allow it. If it's complicated/non-trivial to make this happen for other file formats, let's just note this and say that file type is not supported.

Regarding do you want a .sig file containing a single genome or multiple genome, or both are accepted, this depends on the YACHT command used. When training, we expect multiple genomes (but 1 is ok). When running the inference step of YACHT, it expects to have: 1. the (multiple genome) reference and 2. a single sketch of the sample (i.e. the metagenome FASTA file has been sketched to a single signature with a single k-mer size).

Re: the signatures and SOURMASH-MANIFEST.csv requirements, perhaps more informative error messages as well as description in the documentation can help avoid others experiencing the same issue Judith did (can't just zip the sig files). Eg. tell the user these are missing and/or that they can create them using sourmash sig manifest.

So long and short of it, no need to do significant refactoring, and not super high priority.

chunyuma commented 5 months ago

Hi @dkoslicki, thanks for the response.

Regading do you want a .sig file containing a single genome or multiple genome, or both are accepted, do you think the condition that a .sig file contains multiple genome sketches is acceptable? @ShaopengLiu1 told me that this can occur if we don't set this constraint.

dkoslicki commented 5 months ago

@chunyuma Here again, it depends on if you are talking about the reference or the sample. If the sample, then the constraint you mentioned should be there: a sample is expected to be a single sketch (if given two with different k-sizes, how would YACHT know which to choose?). If you are talking about a reference, then you would not want the constraint since it's expected that the reference have multiple sketches, one for each reference genome (though perhaps only 1 reference genome in rare circumstances)