Open dkoslicki opened 7 months ago
@dkoslicki
The title here is a bit confusing to the error. YACHT does not like zip files or does not like sig files?
I was able to reproduce this error and fixed the condition on the type of file that is being accepted, so now a .sig.zip and .sig files can be accepted.
However, now I get the following error where the file will be extracted so a zipfile is expected.
Is this necessary in YACHT? Unsure how to move on from here.
2024-01-13 13:10:26 - INFO - Checking reference database file
2024-01-13 13:10:26 - INFO - Creating a temporary directory
2024-01-13 13:10:26 - INFO - Unzipping the sourmash signature file to the temporary directory
Traceback (most recent call last):
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
sys.exit(main())
^^^^^^
File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
args.func(args)
File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 61, in main
with zipfile.ZipFile(ref_file, 'r') as sourmash_db:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/zipfile/__init__.py", line 1338, in __init__
self._RealGetContents()
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/zipfile/__init__.py", line 1405, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
TODO:
I used the demo files to test.. I am running into two issues:
(1) signature directory is not being created when running a sig file (2) multisearch issue when running yacht train on sig.sip file.
Could someone check if this works on their end? I'll just start from a blank slate if it is not an issue for others.
Issue 1
sketch ref genomes for a sig file
sourmash sketch dna ./demo/ref_genomes/* -p k=31,scaled=1000,abund -o ./demo/ref.sig
Running yacht train on a sig file
yacht train --ref_file ./demo/ref.sig --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'demo_ref_sig_test' --outdir ./demo/ --force
Returns an error where the signature directory is not being created.
prefix 'bug_YAC-98_edit_sig' --outdir ./tests/testdata/ --force
2024-02-01 12:04:10 - INFO - Checking reference database file
.sig
2024-02-01 12:04:10 - INFO - Creating a temporary directory
2024-02-01 12:04:10 - INFO - Extracting signature information
Traceback (most recent call last):
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
sys.exit(main())
^^^^^^
File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
args.func(args)
File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 68, in main
sig_info_dict = utils.collect_signature_info(num_threads, ksize, path_to_temp_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/jzr5814/repositories/YACHT/yacht/utils.py", line 92, in collect_signature_info
signatures = p.starmap(get_info_from_single_sig, [(os.path.join(path_to_temp_dir, 'signatures', file), ksize) for file in os.listdir(os.path.join(path_to_temp_dir, 'signatures'))])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/jzr5814/repositories/YACHT/tests/testdata/bug_YAC-98_edit_sig_intermediate_files/signatures'
In YACHT, when a sig.zip file is being trained, the signatures are extracted into a directory called path_to_temp_dir /signatures/
(line 95 utils.py
)
Currently, YACHT is unable to extract from a sig
file, but it still looks for the signature directory anyway using utils.collect_signature_info
(line 66 make_training_from_sketches.py
)
Does a .sig
file need to extract its signatures?
If not, should there be a condition for when it is a .sig
file for utils.collect_signature_info
?
I'm unsure what type of condition that would be because if the signature file has a single signature, do we tell yacht to NOT collect the signatures from path_to_temp_dir
Issue 2
I tested the yacht train on a sig.zip file
YACHT:$ yacht train --ref_file ./demo/ref.sig.zip --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'demo_ref_test' --outdir ./demo/ --force
Works fine for the ref.sig.zip until it reaches Running sourmash multisearch
2024-02-01 12:13:03 - INFO - Checking reference database file
.zip
2024-02-01 12:13:03 - INFO - Creating a temporary directory
2024-02-01 12:13:03 - INFO - Unzipping the sourmash signature file to the temporary directory
2024-02-01 12:13:03 - INFO - Extracting signature information
100%|████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 128397.06it/s]
2024-02-01 12:13:04 - INFO - Checking if all signatures have the same scaled
2024-02-01 12:13:04 - INFO - Finding the closely related genomes with ANI > ani_thresh from the reference database
2024-02-01 12:13:04 - INFO - Running sourmash multisearch with command: sourmash scripts multisearch /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_sig_files.txt /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_sig_files.txt -k 31 -s 1000 -c 64 -t 0.2039068257457904 -o /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_multisearch_result.csv
Traceback (most recent call last):
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/sourmash", line 11, in <module>
sys.exit(main())
^^^^^^
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/__main__.py", line 10, in main
args = sourmash.cli.parse_args(arglist)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/__init__.py", line 160, in parse_args
return get_parser().parse_args(arglist)
^^^^^^^^^^^^
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/__init__.py", line 141, in get_parser
getattr(sys.modules[__name__], op).subparser(sub)
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/cli/scripts/__init__.py", line 48, in subparser
_extension_dict.update(sourmash.plugins.add_cli_scripts(s))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/site-packages/sourmash/plugins.py", line 164, in add_cli_scripts
subparser = parser.add_parser(script_cls.command,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/lib/python3.12/argparse.py", line 1214, in add_parser
raise ArgumentError(self, _('conflicting subparser: %s') % name)
argparse.ArgumentError: argument subcmd: conflicting subparser: manysearch
Traceback (most recent call last):
File "/home/grads/jzr5814/miniconda3/envs/yacht_env/bin/yacht", line 8, in <module>
sys.exit(main())
^^^^^^
File "/data/jzr5814/repositories/YACHT/yacht/__init__.py", line 51, in main
args.func(args)
File "/data/jzr5814/repositories/YACHT/yacht/make_training_data_from_sketches.py", line 78, in main
multisearch_result = utils.run_multisearch(num_threads, ani_thresh, ksize, scale, path_to_temp_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/jzr5814/repositories/YACHT/yacht/utils.py", line 119, in run_multisearch
raise ValueError(f"Error running sourmash multisearch with command: {cmd}")
ValueError: Error running sourmash multisearch with command: sourmash scripts multisearch /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_sig_files.txt /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_sig_files.txt -k 31 -s 1000 -c 64 -t 0.2039068257457904 -o /data/jzr5814/repositories/YACHT/demo/demo_ref_test_intermediate_files/training_multisearch_result.csv
It looks like utils. run_multisearch
is not printing the cmd that it is trying to run
refactor collect_signatures_info to work on these other file ending types: .sig, .lca, .sqldb, etc.
Hi @dkoslicki,
This seems not a minor issue. To make YACHT handle different formats, many yacht and test code need to be modified. If this isn't urgent, I will address it when I am available later.
For this to-do list, @dkoslicki do you want a .sig file contain a single genome or multiple genomes, or both are accepted? I am not sure if multisearch
can handle different all formats (e.g., .sig, .sig.zip, .lca, .sbt, *.sqldb) that @ShaopengLiu1 investigated previously. But we can investigate it.
Currently, the accepted .zip
needs to have two files signatures
and SOURMASH-MANIFEST.csv
. So, @jsrdrgz, you can't just simply zip the .sig files, it will throw an error as you showed.
@chunyuma let's take the practical approach: if multisearch
(and other YACHT components) already plays nicely with other formats (.sig.zib, .sbt, etc.) then we should allow it. If it's complicated/non-trivial to make this happen for other file formats, let's just note this and say that file type is not supported.
Regarding do you want a .sig file containing a single genome or multiple genome, or both are accepted
, this depends on the YACHT command used. When training, we expect multiple genomes (but 1 is ok). When running the inference step of YACHT, it expects to have: 1. the (multiple genome) reference and 2. a single sketch of the sample (i.e. the metagenome FASTA file has been sketched to a single signature with a single k-mer size).
Re: the signatures
and SOURMASH-MANIFEST.csv
requirements, perhaps more informative error messages as well as description in the documentation can help avoid others experiencing the same issue Judith did (can't just zip the sig files). Eg. tell the user these are missing and/or that they can create them using sourmash sig manifest
.
So long and short of it, no need to do significant refactoring, and not super high priority.
Hi @dkoslicki, thanks for the response.
Regading do you want a .sig file containing a single genome or multiple genome, or both are accepted
, do you think the condition that a .sig
file contains multiple genome sketches is acceptable? @ShaopengLiu1 told me that this can occur if we don't set this constraint.
@chunyuma Here again, it depends on if you are talking about the reference or the sample. If the sample, then the constraint you mentioned should be there: a sample is expected to be a single sketch (if given two with different k-sizes, how would YACHT know which to choose?). If you are talking about a reference, then you would not want the constraint since it's expected that the reference have multiple sketches, one for each reference genome (though perhaps only 1 reference genome in rare circumstances)
Traceback (most recent call last): File "/scratch/temp/Yacht_issue_12/YACHT/make_training_data_from_sketches.py", line 44, in
raise ValueError(f"Reference database file {ref_file} is not a zip file. Please a Sourmash signature database file with Zipfile format.")
ValueError: Reference database file /scratch/temp/Yacht_issue_12/97_Silva_111_rep_set_euk.fasta.sig is not a zip file. Please a Sourmash signature database file with Zipfile format.