Hello @eam12,
Thank you for sharing the stdout printed by chewBBACA. Based on the stdout, the clustering step seems to indicate that some files might be missing from your schema. In the following part of the stdout:
== Clustering ==
Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 0 distinct minimizers for 3000 loci.
Clustering proteins...
[====================] 100%
Clustered 4802 proteins into 0 clusters.
Unclassified proteins: 4802
The minimizer index should have more than 0 distinct minimizers. This might happen because the short directory in your schema does not contain the FASTA files with the representative sequences, or because those files cannot be listed. Please verify that your schema includes a subdirectory named short and that it contains one FASTA file per locus in the schema. You should also check that your schema includes a subdirectory named pre_computed. If it does, can you please share a list of the file names in that directory?
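For example, something along these lines should be enough to check both directories (a rough sketch; the schema path is a placeholder you will need to adjust):
# placeholder path to the adapted schema
SCHEMA=/path/to/schema
# count the representative FASTA files in the short subdirectory (one file per locus is expected)
ls "$SCHEMA"/short/*.fasta | wc -l
# list the contents of the pre_computed subdirectory
ls "$SCHEMA"/pre_computed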
Let me know what you find.
Rafael
Thank you so much for your prompt reply! I do have a directory named short in my schema, and it contains one FASTA file for each of the 3000 loci (e.g. SC0831_short.fasta). There is also a subdirectory named pre_computed in my schema that contains the following files:
DNAtable1 DNAtable12 DNAtable15 DNAtable18 DNAtable20 DNAtable23 DNAtable26 DNAtable4 DNAtable7
DNAtable10 DNAtable13 DNAtable16 DNAtable19 DNAtable21 DNAtable24 DNAtable27 DNAtable5 DNAtable8
DNAtable11 DNAtable14 DNAtable17 DNAtable2 DNAtable22 DNAtable25 DNAtable3 DNAtable6 DNAtable9
PROTEINtable1 PROTEINtable13 PROTEINtable17 PROTEINtable20 PROTEINtable24 PROTEINtable3 PROTEINtable7
PROTEINtable10 PROTEINtable14 PROTEINtable18 PROTEINtable21 PROTEINtable25 PROTEINtable4 PROTEINtable8
PROTEINtable11 PROTEINtable15 PROTEINtable19 PROTEINtable22 PROTEINtable26 PROTEINtable5 PROTEINtable9
PROTEINtable12 PROTEINtable16 PROTEINtable2 PROTEINtable23 PROTEINtable27 PROTEINtable6
Something I just thought about: are any files within the schema being modified as AlleleCall is running? I am using chewBBACA within a job array, so AlleleCall is running anywhere from 2 to 16 times simultaneously with different datasets. Could that be the issue here?
Darn! I've been trying to incorporate chewBBACA into a pipeline I'm creating, but this will make that hard. Is there any way around this, short of creating duplicate schemas, one for each job in the array? Thanks so much!
Dear @eam12, You may run chewBBACA in such a way that no new alleles are added to the database. In this way the schema will not be altered, which allows running multiple instances at the same time. You may want to have a look at issue #168. I would warn you, though, that your schema will then remain static. I hope this clarifies your questions. Mario
Hi Mario,
Thank you for the advice. I did try running a single AlleleCall job with the --no-inferred flag, with and without the --mode 1 flag, but now I'm getting a similar error message (see below). I'm also getting the error if I run a single job without either flag. I'm unsure what the issue is now. Could the database still be messed up from previous runs that errored out?
2023-04-02 13:29:01: Running chewBBACA...
chewBBACA version: 3.1.2
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt
==========================
chewBBACA - AlleleCall
==========================
Started at: 2023-04-02T13:29:01
Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: None
CPU cores: 12
BLAST path: /home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 50
Number of loci: 3000
Traceback (most recent call last):
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin/chewBBACA.py", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
functions_info[process][1]()
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
func(*args, **kwargs)
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
AlleleCall.main(genome_list, loci_list, args.schema_directory,
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2691, in main
loci_modes = fo.pickle_loader(loci_modes_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/utils/file_operations.py", line 284, in pickle_loader
content = pickle.load(pinfile)
^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
@eam12, it does seem so. We store a number of things in pickle files when running chewBBACA, and the previous errors may have corrupted some of these files. My suggestion would be to start with a new conda environment and a new schema, i.e. a complete fresh start. That way we make sure there are no lingering problems from previous errors. You should then be able to complete a run without problems. Mario
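P.S. A fresh start could look roughly like this, assuming you install chewBBACA from Bioconda and then adapt a newly downloaded copy of the schema (the environment name is a placeholder):
# create and activate a clean environment with chewBBACA
conda create -n chewbbaca_clean -c conda-forge -c bioconda chewbbaca
conda activate chewbbaca_clean
# then re-run PrepExternalSchema on the freshly downloaded schema and repeat the allele calling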
@eam12, to add to what @ramirma has explained about performing allele calling concurrently: you'll need to start with a fresh schema and perform a single allele calling run to create the pre-computed data. You can use a single genome assembly; what matters is generating the pre-computed data that chewBBACA uses to speed up the allele calling. After that, multiple users can concurrently perform allele calling based on the same schema if they pass the --no-inferred parameter. chewBBACA will still identify novel alleles and include them in the final results, but those alleles will not be added to the schema, and the pre-computed files will not be updated. If you ever want to add new alleles to the schema, you'll have to perform allele calling without the --no-inferred parameter and ensure that only one process works with the schema while it is updated.
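As a rough sketch of that workflow (paths are placeholders; the flags are the same ones used elsewhere in this thread):
# 1) one-off run with a single assembly to create the pre-computed data
#    (only one process should use the schema during this step)
chewBBACA.py AlleleCall -i /path/to/single_genome_dir -g /path/to/schema -o /path/to/init_results --cpu 8
# 2) afterwards, any number of concurrent runs that leave the schema untouched
chewBBACA.py AlleleCall -i /path/to/dataset_dir -g /path/to/schema -o /path/to/results_dir --cpu 8 --no-inferred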
We'll add more detail about these options to the documentation. We're also working on a module that accepts results from previous allele calling runs and updates a schema with the novel alleles inferred during those runs. One possible use case for this module is to periodically update a schema that is used for concurrent allele calling with all the novel alleles that users have found.
Let us know if you run into any issues while trying to run concurrent processes with the --no-inferred parameter.
Rafael
@ramirma and @rfm-targa Thank you both so much for all of your help! I downloaded a fresh schema and successfully ran chewBBACA.py PrepExternalSchema. However, when I try to run AlleleCall with a single sample, as you suggested, I get the following error message:
% chewBBACA.py AlleleCall -i ~/chewbbaca_test/fastas/ -g /home/johnsont/shared/dbs/enterobase_cgMLSTv2/chewbbaca_enterobase_cgMLSTv2_salmonella_03APR23 -o ~/chewbbaca_test/AlleleCall --cpu ${SLURM_CPUS_PER_TASK} --no-inferred
chewBBACA version: 3.1.2
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt
==========================
chewBBACA - AlleleCall
==========================
Started at: 2023-04-03T17:12:52
Output directory exists. Will store results in /home/johnsont/millere/chewbbaca_test/AlleleCall/results_20230403T171252.
Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: None
CPU cores: 12
BLAST path: /home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 1
Number of loci: 3000
Determining sequence length mode for all loci...done.
Creating pre-computed hash tables...done.
Traceback (most recent call last):
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin/chewBBACA.py", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
functions_info[process][1]()
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
func(*args, **kwargs)
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
AlleleCall.main(genome_list, loci_list, args.schema_directory,
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2711, in main
loci_to_call = {file: schema_loci_fullpath.index(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2711, in <dictcomp>
loci_to_call = {file: schema_loci_fullpath.index(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: '/home/johnsont/shared/dbs/enterobase_cgMLSTv2/chewbbaca_enterobase_cgMLSTv2_salmonella_03APR23/STMMW_09791.fas' is not in list
I'm not sure what this means, as STMMW_09791.fas is present in the schema and looks identical to the other genes' allele sequence files. Do you have any insight into this?
Hello @eam12,
The issue you're reporting is caused by a limitation of the PrepExternalSchema module. The FASTA files in the external schema must end with .fasta; the schema will not be adapted correctly if the file extension is different. In your case, the files in the external schema seem to end in .fas, and the adapted schema will not include some of the data necessary to run the allele calling. I've downloaded the Salmonella enterica cgMLST schema from Enterobase and verified that if you change the .fas extension of the schema files to .fasta, you should be able to adapt the schema and run the AlleleCall module without any issues.
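If it helps, a small rename loop along these lines (a sketch; adjust the path to wherever you unpacked the Enterobase files) can be run before adapting the schema:
# rename every .fas file in the downloaded schema directory to .fasta
for f in /path/to/enterobase_schema/*.fas; do
    mv "$f" "${f%.fas}.fasta"
done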
We'll fix this issue in the PrepExternalSchema module and add the fix in the next release. The module will accept FASTA files ending in .fasta, .fna, .ffn, .fa, and .fas, and will change the file extension to .fasta in the adapted schema so that it is consistent with the other modules.
Let us know if changing the file extension from .fas to .fasta solves the issue.
Rafael
That makes perfect sense! I should have picked up on that. Once I changed the file extensions to .fasta, everything worked beautifully. Thank you very much for all of your assistance!
Hi,
I am getting the error message UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 39: invalid start byte (full example output below) when I run the AlleleCall step on some, but not all, of my datasets. AlleleCall runs successfully when the datasets are ~250 genomes, but starts giving me the error when the datasets are ~25 genomes in size. Based on this, I assume it has something to do with the number of input genomes, but I'm not sure what. Unlike the previous user who reported a similar error (#29), this error does interrupt the program and I don't get a results_alleles.tsv file.
Thanks so much.
System info: chewBBACA version 3.1.2, Python 3.11
Example Output: