B-UMMI / chewBBACA

BSR-Based Allele Calling Algorithm

Error: invalid start byte #173

Closed. eam12 closed this issue 1 year ago.

eam12 commented 1 year ago

Hi,

I am getting the error message UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 39: invalid start byte (full example output below) when I run the AlleleCall step on some, but not all, of my datasets. AlleleCall runs successfully when the datasets are ~250 genomes, but starts giving me the error when the datasets are ~25 genomes in size. Based on this I assume it has something to do with the number of input genomes, but I'm not sure what exactly.

Unlike the previous user who reported a similar error (#29), in my case the error interrupts the program and I don't get a results_alleles.tsv file.

Thanks so much.

System info: chewBBACA version 3.1.2, Python 3.11

Example Output:

chewBBACA version: 3.1.2
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==========================
  chewBBACA - AlleleCall
==========================
Started at: 2023-03-31T00:16:30

Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: None
CPU cores: 12
BLAST path: /home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 23
Number of loci: 3000

== CDS prediction ==

Predicting CDS for 23 inputs...
 [====================] 100%

== CDS extraction ==

Extracting predicted CDS for 23 inputs...
 [====================] 100%
Extracted a total of 104356 CDS from 23 inputs.

== CDS deduplication ==

Identifying distinct CDS...identified 11086 distinct CDS.

== CDS exact matches ==

Searching for DNA exact matches...found 64358 exact matches (matching 5609 distinct alleles).
Unclassified CDS: 5478

== CDS translation ==

Translating 5478 CDS...
 [====================] 100%
Identified 0 CDS that could not be translated.
Information about untranslatable and small sequences stored in /scratch.global/johnsont/out_Rissen/cgmlst/AlleleCall_1/temp/invalid_cds.txt
Unclassified CDS: 5478

== Protein deduplication ==

Identifying distinct proteins...identified 4958 distinct proteins.

== Protein exact matches ==

Searching for Protein exact matches...found 310 exact matches (433 distinct CDS, 4572 total CDS).
Unclassified proteins: 4802

== Clustering ==

Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 0 distinct minimizers for 3000 loci.
Clustering proteins...
 [====================] 100%
Clustered 4802 proteins into 0 clusters.
Unclassified proteins: 4802

== Representative determination ==

Iteration 1
===========
Loci: 0
BLASTing loci representatives against unclassified proteins...done.
Loci with high-scoring matches: 0

== Wrapping up ==

Writing results_contigsInfo.tsv...done.
Writing paralogous_loci.tsv and paralogous_counts.tsv...done.
Detected number of paralogous loci: 0
Traceback (most recent call last):
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin/chewBBACA.py", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
    functions_info[process][1]()
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
    func(*args, **kwargs)
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
    AlleleCall.main(genome_list, loci_list, args.schema_directory,
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2825, in main
    total_hashes = update_hash_tables(added[2], loci_to_call,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 239, in update_hash_tables
    current_dna_table = fo.pickle_loader(latest_dna_table)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/utils/file_operations.py", line 284, in pickle_loader
    content = pickle.load(pinfile)
              ^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 39: invalid start byte
rfm-targa commented 1 year ago

Hello @eam12,

Thank you for sharing the stdout printed by chewBBACA. Based on that output, the clustering step seems to indicate that some files might be missing from your schema. In the following part of the stdout:

== Clustering ==

Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 0 distinct minimizers for 3000 loci.
Clustering proteins...
 [====================] 100%
Clustered 4802 proteins into 0 clusters.
Unclassified proteins: 4802

The minimizer index should have more than 0 distinct minimizers. This might happen because the short directory in your schema does not contain the FASTA files with the representative sequences, or because those files cannot be listed. Please verify that your schema includes a subdirectory named short and that it contains one FASTA file per locus in the schema. You should also check that your schema includes a subdirectory named pre_computed. If it does, can you please share a list of the names of the files in that directory? Let me know what you find.
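
If it helps, here is a minimal sketch to check both directories (the schema path is a placeholder, and it assumes the loci FASTA files sit at the top level of the schema directory, ending in .fasta, with matching *_short.fasta files in short/):

import os

# Placeholder path; point this at your schema directory.
schema_dir = "/path/to/schema"
short_dir = os.path.join(schema_dir, "short")
pre_computed_dir = os.path.join(schema_dir, "pre_computed")

# Count the loci FASTA files and the representative FASTA files in short/.
loci_files = [f for f in os.listdir(schema_dir) if f.endswith(".fasta")]
short_files = [f for f in os.listdir(short_dir) if f.endswith("_short.fasta")]
print(f"{len(loci_files)} loci FASTA files, {len(short_files)} files in short/")

# List the contents of pre_computed/, if the directory exists.
if os.path.isdir(pre_computed_dir):
    print("\n".join(sorted(os.listdir(pre_computed_dir))))
else:
    print("No pre_computed directory found.")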

Rafael

eam12 commented 1 year ago

Thank you so much for your prompt reply! I do have a directory named short in my schema, and it contains a single FASTA file for each of the 3000 loci (e.g. SC0831_short.fasta). There is also a subdirectory named pre_computed in my schema that contains the following files: DNAtable1 DNAtable2 DNAtable3 DNAtable4 DNAtable5 DNAtable6 DNAtable7 DNAtable8 DNAtable9 DNAtable10 DNAtable11 DNAtable12 DNAtable13 DNAtable14 DNAtable15 DNAtable16 DNAtable17 DNAtable18 DNAtable19 DNAtable20 DNAtable21 DNAtable22 DNAtable23 DNAtable24 DNAtable25 DNAtable26 DNAtable27 PROTEINtable1 PROTEINtable2 PROTEINtable3 PROTEINtable4 PROTEINtable5 PROTEINtable6 PROTEINtable7 PROTEINtable8 PROTEINtable9 PROTEINtable10 PROTEINtable11 PROTEINtable12 PROTEINtable13 PROTEINtable14 PROTEINtable15 PROTEINtable16 PROTEINtable17 PROTEINtable18 PROTEINtable19 PROTEINtable20 PROTEINtable21 PROTEINtable22 PROTEINtable23 PROTEINtable24 PROTEINtable25 PROTEINtable26 PROTEINtable27

Something I just thought about: are any files being modified within the schema as AlleleCall is running? I am using chewBBACA within a job array, so AlleleCall runs anywhere from 2 to 16 times simultaneously with different datasets. Could that be the issue here?

ramirma commented 1 year ago

@eam12 running concurrent allele calls is probably the cause of the error you get. chewBBACA was not designed for this when run in the standard mode; new representative alleles are added to the files in the short directory, for instance.

eam12 commented 1 year ago

Darn! I've been trying to incorporate chewBBACA into a pipeline I'm creating, but this will make that hard. Is there any way around this short of creating duplicate schemas, one for each job in the array? Thanks so much!

ramirma commented 1 year ago

Dear @eam12,

You may run chewBBACA in such a way that no new alleles are added to the database. That way the schema will not be altered, which allows running multiple instances at the same time. You may want to have a look at issue #168. I would warn you, though, that in this way your schema will remain static; see the example invocation below.

I hope to have clarified your questions.

Mario
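
For reference, a run that leaves the schema untouched would look something like the command below (the paths are placeholders; --no-inferred is the parameter that prevents novel alleles from being written to the schema):

chewBBACA.py AlleleCall -i /path/to/assemblies/ -g /path/to/schema/ -o /path/to/results/ --cpu 12 --no-inferred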

eam12 commented 1 year ago

Hi Mario,

Thank you for the advice. I did try running a single AlleleCall job with the --no-inferred flag, with and without the --mode 1 flag, but now I'm getting a similar error message (see below). I'm also getting the error if I run a single job without either flag. I'm unsure what the issue is now. Could the database still be messed up from previous runs that errored out?

2023-04-02 13:29:01: Running chewBBACA...

chewBBACA version: 3.1.2
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==========================
  chewBBACA - AlleleCall
==========================
Started at: 2023-04-02T13:29:01

Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: None
CPU cores: 12
BLAST path: /home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 50
Number of loci: 3000
Traceback (most recent call last):
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin/chewBBACA.py", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
    functions_info[process][1]()
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
    func(*args, **kwargs)
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
    AlleleCall.main(genome_list, loci_list, args.schema_directory,
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2691, in main
    loci_modes = fo.pickle_loader(loci_modes_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/utils/file_operations.py", line 284, in pickle_loader
    content = pickle.load(pinfile)
              ^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
ramirma commented 1 year ago

@eam12, it does seem so. We store a number of things in pickle files when running chewBBACA, and the previous errors may have corrupted some of these files. My suggestion would be to start with a new conda environment and a new schema, i.e. a complete fresh start; that way we make sure there are no lingering problems from previous errors. You should then be able to complete a run without problems. A quick way to check which of the pickled files are damaged is sketched below.

Mario
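
As a rough sketch (the path is a placeholder), the following simply tries to unpickle every file in pre_computed/ and flags the ones that fail:

import os
import pickle

# Placeholder path; point this at the schema's pre_computed directory.
pre_computed_dir = "/path/to/schema/pre_computed"

for name in sorted(os.listdir(pre_computed_dir)):
    path = os.path.join(pre_computed_dir, name)
    try:
        with open(path, "rb") as infile:
            pickle.load(infile)
        print(f"{name}: OK")
    except Exception as exc:
        # Truncated or partially written pickles raise errors such as
        # UnpicklingError or UnicodeDecodeError, like the ones in the tracebacks above.
        print(f"{name}: corrupted ({exc})")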

rfm-targa commented 1 year ago

@eam12, to add something to what @ramirma has explained about performing allele calling concurrently. You'll need to start with a fresh schema and perform a single allele calling process to create the pre-computed data. You can use a single genome assembly; it's only essential to generate the pre-computed data that chewBBACA uses to speed up the allele calling. After that, multiple users can concurrently perform allele calling based on the same schema if they pass the --no-inferred parameter. chewBBACA will still identify novel alleles and include them in the final results, but those alleles will not be added to the schema, and the pre-computed files will not be updated.

If you ever want to add new alleles to the schema, you'll have to perform allele calling without the --no-inferred parameter and ensure that there's only one process working with the schema while it is updated. We'll add more detail about these options to the documentation.

We're also working on a module that accepts results from previous allele calling runs and updates a schema with the novel alleles inferred during those runs. One possible use case for this module is to periodically update a schema that is used for concurrent allele calling with all the novel alleles that users have found.

Let us know if you run into any issues while trying to run concurrent processes with the --no-inferred parameter.
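
As a rough sketch of that workflow (all paths are placeholders), the bootstrap run and the later concurrent runs could look like this:

# One-off run on a single assembly to generate the pre-computed data (this run updates the schema):
chewBBACA.py AlleleCall -i /path/to/single_assembly/ -g /path/to/schema/ -o /path/to/bootstrap_results/ --cpu 12

# Concurrent runs afterwards, each with its own input and output directory (the schema is not modified):
chewBBACA.py AlleleCall -i /path/to/dataset_A/ -g /path/to/schema/ -o /path/to/results_A/ --cpu 12 --no-inferred
chewBBACA.py AlleleCall -i /path/to/dataset_B/ -g /path/to/schema/ -o /path/to/results_B/ --cpu 12 --no-inferred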

Rafael

eam12 commented 1 year ago

@ramirma and @rfm-targa Thank you both so much for all of your help! I downloaded a fresh scheme and successfully ran chewBBACA.py PrepExternalSchema. However, when I try to run AlleleCall with a single sample, as you suggested, I get the following error message:

% chewBBACA.py AlleleCall -i ~/chewbbaca_test/fastas/ -g /home/johnsont/shared/dbs/enterobase_cgMLSTv2/chewbbaca_enterobase_cgMLSTv2_salmonella_03APR23 -o ~/chewbbaca_test/AlleleCall --cpu ${SLURM_CPUS_PER_TASK} --no-inferred

chewBBACA version: 3.1.2
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==========================
  chewBBACA - AlleleCall
==========================
Started at: 2023-04-03T17:12:52

Output directory exists. Will store results in /home/johnsont/millere/chewbbaca_test/AlleleCall/results_20230403T171252.
Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: None
CPU cores: 12
BLAST path: /home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 1
Number of loci: 3000

Determining sequence length mode for all loci...done.

Creating pre-computed hash tables...done.
Traceback (most recent call last):
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/bin/chewBBACA.py", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
    functions_info[process][1]()
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
    func(*args, **kwargs)
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
    AlleleCall.main(genome_list, loci_list, args.schema_directory,
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2711, in main
    loci_to_call = {file: schema_loci_fullpath.index(file)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnsont/shared/.conda/envs/chewbbaca_3.1.2/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2711, in <dictcomp>
    loci_to_call = {file: schema_loci_fullpath.index(file)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: '/home/johnsont/shared/dbs/enterobase_cgMLSTv2/chewbbaca_enterobase_cgMLSTv2_salmonella_03APR23/STMMW_09791.fas' is not in list

I'm not sure what this means, as STMMW_09791.fas is present in the schema and looks identical to the other genes' allele sequence files. Do you have any insight into this?

rfm-targa commented 1 year ago

Hello @eam12,

The issue you're reporting is caused by a limitation of the PrepExternalSchema module. The FASTA files in the external schema must end with .fasta. The schema will not be adapted correctly if the file extension is different. In your case, the files in the external schema seem to end in .fas and the final schema will not include some data that are necessary to run the allele calling.

I've downloaded the Salmonella enterica cgMLST schema from Enterobase and verified that if you change the .fas extension in the schema files to .fasta, you should be able to adapt and run the AlleleCall module without any issues.

We'll fix this issue in the PrepExternalSchema module and add the fix in the next release. The module will accept FASTA files terminating in .fasta, .fna, .ffn, .fa and .fas and change the file extension to .fasta in the adapted schema so that it is consistent with other modules.

Let us know if changing the file extension from .fas to .fasta solves the issue.
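
If it helps, renaming the extensions can be done with a few lines of Python (the schema path is a placeholder; this assumes the downloaded files all end in .fas):

import os

# Placeholder path; point this at the downloaded Enterobase schema directory.
schema_dir = "/path/to/enterobase_schema"

for name in os.listdir(schema_dir):
    if name.endswith(".fas"):
        old_path = os.path.join(schema_dir, name)
        # Swap the .fas extension for .fasta before running PrepExternalSchema.
        os.rename(old_path, old_path[:-4] + ".fasta")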

Rafael

eam12 commented 1 year ago

That makes perfect sense! I should have picked up on that. Once I changed the file extensions to .fasta, everything worked beautifully. Thank you very much for all of your assistance!