B-UMMI / chewBBACA

BSR-Based Allele Calling Algorithm
GNU General Public License v3.0

CreateSchema - ValueError: not enough values to unpack (expected 2, got 1) #194

Closed daraneda96 closed 7 months ago

daraneda96 commented 7 months ago

Hi everyone,

I'm having issues at the CreateSchema step. I believe the problem lies in the input files I'm using. I'm attempting to perform cgMLST on various Vibrio assemblies along with reference genomes obtained from NCBI, all of which are stored in a single folder. I am running the following command:

chewBBACA.py CreateSchema -i /home/daniel.araneda/analisis_vibrios/genomas_mlst -o /home/daniel.araneda/analisis_vibrios/mlst_all/mlst_schema --n schema_vibrio --ptf /home/daniel.araneda/analisis_vibrios/mlst_all/vibrio_tf

I encounter the following error:

CDS prediction

Predicting CDSs for 104 inputs...
[====================] 100%
Extracted a total of 460028 CDSs from 104 inputs.

CDS deduplication

Identifying distinct CDSs...

Error on determine_distinct:
Traceback (most recent call last):
  File "/home/daniel.araneda/miniconda3/envs/chewie/lib/python3.10/site-packages/CHEWBBACA/utils/multiprocessing_operations.py", line 42, in function_helper
    results = input_args-1
  File "/home/daniel.araneda/miniconda3/envs/chewie/lib/python3.10/site-packages/CHEWBBACA/utils/sequence_manipulation.py", line 537, in determine_distinct
    genome_id, protid = seqid.split('-protein')
ValueError: not enough values to unpack (expected 2, got 1)

It's worth mentioning that I didn't encounter any issues when using only my assemblies, but the error occurs when including the NCBI reference genomes.

What can I do to solve this error?

Thank you in advance and greetings to the entire community.

D.A.

rfm-targa commented 7 months ago

Greetings @daraneda96,

Thank you for your interest in chewBBACA. The issue seems related to a sequence ID attributed to at least one CDS predicted from the input genomes. Can you share the filenames of the reference genomes downloaded from the NCBI? The ID attributed to each CDS is based on the genome/input unique ID (the prefix before the first . in the file basename), and knowing the filenames might help identify the inputs/files leading to this issue. Also, are you using the latest stable release, v3.3.3? Thank you in advance.
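To make the naming rule above concrete, here is a minimal Python sketch. It is not chewBBACA's actual code: `genome_id_from_path` and `split_seqid` are hypothetical helpers that only imitate the behaviour described in this thread (ID taken from the prefix before the first `.` in the basename, and the two-value unpacking shown in the traceback). The sequence ID `M1AB15_k87-protein12` is an assumed example of the ID format.

```python
import os


def genome_id_from_path(path):
    """Hypothetical helper: prefix before the first '.' in the file basename."""
    return os.path.basename(path).split('.')[0]


def split_seqid(seqid):
    """Imitates the unpacking from the traceback; needs exactly one '-protein'."""
    genome_id, protid = seqid.split('-protein')
    return genome_id, protid


# A filename without spaces or extra dots gives a clean, unique ID.
print(genome_id_from_path('/data/M1AB15_k87.fasta'))

# Spaces and extra dots give a short, ambiguous ID such as 'Vibrio sp'.
print(genome_id_from_path('Vibrio sp. 10N.286.49.C2 MCUT00000000.fasta'))

# A well-formed sequence ID unpacks into two values.
print(split_seqid('M1AB15_k87-protein12'))

# An ID that lost its '-protein' part (an assumed failure mode, for example if
# it were cut at a space) cannot be unpacked and raises the reported ValueError.
try:
    split_seqid('Vibrio')
except ValueError as error:
    print(error)
```

Whether the ID is actually truncated at the space inside chewBBACA is an assumption here; the sketch only shows that any sequence ID missing the `-protein` separator triggers the same `ValueError`.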

Kind regards,

Rafael

daraneda96 commented 7 months ago

Thank you, Rafael. The filenames of the reference genomes look like these:

Vibrio pacinii GCF_000711795.fasta
Vibrio proteolyticus GCF_000467125.fasta
Vibrio sp. 10N.286.49.C2 MCUT00000000.fasta

The filenames of my assemblies look like these:

M1AB15_k87.fasta
C2B10_k91.fasta

Maybe the issue is that the filenames of the reference genomes start with Vibrio, followed by a space? And yes, I'm using v3.3.3. Looking forward to your feedback.

Daniel

rfm-targa commented 7 months ago

Greetings,

Yes, the spaces in the filenames will lead to issues. Replacing the spaces with _ should allow you to run CreateSchema without issues. We have a section in the FAQ about filenames, but in short I suggest avoiding blank spaces and making sure that the substring before the first . in the filename is a unique identifier that you consider adequate. Let me know if changing the filenames fixes the issue.
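The renaming suggested above can be scripted. The following is a minimal sketch in Python, not part of chewBBACA: the function names `plan_renames`, `duplicate_prefixes`, and `rename_in_directory` are hypothetical, and the script assumes all inputs sit in one directory. It replaces spaces with `_` and reports prefixes (the text before the first `.`) shared by more than one file.

```python
import os
from collections import Counter


def plan_renames(filenames):
    """Map each filename containing spaces to a version using underscores."""
    return {name: name.replace(' ', '_') for name in filenames if ' ' in name}


def duplicate_prefixes(filenames):
    """Return prefixes (text before the first '.') shared by several files."""
    counts = Counter(name.split('.')[0] for name in filenames)
    return sorted(prefix for prefix, count in counts.items() if count > 1)


def rename_in_directory(directory):
    """Apply the planned renames, refusing to overwrite existing files."""
    for old, new in plan_renames(sorted(os.listdir(directory))).items():
        target = os.path.join(directory, new)
        if os.path.exists(target):
            raise FileExistsError(target)
        os.rename(os.path.join(directory, old), target)


if __name__ == '__main__':
    names = ['Vibrio pacinii GCF_000711795.fasta', 'M1AB15_k87.fasta']
    print(plan_renames(names))
    print(duplicate_prefixes([new for new in plan_renames(names).values()]))
```

Note that replacing spaces alone may not be enough for names such as `Vibrio sp. 10N.286.49.C2 MCUT00000000.fasta`: the prefix before the first `.` would become `Vibrio_sp`, which several files could share. Running `duplicate_prefixes` on the renamed files is one way to spot such collisions before calling CreateSchema.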

Kind regards,

Rafael

daraneda96 commented 7 months ago

Problem solved, thank you very much Rafael. Regards.