Open kamivain opened 1 year ago
Dear @kamivain,
Thank you for your interest in chewBBACA. Please have a look at issue #176. I note that you are using Python 3.10. Although this should not be a problem, we advise using Python 3.9, which may also result in clearer error reporting. The other potential problem is a BLAST version above 2.9. Please downgrade if necessary, because we know there are incompatibilities. If downgrading BLAST does not solve the problem, there may be problems with the file or contig names. Please look into the previous issues reported on this.
Best Regards,
Mario
Dear Mario, Thank you for your reply. I will follow your advice and try the program again. Best Regards, kamivain
I have a similar problem, but if I run the command on a selection of the genomes it appears to be solved. Conversely, when I run it on the second part, the problem occurs again.
Greetings @Fla1487,
Thank you for your interest in chewBBACA. Based on what you report, it might be related to issues in one or several input files (badly formatted files, special characters in the filename or sequence headers, etc). Updating to the latest version may also help, as it solves several issues in older versions. If you cannot find the cause of the issue, please share what's printed to the stdout, as it might include enough information to determine the type of issue.
Kind regards,
Rafael
Edit: I have also tried setting Python to version 3.9 and BLAST to version 2.9, but that produced a new error (see file: allele_call_pubold12_err_new.txt)
Hi chewBBACA developer,
I encounter the exact same error, but only for a subset of my genomes. Initially, I tried to run AlleleCall on 2500 genomes, which failed with the same error. Then I ran AlleleCall separately on 4 batches of 600-700 genomes; some of them succeeded, but some failed (N = 937 genomes). This is what I have done:
Below is the code
chewBBACA.py AlleleCall -i ./fasta/ -g ./cgmlst_scheme -o ./output_pub-old1-1_2-3 --output-unclassified --output-missing --output-novel --cpu 8
The error file and output file of this run are attached. Could you kindly look into this and suggest what I can do? Thank you very much!
Best, Krisna allele_call_pubold12_err.txt allele_call_pubold12_out.txt
Hello @artmisk13,
Thank you for reporting this issue. We know of more users who have encountered this bug under similar circumstances. Based on what users report, it should be related to a single or a set of input files. We never got the same issue or managed to reproduce the error even when users shared data. That is the reason why we could not look into this properly. This error is strange because chewBBACA cannot get a sequence that should be in the FASTA file. Could you share a minimal test case that leads to the same error? For example, we can use the schema, a subset of the schema loci, and a genome to find and fix the issue. Any data you share with us is handled privately; we will only use it for bug fixing (you can upload a Zip with the data to WeTransfer and send the link to imm-bioinfo@medicina.ulisboa.pt).
Also, part of the problem might be related to the environment configuration. If you are using a conda environment to run chewie, can you run `conda list -n <ENV_NAME> --export > package-list.txt` to get the list of packages in the environment? If you share that file with us, we can create an environment with the same packages as yours, which might help us identify the issue if it is related to any specific package.
Lastly, the BLAST error `BLAST Database creation error: Multi-letters chain PDB id is not supported in v4 BLAST DB` should be related to input files with a short unique identifier (4 or fewer chars) that is numeric only (e.g. `123.fna` has `123` as a unique identifier). It should work if you change the file names to be composed of more than 4 chars or add a letter.
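A minimal sketch for spotting such files before running AlleleCall. It assumes the unique identifier is the part of the file name before the first dot, as in the `123.fna` example; the function name `find_risky_inputs` is illustrative and not part of chewBBACA.

```python
from pathlib import Path


def find_risky_inputs(input_dir, max_len=4):
    """Return file names whose identifier is numeric only and short.

    Assumption: the identifier is the file name up to the first dot
    (e.g. "123" for "123.fna"). This is a pre-check, not chewBBACA code.
    """
    risky = []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        unique_id = path.name.split(".")[0]
        # Flag identifiers made only of digits with max_len or fewer chars.
        if unique_id.isdigit() and len(unique_id) <= max_len:
            risky.append(path.name)
    return risky
```

Calling `find_risky_inputs("./fasta")` on the input directory would list the files worth renaming, for example by adding a letter prefix.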
Let us know if you can share some data and if changing the file names fixes the BLAST error.
Best regards,
Rafael
Hi Rafael,
Thanks for your thorough explanation and suggestions, they are really helpful!
So I'm guessing there is a problem somewhere in 1) reading the fasta files when the name only has numerical characters and 2) creating the "missing_classes.fasta" file when there is a '-' separator in the input fasta name (problem in string variable splitting?). The 2nd problem probably has been addressed in the newer chewie version. I hope this new information helps you further in debugging the AlleleCall module.
*Once the manuscript is accepted for publication, I am happy to upload the scheme to Chewie-NS so more people can use it!
Best, Krisna
Hello @artmisk13,
Thank you for sharing the details and data about the errors. It will help us a lot. We will probably change how IDs are processed internally to solve this kind of issue for good. Uploading the schema to Chewie-NS would be great. Just let us know when you'd like to do it, and we'll add you as a contributor or upload it if you prefer us to handle that. I will let you know when we have changed things.
Best regards,
Rafael
Hello @artmisk13,
We released chewBBACA v3.3.9. This version includes changes to check if BLAST interprets input unique IDs as PDB chain IDs or if it modifies the IDs at all. We use `makeblastdb` to create BLAST databases (DBs) to search for matches based on lists of identifiers. To use the list of identifiers, we need to include `-parse_seqids` when creating the DBs, and that leads to the issue where BLAST modifies some of the sequence IDs in the FASTA used to make the DB. This is a problem when we cannot match the IDs recovered from the DB to those in the original FASTA file. To avoid this issue, chewBBACA will warn users about any input files whose unique IDs lead to the issue. To continue, users will have to rename the files. This is safer than accepting the files and changing/checking everything internally to ensure it works.
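The idea of matching IDs recovered from the DB against those in the original FASTA file can be illustrated with a rough Python sketch. This is not chewBBACA's implementation: the helper names are hypothetical, and the list of IDs recovered from the BLAST DB is assumed to have been obtained separately (for example with `blastdbcmd`).

```python
def read_fasta_ids(fasta_path):
    """Return the sequence IDs (first token of each header) in a FASTA file."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                ids.append(header.split()[0] if header else "")
    return ids


def find_modified_ids(fasta_ids, db_ids):
    """Return FASTA IDs that are absent from the IDs recovered from the DB.

    Any ID returned here was altered (or dropped) when the DB was built,
    so it could not be matched back to the original FASTA file.
    """
    recovered = set(db_ids)
    return [seqid for seqid in fasta_ids if seqid not in recovered]
```

An empty result means every ID survived unchanged; a non-empty result points at the sequences, and therefore the input files, that would need renaming.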
Let us know if the latest version identifies the input files in your dataset that caused the issue.
Kind regards,
Rafael
Hello, I have encountered an issue when running AlleleCall on the genomes. It failed with "AttributeError: 'NoneType' object has no attribute 'seq'". What could be the cause? Thank you!
$ chewBBACA.py AlleleCall -i bu_genome -g bu_schema/schema_seed/ --gl bu_result_wgMLST/cgMLST/cgMLSTschema99.txt -o bu_result251_cgMLST --cpu 2
chewBBACA version: 3.2.0
Authors: Rafael Mamede, Pedro Cerqueira, Mickael Silva, João Carriço, Mário Ramirez
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt
========================== chewBBACA - AlleleCall
Started at: 2023-08-13T22:39:06
Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: bu_schema/schema_seed/bu_train.trn
CPU cores: 2
BLAST path: /usr/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 251
Number of loci: 971
== CDS prediction ==
Predicting CDS for 251 inputs... [====================] 100%
== CDS extraction ==
Extracting predicted CDS for 251 inputs... [====================] 100%
Extracted a total of 1694809 CDS from 251 inputs.
== CDS deduplication ==
Identifying distinct CDS...identified 603928 distinct CDS.
== CDS exact matches ==
Searching for DNA exact matches...found 194185 exact matches (matching 38271 distinct alleles).
Unclassified CDS: 565657
== CDS translation ==
Translating 565657 CDS... [====================] 100%
Identified 3633 CDS that could not be translated.
Information about untranslatable and small sequences stored in bu_result251_cgMLST/temp/invalid_cds.txt
Unclassified CDS: 562024
== Protein deduplication ==
Identifying distinct proteins...identified 296723 distinct proteins.
== Protein exact matches ==
Searching for Protein exact matches...found 5906 exact matches (22513 distinct CDS, 30655 total CDS).
Unclassified proteins: 290823
== Clustering ==
Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 81137 distinct minimizers for 971 loci.
Clustering proteins... [====================] 100%
Clustered 290823 proteins into 984 clusters.
Clusters to BLAST: 984 [====================] 100%
Classifying clustered proteins... [====================] 100%
Classified 11856 distinct proteins.
Unclassified proteins: 278967
== Representative determination ==
Iteration 1
Loci: 971
BLASTing loci representatives against unclassified proteins...done.
Traceback (most recent call last):
File "/home/yao/.local/bin/chewBBACA.py", line 8, in <module>
sys.exit(main())
File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/chewBBACA.py", line 1545, in main
functions_info[process][1]()
File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
func(*args, **kwargs)
File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/chewBBACA.py", line 528, in allele_call
AlleleCall.main(genome_list, loci_list, args.schema_directory,
File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2718, in main
results = allele_calling(input_files, schema_directory, temp_directory,
File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2510, in allele_calling
locus_results = expand_matches(match_info, prot_index, dna_index,
File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 1389, in expand_matches
target_protein = str(pfasta_index.get(target_id).seq)
AttributeError: 'NoneType' object has no attribute 'seq'
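The last frame shows the failure mode: `pfasta_index.get(target_id)` returned `None`, meaning `target_id` was not a key in the index, and accessing `.seq` on `None` raised the `AttributeError`. The sketch below reproduces that pattern with a plain dictionary standing in for the index and shows a guard that reports the missing ID instead. The `Record` type and the `get_sequence` helper are hypothetical and are not chewBBACA code.

```python
from collections import namedtuple

# Hypothetical stand-in for a sequence record with a .seq attribute.
Record = namedtuple("Record", ["id", "seq"])


def get_sequence(index, target_id):
    """Look up target_id and return its sequence as a string.

    dict.get() returns None for a missing key, so check before using .seq
    and name the ID that could not be found.
    """
    record = index.get(target_id)
    if record is None:
        raise KeyError(f"ID {target_id!r} not found in index")
    return str(record.seq)
```

With a guard like this, the error message names the ID that is missing from the index, which helps trace the problem back to a specific input file.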