B-UMMI / chewBBACA

BSR-Based Allele Calling Algorithm
GNU General Public License v3.0

Encountering an issue when running AlleleCall #182

Open kamivain opened 1 year ago

kamivain commented 1 year ago

Hello, I have encountered an issue when running AlleleCall on my genomes. It fails with "AttributeError: 'NoneType' object has no attribute 'seq'". What could be causing this? Thank you!

$ chewBBACA.py AlleleCall -i bu_genome -g bu_schema/schema_seed/ --gl bu_result_wgMLST/cgMLST/cgMLSTschema99.txt -o bu_result251_cgMLST --cpu 2

chewBBACA version: 3.2.0
Authors: Rafael Mamede, Pedro Cerqueira, Mickael Silva, João Carriço, Mário Ramirez
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

========================== chewBBACA - AlleleCall

Started at: 2023-08-13T22:39:06

Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: bu_schema/schema_seed/bu_train.trn
CPU cores: 2
BLAST path: /usr/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 251
Number of loci: 971

== CDS prediction ==

Predicting CDS for 251 inputs... [====================] 100%

== CDS extraction ==

Extracting predicted CDS for 251 inputs... [====================] 100%
Extracted a total of 1694809 CDS from 251 inputs.

== CDS deduplication ==

Identifying distinct CDS...identified 603928 distinct CDS.

== CDS exact matches ==

Searching for DNA exact matches...found 194185 exact matches (matching 38271 distinct alleles).
Unclassified CDS: 565657

== CDS translation ==

Translating 565657 CDS... [====================] 100%
Identified 3633 CDS that could not be translated.
Information about untranslatable and small sequences stored in bu_result251_cgMLST/temp/invalid_cds.txt
Unclassified CDS: 562024

== Protein deduplication ==

Identifying distinct proteins...identified 296723 distinct proteins.

== Protein exact matches ==

Searching for Protein exact matches...found 5906 exact matches (22513 distinct CDS, 30655 total CDS).
Unclassified proteins: 290823

== Clustering ==

Translating schema's representative alleles...done.
Creating minimizer index for representative alleles...done.
Created index with 81137 distinct minimizers for 971 loci.
Clustering proteins... [====================] 100%
Clustered 290823 proteins into 984 clusters.
Clusters to BLAST: 984 [====================] 100%
Classifying clustered proteins... [====================] 100%
Classified 11856 distinct proteins.
Unclassified proteins: 278967

== Representative determination ==

Iteration 1

Loci: 971
BLASTing loci representatives against unclassified proteins...done.
Traceback (most recent call last):
  File "/home/yao/.local/bin/chewBBACA.py", line 8, in <module>
    sys.exit(main())
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/chewBBACA.py", line 1545, in main
    functions_info[process][1]()
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
    func(*args, **kwargs)
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/chewBBACA.py", line 528, in allele_call
    AlleleCall.main(genome_list, loci_list, args.schema_directory,
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2718, in main
    results = allele_calling(input_files, schema_directory, temp_directory,
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2510, in allele_calling
    locus_results = expand_matches(match_info, prot_index, dna_index,
  File "/home/yao/.local/lib/python3.10/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 1389, in expand_matches
    target_protein = str(pfasta_index.get(target_id).seq)
AttributeError: 'NoneType' object has no attribute 'seq'

ramirma commented 1 year ago

Dear @kamivain,

Thank you for your interest in chewBBACA. Please have a look at issue #176. I note that you are using Python 3.10. Although this should not be a problem, we advise using Python 3.9, which may also result in clearer error reporting. Another potential problem is using a BLAST version above 2.9. Please downgrade if necessary, because we know there are incompatibilities. If downgrading BLAST does not solve the problem, there may be problems with the file or contig names. Please look into the previous issues reported on this.

Best Regards,

Mario

kamivain commented 1 year ago

Dear Mario,

Thank you for your reply. I will follow your advice and try the program again.

Best Regards,

kamivain


Fla1487 commented 10 months ago

I have a similar problem, but if I run the command on a selection of the genomes it appears to be solved. Conversely, when I run it on the second part, I get the problem again.

rfm-targa commented 10 months ago

Greetings @Fla1487,

Thank you for your interest in chewBBACA. Based on what you report, it might be related to issues in one or several input files (badly formatted files, special characters in the filename or sequence headers, etc). Updating to the latest version may also help, as it solves several issues in older versions. If you cannot find the cause of the issue, please share what's printed to the stdout, as it might include enough information to determine the type of issue.

Kind regards,

Rafael

artmisk13 commented 4 months ago

Edit: I have also tried setting Python to version 3.9 and BLAST to version 2.9, but that produced a new error (see file: allele_call_pubold12_err_new.txt).

Hi chewBBACA developer,

I encountered the exact same error, but only for a subset of my genomes. Initially, I tried to perform AlleleCall for 2500 genomes, which failed due to this error. Then I ran AlleleCall separately on 4 batches of 600-700 genomes; some of them succeeded, but some failed (N = 937 genomes). This is what I have done:

Below is the command:

chewBBACA.py AlleleCall -i ./fasta/ -g ./cgmlst_scheme -o ./output_pub-old1-1_2-3 --output-unclassified --output-missing --output-novel --cpu 8

The error file and output file of this run are attached. Could you please look into this and suggest what I should do? Thank you very much!

Best,

Krisna

Attachments: allele_call_pubold12_err.txt, allele_call_pubold12_out.txt

rfm-targa commented 4 months ago

Hello @artmisk13,

Thank you for reporting this issue. We know of more users who have encountered this bug under similar circumstances. Based on what users report, it should be related to a single or a set of input files. We never got the same issue or managed to reproduce the error even when users shared data. That is the reason why we could not look into this properly. This error is strange because chewBBACA cannot get a sequence that should be in the FASTA file. Could you share a minimal test case that leads to the same error? For example, we can use the schema, a subset of the schema loci, and a genome to find and fix the issue. Any data you share with us is handled privately; we will only use it for bug fixing (you can upload a Zip with the data to WeTransfer and send the link to imm-bioinfo@medicina.ulisboa.pt).
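The traceback earlier in this thread ends in `pfasta_index.get(target_id).seq`. As a minimal sketch of why a missing ID surfaces as this particular AttributeError, assume the index behaves like a Python dictionary, whose `.get()` returns `None` for an unknown key. The `Record` class, the example IDs, and both functions below are hypothetical stand-ins for illustration, not chewBBACA code.

```python
class Record:
    """Hypothetical stand-in for an indexed FASTA record."""

    def __init__(self, seq):
        self.seq = seq


# Hypothetical index mapping sequence IDs to records.
index = {"genomeA-protein1": Record("MKVL")}


def reproduce_error(index, seq_id):
    """Mimic the failing pattern: chained attribute access on a lookup
    that returns None when the ID is not in the index."""
    try:
        return str(index.get(seq_id).seq)
    except AttributeError as err:
        return f"AttributeError: {err}"


def get_sequence(index, seq_id):
    """Same lookup, but fail with a message that names the missing ID."""
    record = index.get(seq_id)  # None when the ID is absent
    if record is None:
        raise KeyError(f"sequence ID not found in index: {seq_id!r}")
    return str(record.seq)
```

With an explicit check like the one in `get_sequence`, the error message would at least name the ID that could not be found, which may help narrow the problem down to a specific input file.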

Also, part of the problem might be related to the environment configuration. If you are using a conda environment to run chewie, can you run conda list -n <ENV_NAME> --export > package-list.txt to get the list of packages in the environment? If you share that file with us, we can create an environment with the same packages as yours, which might help us identify the issue if it is related to any specific package.

Lastly, the BLAST error BLAST Database creation error: Multi-letters chain PDB id is not supported in v4 BLAST DB should be related to input files with a short, unique identifier (4 or fewer chars) and numeric only (e.g. 123.fna has 123 as a unique identifier). It should work if you change the file names to be composed of more than 4 chars or add a letter.
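A rough way to screen a folder for such file names before running AlleleCall could look like the sketch below. It assumes that the unique identifier is the part of the file name before the first dot (inferred from the `123.fna` example above) and flags identifiers that are numeric only or have 4 or fewer characters. The function names are made up for this example and are not part of chewBBACA.

```python
from pathlib import Path


def unique_id(filename):
    """Assumption: the unique identifier is the file name up to the first dot."""
    return Path(filename).name.split(".")[0]


def check_name(filename):
    """Return the reasons why a file name might trigger the BLAST error."""
    uid = unique_id(filename)
    reasons = []
    if uid.isdigit():
        reasons.append("numeric only")
    if len(uid) <= 4:
        reasons.append("4 or fewer characters")
    return reasons


def scan_directory(directory):
    """Map each flagged file name in a directory to its list of reasons."""
    flagged = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            reasons = check_name(path.name)
            if reasons:
                flagged[path.name] = reasons
    return flagged
```

For example, `scan_directory("./fasta")` would return a dictionary of the file names worth renaming first. Names that are flagged only for being short are not necessarily a problem; the check is deliberately conservative.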

Let us know if you can share some data and if changing the file names fixes the BLAST error.

Best regards,

Rafael

artmisk13 commented 4 months ago

Hi Rafael,

Thanks for your thorough explanation and suggestions, they are really helpful!

  1. Change file names: As you suggested, I changed the FASTA file names that had only 3 characters, but the same error still occurred. However, when I changed all FASTA file names so that they all contained letters, it finally worked! Out of curiosity, I then ran this using my initial chewie AlleleCall setup (ver 3.2, BLAST 2.14, Python 3.9.16), and:

So I'm guessing there is a problem somewhere in 1) reading the fasta files when the name only has numerical characters and 2) creating the "missing_classes.fasta" file when there is a '-' separator in the input fasta name (problem in string variable splitting?). The 2nd problem probably has been addressed in the newer chewie version. I hope this new information helps you further in debugging the AlleleCall module.

  2. Share data for debugging: I'm happy to share the scheme and the genomes to help you debug the problem. The scheme is publicly available from PubMLST: H. influenzae cgMLST. The "unpublished" status is there just because the manuscript is still under peer review*. The genomes are also publicly available (curated complete genomes from NCBI); these are the isolate IDs, and you can download the contigs from PubMLST.

*Once the manuscript is accepted for publication, I am happy to upload the scheme to Chewie-NS so more people can use it!

Best, Krisna
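The renaming workaround described in point 1 above could be planned with something like the following sketch. It only proposes new names and does not touch any files. The `g` prefix is an arbitrary choice, and replacing `-` with `_` reflects the guess above about the `-` separator rather than a confirmed cause. As before, the identifier is assumed to be the part of the file name before the first dot.

```python
def safe_name(filename, prefix="g"):
    """Propose a file name whose identifier contains a letter and no '-'."""
    stem, dot, rest = filename.partition(".")
    stem = stem.replace("-", "_")
    # Add a letter when the identifier has none (e.g. numeric-only names).
    if not any(char.isalpha() for char in stem):
        stem = prefix + stem
    return stem + dot + rest


def rename_plan(filenames):
    """Return {old_name: new_name} for the names that would change.

    Raises ValueError if two inputs would end up with the same new name.
    """
    proposed = {name: safe_name(name) for name in filenames}
    targets = list(proposed.values())
    if len(set(targets)) != len(targets):
        raise ValueError("two inputs would map to the same new name")
    return {old: new for old, new in proposed.items() if old != new}
```

Keeping the plan as a dictionary makes it easy to review the mapping, and to save it somewhere, before actually renaming anything, so that results can be traced back to the original file names.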

rfm-targa commented 4 months ago

Hello @artmisk13,

Thank you for sharing the details and data about the errors. It will help us a lot. We will probably change how IDs are processed internally to solve this kind of issue for good. Uploading the schema to Chewie-NS would be great. Just let us know when you'd like to do it, and we'll add you as a contributor or upload it if you prefer us to handle that. I will let you know when we have changed things.

Best regards,

Rafael

rfm-targa commented 3 months ago

Hello @artmisk13,

We released chewBBACA v3.3.9. This version includes changes to check if BLAST interprets input unique IDs as PDB chain IDs or if it modifies the IDs at all. We use makeblastdb to create BLAST databases (DBs) to search for matches based on lists of identifiers. To use the list of identifiers, we need to include -parse_seqids when creating the DBs, and that leads to the issue where BLAST modifies some of the sequence IDs in the FASTA used to make the DB. This is a problem because we then cannot match the IDs recovered from the DB to those in the original FASTA file. To avoid this issue, chewBBACA will warn users about any input files whose unique IDs trigger it. To continue, users will have to rename those files. This is safer than accepting the files and changing/checking everything internally to ensure it works. Let us know if the latest version identifies the input files in your dataset that caused the issue.

Kind regards,

Rafael
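The kind of consistency check described in the comment above can be illustrated with a small sketch: collect the IDs from the original FASTA and report those that are missing from the list of IDs recovered from the BLAST DB. This is not the code used in chewBBACA. The function names are hypothetical, and the list of recovered IDs is assumed to have been obtained separately.

```python
def fasta_ids(fasta_text):
    """Return the sequence IDs (first word of each header) in FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def unmatched_ids(original_ids, recovered_ids):
    """Return the original IDs that do not appear among the recovered IDs."""
    recovered = set(recovered_ids)
    return [seq_id for seq_id in original_ids if seq_id not in recovered]
```

If `unmatched_ids` returns anything, at least one ID was changed or lost between the FASTA file and the DB, and the input file that the ID came from would be a candidate for renaming.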