B-UMMI / chewBBACA

BSR-Based Allele Calling Algorithm
GNU General Public License v3.0
129 stars 27 forks source link

Issue with fasta input files #169

Closed hkaspersen closed 1 year ago

hkaspersen commented 1 year ago

Dear authors, I am having some unexplained difficulties with running ChewBBACA on a set of Listeria genomes. The input files are regular fasta files which work fine in all other settings, assembled with SPAdes and corrected with Pilon. I also get the error when running ChewBBACA on genomes downloaded from NCBI.

This is the error I am getting:

Failed to predict CDS for 3 inputs.
Make sure that Prodigal runs in meta mode (--pm meta) if any input file has less than 100kbp.

Could not predict CDS for any of the input files.

Please provide input files in the accepted FASTA format.

I have attached three fasta files here. We are running ChewBBACA version 3.1.2.

test_assemblies.zip

Command:

chewBBACA.py AlleleCall -i assemblies/test_assemblies/ -g prepped_schema -o test_results --force-continue --output-missing --output-unclassified --mode 4 --cpu 4
rfm-targa commented 1 year ago

Hello @hkaspersen,

Thank you for your interest in chewBBACA and for sharing the genome assemblies. I've used the genome assemblies to create a schema and to perform allele calling. I could not reproduce the issue you've encountered. To create a new schema, I've used the following command:

chewBBACA.py CreateSchema -i test_assemblies -o lmonocytogenes_schema --ptf Listeria_monocytogenes.trn --cpu 4 --no-cleanup

With the following stdout:

chewBBACA version: 3.1.2
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

============================
  chewBBACA - CreateSchema
============================
Started at: 2023-03-22T11:31:35

Prodigal training file: /home/rmamede/Cloned_Repos/chewBBACA/CHEWBBACA/prodigal_training_files/Listeria_monocytogenes.trn
CPU cores: 4
BLAST Score Ratio: 0.6
Translation table: 11
Minimum sequence length: 201
Size threshold: 0.2
Word size: 5
Window size: 5
Clustering similarity: 0.2
Representative filter: 0.9
Intra-cluster filter: 0.9
Number of inputs: 3

Predicting gene sequences...

 [====================] 100%

Extracting coding sequences...

 [====================] 100%

Extracted a total of 8810 coding sequences from 3 genomes.

Removing duplicated DNA sequences...removed 3495 sequences.
Kept 5315 distinct sequences.

Removing sequences smaller than 201 nucleotides...removed 121 sequences.
Translating 5194 CDS...
 [====================] 100%
Identified 0 CDS that could not be translated.

Info about untranslatable and small sequences stored in /home/rmamede/Desktop/chewie312_test/lmonocytogenes_schema/invalid_cds.txt

Removing duplicated protein sequences...removed 812 sequences.

Kept 4382 sequences after filtering the initial sequences.
Clustering proteins...
 [====================] 100%
Clustered 4382 proteins into 3030 clusters.
Removing sequences based on high similarity with the cluster representative...removed 1138 sequences.
Identified 2834 singletons.
Remaining sequences after representative and singleton pruning: 3244
Removing sequences based on high similarity with other clustered sequences...removed 1 sequences.
Clusters to BLAST: 196
 [====================] 100%

Removed 210 sequences based on high BSR value with other sequences.
Total of 3033 sequences to compare in final BLAST.
Performing final BLASTp...
 [====================] 100%
Removed 13 sequences that were highly similar to other sequences.
Created schema seed with 3020 loci.

Finished at: 2023-03-22T11:31:54
Took  0m 19s.

I then performed allele calling with the schema and the same three genome assemblies. I used the following command:

chewBBACA.py AlleleCall -i test_assemblies -g lmonocytogenes_schema/schema_seed -o lmonocytogenes_results --force-continue --output-missing --output-unclassified --mode 4 --cpu 4

With the following stdout:

chewBBACA version: 3.1.2
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Documentation: https://chewbbaca.readthedocs.io/en/latest/index.html
Contacts: imm-bioinfo@medicina.ulisboa.pt

==========================
  chewBBACA - AlleleCall
==========================
Started at: 2023-03-22T11:58:52

Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: /home/rmamede/Desktop/chewie312_test/lmonocytogenes_schema/schema_seed/Listeria_monocytogenes.trn
CPU cores: 4
BLAST path: /home/rmamede/.conda/envs/chewie312/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 3
Number of loci: 3020

Determining sequence length mode for all loci...done.

Creating pre-computed hash tables...done.

== CDS prediction ==

Predicting CDS for 3 inputs...
 [====================] 100%

== CDS extraction ==

Extracting predicted CDS for 3 inputs...
 [====================] 100%
Extracted a total of 8810 CDS from 3 inputs.

== CDS deduplication ==

Identifying distinct CDS...identified 5315 distinct CDS.

== CDS exact matches ==

Searching for DNA exact matches...found 6314 exact matches (matching 3020 distinct alleles).
Unclassified CDS: 2295

== CDS translation ==

Translating 2295 CDS...
 [====================] 100%
Identified 0 CDS that could not be translated.
Information about untranslatable and small sequences stored in /home/rmamede/Desktop/chewie312_test/lmonocytogenes_results/temp/invalid_cds.txt
Unclassified CDS: 2295

== Protein deduplication ==

Identifying distinct proteins...identified 2285 distinct proteins.

== Protein exact matches ==

Searching for Protein exact matches...found 808 exact matches (808 distinct CDS, 808 total CDS).
Unclassified proteins: 1477

== Clustering ==

Translating schema's representative alleles...done.
Determining BLASTp raw score for each representative...done.
Creating minimizer index for representative alleles...done.
Created index with 225457 distinct minimizers for 3020 loci.
Clustering proteins...
 [====================] 100%
Clustered 1477 proteins into 1305 clusters.
Clusters to BLAST: 1305
 [====================] 100%
Classifying clustered proteins...
 [====================] 100%
Classified 1290 distinct proteins.
Unclassified proteins: 187

== Representative determination ==

Iteration 1
===========
Loci: 3020
BLASTing loci representatives against unclassified proteins...done.
Loci with high-scoring matches: 27
Classifying proteins...classified 36 proteins.
Selecting representatives for next iteration...selected 6 representatives.
Unclassified proteins: 151

Iteration 2
===========
Loci: 6
BLASTing loci representatives against unclassified proteins...done.
Loci with high-scoring matches: 2
Classifying proteins...classified 2 proteins.
Unclassified proteins: 149

== Wrapping up ==

Writing results_contigsInfo.tsv...done.
Writing paralogous_loci.tsv and paralogous_counts.tsv...done.
Detected number of paralogous loci: 0
Writing logging_info.txt...done.
Writing results_alleles.tsv...done.
Writing results_statistics.tsv...done.
Writing loci_summary_stats.tsv...done.
Writing FASTA file with unclassified CDSs...done.
Writing FASTA file with CDSs classified as ASM, ALM, NIPH, NIPHEM, PAMA, PLOT3, PLOT5 and LOTSC...done.
Classified a total of 8526 CDSs.
EXC: 6301
INF: 2070
NIPH: 44
NIPHEM: 19
ASM: 8
PLOT5: 3
PLOT3: 2
LOTSC: 0
ALM: 0
PAMA: 0
Added 2070 novel alleles to schema.
Added 6 representative alleles to schema.

Results available in /home/rmamede/Desktop/chewie312_test/lmonocytogenes_results

Finished at: 2023-03-22T11:59:28
Took  0m 35s.

I performed the same analysis without a training file, and it also completed successfully. I noticed that you might be using a schema that was adapted. To rule out issues related to schema adaptation, I adapted the schema with the PrepExternalSchema module and performed allele calling with and without a training file. Both processes completed successfully.

chewBBACA should detect most issues related with missing dependencies, but it's still important that you check their versions. Please verify and share the versions of Python, Prodigal, BLAST and other dependencies listed in the requirements. You should get an error if the issue is related to insufficient permissions. Still, please check that you have sufficient permissions to read the genome assemblies and write into the output directory (add the --no-cleanup flag to the command and verify that it can create the output directory, the temp directory inside the output directory and anything inside the temp directory). chewBBACA includes a function used to capture exceptions raised during multiprocessing, but I've noticed that that functionality might not work correctly on the latest Python version, v3.11. Some exceptions might slip through unnoticed if you're using that Python version hence why you might be only getting that warning when everything else seems to run smoothly. If everything is in order, creating a new Conda environment that only includes chewBBACA 3.1.2 and its dependencies (I advise you to use Python v3.9 and BLAST 2.9) might solve the issue and help identify its cause. Please let me know if this helps to solve the issue.

Rafael

hkaspersen commented 1 year ago

Dear @rfm-targa, thank you for this thorough reply! Unfortunately I was asking for help a bit too fast, the issue seemed to be related to the prodigal training file not being correct. Apologies!

rfm-targa commented 1 year ago

Greetings @hkaspersen,

Thank you for sharing the cause of the issue. I'll test it with several possibilities and improve the warning when this kind of issue arises.

Best regards,

Rafael