B-UMMI / chewBBACA

BSR-Based Allele Calling Algorithm
GNU General Public License v3.0

Errno 2 FileNotFoundError : both for CreateSchema and AlleleCall #160

Closed ainus1022 closed 1 year ago

ainus1022 commented 1 year ago

Dear ChewBBACA Team,

Thank you for providing chewBBACA! I am trying to run cgMLST with version 3.1.0 on 85 Clostridium perfringens genomes. I downloaded the scheme from the Ridom cgMLST server (https://www.cgmlst.org/ncs/schema/15017225/) and then ran the PrepExternalSchema command:

PrepExternalSchema -i /Users/okazaki/Desktop/Chewbbaca/Schema_download -o schema_dir

I prepared a training file by running Prodigal with the following command:

prodigal -i /Users/okazaki/Desktop/Chewbbaca/inputfiles/ATCC13124.fasta -t training_file_ATCC13124.trn -p single

However, when I ran the AlleleCall command, I got an error. This is the command I ran:

chewBBACA.py AlleleCall -i /Users/okazaki/Desktop/Chewbbaca/inputfiles -g /Users/okazaki/Desktop/Chewbbaca/schema_dir -o outdir/ --cpu 8 --ptf /Users/okazaki/Desktop/Chewbbaca/training_file_ATCC13124.trn

And this is the error message:

`Minimum sequence length: 0
Size threshold: 0.2
Translation table: 11
BLAST Score Ratio: 0.6
Word size: 5
Window size: 5
Clustering similarity: 0.2
Prodigal training file: /Users/okazaki/Desktop/Chewbbaca/training_file_ATCC13124.trn
CPU cores: 8
BLAST path: /Users/okazaki/mambaforge/envs/chewbbaca/bin
CDS input: False
Prodigal mode: single
Mode: 4
Number of inputs: 85
Number of loci: 1429

== CDS prediction ==

Predicting CDS for 85 inputs...
[====================] 100%

== CDS extraction ==

Extracting predicted CDS for 85 inputs...
[====================] 100%
Extracted a total of 307239 CDS from 85 inputs.

== CDS deduplication ==

Identifying distinct CDS...Traceback (most recent call last):
  File "/Users/okazaki/mambaforge/envs/chewbbaca/bin/chewBBACA.py", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
    functions_info[process][1]()
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
    func(*args, **kwargs)
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
    AlleleCall.main(genome_list, loci_list, args.schema_directory,
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2671, in main
    results = allele_calling(input_files, schema_directory, temp_directory,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2004, in allele_calling
    dna_dedup_results = cf.exclude_duplicates(cds_files, dna_dedup_dir,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/core_functions.py", line 242, in exclude_duplicates
    cds_file = fo.concatenate_files(dedup_files, cds_file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/file_operations.py", line 390, in concatenate_files
    with open(file, 'r') as infile:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'outdir/temp/3_cds_preprocess/cds_deduplication/distinct_cds_10.fasta'`

I checked the outdir/temp/3_cds_preprocess/cds_deduplication folder and found that the distinct_cds_10.fasta file is missing, although files with other numbers (e.g., 9 or 11) exist.

Also, when I tried to run the CreateSchema command, I ran into the same error. This is the command I ran:

ChewBBACA.py CreateSchema -i /Users/okazaki/Desktop/Chewbbaca/inputfiles -o schema_dir1 --cpu 8 --ptf /Users/okazaki/Desktop/Chewbbaca/training_file_ATCC13124.trn

The error message:

`Prodigal training file: /Users/okazaki/Desktop/Chewbbaca/training_file_ATCC13124.trn
CPU cores: 8
BLAST Score Ratio: 0.6
Translation table: 11
Minimum sequence length: 201
Size threshold: 0.2
Word size: 5
Window size: 5
Clustering similarity: 0.2
Representative filter: 0.9
Intra-cluster filter: 0.9
Number of inputs: 85

Predicting gene sequences...

[====================] 100%

Extracting coding sequences...

[====================] 100%

Extracted a total of 307239 coding sequences from 85 genomes.

Removing duplicated DNA sequences...Traceback (most recent call last):
  File "/Users/okazaki/mambaforge/envs/chewbbaca/bin/ChewBBACA.py", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
    functions_info[process][1]()
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
    func(*args, **kwargs)
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 211, in create_schema
    CreateSchema.main(vars(args))
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/CreateSchema/CreateSchema.py", line 519, in main
    results = create_schema_seed(input_files, output_directory, schema_name,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/CreateSchema/CreateSchema.py", line 271, in create_schema_seed
    ds_results = cf.exclude_duplicates(cds_files, dna_dedup_dir, cpu_cores,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/core_functions.py", line 242, in exclude_duplicates
    cds_file = fo.concatenate_files(dedup_files, cds_file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/okazaki/mambaforge/envs/chewbbaca/lib/python3.11/site-packages/CHEWBBACA/utils/file_operations.py", line 390, in concatenate_files
    with open(file, 'r') as infile:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'schema_dir1/temp/3_cds_preprocess/cds_deduplication/distinct_cds_10.fasta'`

Since I am just a beginner in bioinformatics, I may be doing something wrong. Thank you very much for any comments or suggestions!

Best regards, Aiko Okazaki

rfm-targa commented 1 year ago

Greetings @ainus1022,

Thank you for your interest in chewBBACA. I've adapted the schema from Ridom and performed allele calling with 58 complete genomes and 289 draft genome assemblies downloaded from the NCBI (only assemblies available through RefSeq and excluding any possible assemblies flagged as partial or anomalous). Everything completed without warnings or errors. I used a training file created based on the representative genome assembly for Clostridium perfringens (GCF_020138775.1).

The issue you report may be caused by problems in the genome assemblies you're passing. chewBBACA checks for several issues in the input files, but it might be something we've not anticipated. If you haven't done it yet, please check the quality of the genome assemblies and exclude assemblies with many ambiguous bases, too many contigs, unexpected genome size, etc. It's also important to ensure that each input has a unique identifier (the basename of each input before the first .) and that everything's in FASTA format.

I also suggest that you add a training file during schema adaptation, like so:

chewBBACA.py PrepExternalSchema -i /Users/okazaki/Desktop/Chewbbaca/Schema_download -o schema_dir --ptf trainingFile.trn

You won't need to define a training file for allele calling if you add it during schema adaptation. It is odd that it does not write one of the FASTA files during sequence deduplication. chewBBACA should capture any exceptions during that stage. One possibility is that it is not creating the file because it simply has no sequences to write into it. If you've checked the quality of the genome assemblies and nothing looks off, can you share the training file and the 85 genome assemblies? It would greatly help us to reproduce the issue and find a solution.
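The input requirements mentioned above (a unique identifier taken from the basename before the first `.`, and FASTA-formatted content) can be checked before running chewBBACA. The following is a minimal sketch using only the Python standard library; the function names `input_prefix` and `preflight` are hypothetical helpers and are not part of chewBBACA, and the FASTA check is only a rough heuristic (first non-blank line starts with `>`).

```python
from collections import defaultdict
from pathlib import Path


def input_prefix(path: Path) -> str:
    """Return the basename up to the first '.', e.g. 'CP12.contigs.fasta' -> 'CP12'."""
    return path.name.split(".", 1)[0]


def preflight(input_dir: str) -> dict:
    """Report inputs that share a prefix and files that do not look like FASTA.

    Hypothetical helper for illustration; it does not reproduce chewBBACA's
    own input validation.
    """
    by_prefix = defaultdict(list)
    not_fasta = []
    for path in sorted(Path(input_dir).iterdir()):
        # Skip directories and hidden files such as .DS_Store.
        if not path.is_file() or path.name.startswith("."):
            continue
        by_prefix[input_prefix(path)].append(path.name)
        # Rough format check: the first non-blank line should be a header.
        with open(path, "r", errors="replace") as handle:
            first = next((line for line in handle if line.strip()), "")
        if not first.startswith(">"):
            not_fasta.append(path.name)
    duplicates = {p: names for p, names in by_prefix.items() if len(names) > 1}
    return {"duplicate_prefixes": duplicates, "not_fasta": not_fasta}
```

Calling `preflight("/path/to/inputfiles")` returns a dictionary with two entries; both should be empty before the directory is passed to `-i`. Assembly quality (ambiguous bases, contig counts, genome size) still needs a separate tool.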

Rafael

ainus1022 commented 1 year ago

Dear Rafael @rfm-targa,

I appreciate your quick and kind reply. I understand that this problem may come from the data quality. I ran QUAST (https://github.com/ablab/quast.git) on each FASTA file, but could not find a file that is obviously different from the others.

I ran PrepExternalSchema as you suggested, and it was successful! I then ran the AlleleCall command with the new schema and ran into the same error.

Let me share my FASTA files here: https://we.tl/t-YBQjmnURfg (88 files in total; 3 were added). The 25 files named "CP" followed by a number are the strains I newly sequenced, and the others are sequences from NCBI or BV-BRC.

Best regards, Aiko

rfm-targa commented 1 year ago

Dear @ainus1022,

Thank you for sharing the data. I performed allele calling with the 88 FASTA files and encountered a different error than the one you shared. The cause of this error is probably the same as the one you're getting.

The list of FASTA files includes two files with blank spaces in the filename, AAD 1527a and CPN 17a. The blank space leads to errors during sequence deduplication when chewBBACA uses BioPython to read FASTA files and cannot get the whole sequence identifier due to the blank space (if the sequence identifier is >CPN 17a-protein1000, the id attribute of a BioPython sequence record will only be CPN, and we'll get an error when chewBBACA tries to split the sequence id based on the -protein substring to get the genome identifier and the CDS identifier attributed by chewBBACA). Renaming the files to AAD_1527a and CPN_17a eliminated this problem, and the allele calling process completed successfully. I suggest you avoid including blank spaces in the names of input files, as it might lead to similar problems with other bioinformatics software. The best approach is using simple and unique names (e.g., CP12, CP14, some examples from the dataset you shared).

I also noticed that some FASTA files include coding sequences instead of contigs. chewBBACA expects to receive complete or draft genome assemblies in its default mode. If you cannot get the genome assemblies, I suggest you perform allele calling for the FASTA files with CDSs separately and pass the --cds parameter to chewBBACA to skip the gene prediction with Prodigal.

Please let us know if renaming those files solves the issue.
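The identifier truncation described above can be illustrated with a short standard-library sketch. It assumes only the behaviour stated in this comment: the record id is the header text up to the first whitespace, and the id is then split on the `-protein` substring. The functions `fasta_record_id`, `split_cds_id`, and `safe_name` are hypothetical names written for this illustration; they are not chewBBACA or BioPython functions.

```python
import re


def fasta_record_id(header: str) -> str:
    """Mimic the id rule described above: keep the first whitespace-delimited token."""
    parts = header.lstrip(">").split(None, 1)
    return parts[0] if parts else ""


def split_cds_id(seqid: str) -> tuple[str, str]:
    """Split '<genome>-protein<N>' into (genome, N); hypothetical helper."""
    genome_id, marker, cds_index = seqid.rpartition("-protein")
    if not marker:
        # This is the situation a truncated id such as 'CPN' runs into.
        raise ValueError(f"no '-protein' marker in {seqid!r}")
    return genome_id, cds_index


def safe_name(filename: str) -> str:
    """Replace each run of whitespace in a filename with one underscore."""
    return re.sub(r"\s+", "_", filename.strip())


if __name__ == "__main__":
    for header in (">CPN 17a-protein1000", ">CPN_17a-protein1000"):
        seqid = fasta_record_id(header)
        try:
            print(header, "->", split_cds_id(seqid))
        except ValueError as error:
            print(header, "->", error)
```

With the blank space, the id becomes `CPN` and the split fails; with the underscore, the genome identifier and CDS index are recovered. `safe_name("CPN 17a.fasta")` gives `CPN_17a.fasta`, which matches the manual rename that resolved the problem.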

Best regards,

Rafael

ainus1022 commented 1 year ago

Dear @rfm-targa

Thank you very much for checking the data and for your comments. chewBBACA finally ran without any errors! I am really grateful for your help.

Aiko