BackofenLab / CRISPRcasIdentifier

Machine learning for accurate identification and classification of CRISPR-Cas systems
GNU General Public License v3.0

IndexError before creating predictions.csv #2

Closed almutwerner closed 3 years ago

almutwerner commented 3 years ago

Hello!

I run CRISPRcasIdentifier in a conda environment on a Slurm cluster. The casette and hmmsearch folders are created normally, but an IndexError occurs while predictions.csv is being created and the script aborts. This happens for all of my files; with the test files, everything ran smoothly.

My commands:

module load miniconda3/4.7.12.1
source activate crispr-env

cd /work_beegfs/sunam157/CRISPRcasIdentifier

for x in $(cat /work_beegfs/sunam157/MAG/all_1169_split_faa.txt); do
  echo "$x"
  python CRISPRcasIdentifier.py -f /work_beegfs/sunam157/MAG/MAGs_1169_faa_split/$x -p -st protein \
    -ho /work_beegfs/sunam157/MAG/CRISPRcasIdentifier_out/$x/hmmsearch/ \
    -co /work_beegfs/sunam157/MAG/CRISPRcasIdentifier_out/$x/casette/ \
    -o /work_beegfs/sunam157/MAG/CRISPRcasIdentifier_out/$x/predictions.csv
done

Output in log.out:

ANOR1.faa
Extracting /work_beegfs/sunam157/CRISPRcasIdentifier/HMM_sets.tar.gz
Extracting /work_beegfs/sunam157/CRISPRcasIdentifier/trained_models_2015.tar.gz
Running hmmsearch (log and outputs stored in /work_beegfs/sunam157/MAG/CRISPRcasIdentifier_out/ANOR1.faa/hmmsearch/)
Annotating proteins
Building cassettes
Saving cassette(s) to /work_beegfs/sunam157/MAG/CRISPRcasIdentifier_out/ANOR1.faa/casette/HMM1_cassette_arrays.txt
Saving cassette(s) to /work_beegfs/sunam157/MAG/CRISPRcasIdentifier_out/ANOR1.faa/casette/HMM3_cassette_arrays.txt
Saving cassette(s) to /work_beegfs/sunam157/MAG/CRISPRcasIdentifier_out/ANOR1.faa/casette/HMM5_cassette_arrays.txt

--------------------------------------------------

There are 1932 unlabeled proteins for cassette # 1 and HMM1
More than 2 missing proteins. Regression predictions will likely be weak.
ERT missing bit-score prediction for cassette #1, HMM1 and cas9 (1/1932): 0.090
ERT missing bit-score prediction for cassette #1, HMM1 and csa5 (2/1932): 0.082
ERT missing bit-score prediction for cassette #1, HMM1 and cas12 (3/1932): 0.073
ERT missing bit-score prediction for cassette #1, HMM1 and csb1 (4/1932): 0.050
ERT missing bit-score prediction for cassette #1, HMM1 and cmr5 (5/1932): 0.034
ERT missing bit-score prediction for cassette #1, HMM1 and csb2 (6/1932): 0.021
ERT missing bit-score prediction for cassette #1, HMM1 and csf1 (7/1932): 0.015
ERT missing bit-score prediction for cassette #1, HMM1 and cas8 (8/1932): 0.011
ERT missing bit-score prediction for cassette #1, HMM1 and csm6 (9/1932): 0.006
ERT missing bit-score prediction for cassette #1, HMM1 and cse2 (10/1932): 0.000
ERT missing bit-score prediction for cassette #1, HMM1 and cas11 (11/1932): 0.000
ERT missing bit-score prediction for cassette #1, HMM1 and DinG (12/1932): 0.000
ERT missing bit-score prediction for cassette #1, HMM1 and csm2 (13/1932): 0.000
ERT missing bit-score prediction for cassette #1, HMM1 and cmr7 (14/1932): 0.000
ERT missing bit-score prediction for cassette #1, HMM1 and csb3 (15/1932): 0.000
ERT missing bit-score prediction for cassette #1, HMM1 and csa3 (16/1932): 0.000

Output in log.err:

Traceback (most recent call last):
  File "CRISPRcasIdentifier.py", line 417, in <module>
    hmm_cassettes_reg = predict_missings(MODELS_DIR, reg, hmm_features, hmm_cassettes, hmm_missings)
  File "CRISPRcasIdentifier.py", line 286, in predict_missings
    j, f, pred = predictions[i]
IndexError: list index out of range

The zipped folder contains the output for one file plus the .faa itself.

Do you know what causes this error?

Almut

ANOR1.faa.tar.gz

padilha commented 3 years ago

Hello @Camaika Thanks for the report. I will reproduce this error on my machine and hope to fix it this weekend.

I will get back to you soon.

Best,
Victor

padilha commented 3 years ago

Hello again @Camaika

You are giving CRISPRcasIdentifier a protein fasta file containing a total of 2003 proteins as input, right? Does this file (ANOR1.faa) contain all the proteins extracted from an entire genome? Note that, when you give CRISPRcasIdentifier a protein fasta as input, it expects the file to refer to only one cassette (usually <= 15 proteins). We explicitly mention this detail in the README file:

"If -st is set to protein, CRISPRcasIdentifier assumes that the input fasta file contains only one cassette. For such, the expected cassette length is up to 15 proteins (more than that might produce unexpected results). If -st is set to dna, CRISPRcasIdentifier tries to build the protein cassettes after extracting the protein sequences using Prodigal".
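A quick way to spot this situation before running the tool is to count the records in the protein fasta. The following is only a minimal sketch, not part of CRISPRcasIdentifier: the function names are hypothetical, and the limit of 15 is taken from the README passage quoted above.

```python
# Hypothetical pre-check, not part of CRISPRcasIdentifier itself.
# The limit of 15 proteins comes from the README text quoted above.
MAX_CASSETTE_PROTEINS = 15


def count_fasta_records(lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def looks_like_single_cassette(lines, limit=MAX_CASSETTE_PROTEINS):
    """Return True if the input has at most `limit` protein records."""
    return count_fasta_records(lines) <= limit
```

Both functions accept any iterable of lines, so an open file handle works, for example `with open("ANOR1.faa") as handle: print(count_fasta_records(handle))`. A file with around 2000 records, as in this report, would fail the check and is better handled with -st dna.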

If you have the original DNA files, you can run the tool with the option -st dna and set the -sc parameter to complete or partial. In this case, the tool extracts the proteins using Prodigal and tries to build the genome's cassettes automatically (so you might get results for more than one cassette). If, for some reason, you don't have access to the original DNA files of the organisms you are studying, please ping me again in this issue, and I will discuss with my collaborators a workaround to extract multiple cassettes from a protein fasta file.

Best,
Victor

almutwerner commented 3 years ago

Hello @padilha

Thank you for your help! I am indeed working with whole genomes and should have thought about the cassette issue myself, my bad. Luckily, I still do have the DNA files and will use them as input (edit: and they ran through without any more issues \0/).

Thank you for your time,
Almut