IndexError: list index out of range

Utkarsha-Mahanta commented 9 months ago

I encounter the following error: InputFunctionException in rule hmm_1_against_prots in file odp/scripts/odp, line 2136 Traceback: File "odp/scripts/odp", line 2164, in (rule hmm_1_against_prots, line 2453, odp/scripts/odp)

Command used: snakemake --cores 4 --snakefile odp/scripts/odp

conchoecia commented 9 months ago

Hi @Utkarsha-Mahanta, thank you. The config file you sent me via email looks fine, and everything is working on my end with my test dataset. Is there a way you can compress and send me your genomes and config file? I think this may be a problem particular to your dataset, and I will need to take a closer look to figure out what is causing the bug.

Utkarsha-Mahanta commented 9 months ago

Hi Darrin, I am not able to attach the data in this email thread as it exceeds the size limit, so I am sending you the data in the other mail thread.

Regards

Utkarsha Mahanta

DST-INSPIRE Fellow at SharmaG_omics Lab https://sites.google.com/view/sharmaglab/people?authuser=0

Department of Biotechnology, Indian Institute of Technology Hyderabad,

Sangareddy, Telangana, India 502285

https://scholar.google.com/citations?user=qSktj9UAAAAJ&hl=en&oi=ao https://github.com/Utkarsha-Mahanta https://orcid.org/0000-0002-7543-7931


conchoecia commented 9 months ago

Hi @Utkarsha-Mahanta - I ran scripts/odp from the main branch on your dataset and got this error:

RuleException:
OSError in file /scratch/molevo/dts/odp_main_MAINTONLY/odp/scripts/odp, line 304:
*********************************************************************
* ERROR:
*  Some protein sequences in your file are identical.
*  Each protein sequence must be unique.
*
*  The protein fasta with the problem is: data/AC_translated_cds.fasta
*  There are 1 duplicate sequences.
*  Here are the first 1 to 3:
*    - example
*
*  The reason this is problematic is that duplicate protein seqs
*   may interfere with proper reciprocal blastp match detection.
*
*  Please remove the identical sequences from the protein fasta
*   file, regenerate the chrom files, and try again.
*********************************************************************
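
For readers who want to locate the identical sequences before re-running the pipeline, the following is a minimal sketch of such a check. It is not the code odp uses; the function names `read_fasta` and `find_duplicate_sequences` and the example records are hypothetical, and the only assumption about the input is that it is an ordinary protein FASTA.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The record ID is the header text up to the first whitespace.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_duplicate_sequences(lines):
    """Return {sequence: [ids]} for sequences shared by two or more records."""
    ids_by_seq = defaultdict(list)
    for name, seq in read_fasta(lines):
        ids_by_seq[seq.upper()].append(name)
    return {seq: ids for seq, ids in ids_by_seq.items() if len(ids) > 1}


# Made-up records for illustration; protC repeats protA across two lines.
example = """>protA first copy
MSNKKRN
>protB
MNELSKENNIE
>protC second copy
MSNK
KRN
""".splitlines()

duplicates = find_duplicate_sequences(example)
for seq, ids in duplicates.items():
    print(f"{', '.join(ids)} share the sequence {seq}")
```

To run it on real data, pass an open file handle instead of `example`, for instance `find_duplicate_sequences(open("data/AC_translated_cds.fasta"))`.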

I think you must have had the duplicate_proteins: "pass" line in your config file because you encountered an error at a later step, hmm_1_against_prots.

Anyway, I realized that my error message doesn't fully explain how to move forward when you encounter it, so I added some more information:

OSError in file /scratch/molevo/dts/odp_main_MAINTONLY/odp/scripts/odp, line 304:
*********************************************************************
* ERROR:
*  Some protein sequences in your file are identical.
*  Each protein sequence must be unique.
*  If you aren't sure why, PLEASE READ THESE TWO LINKS below:
*    - https://github.com/conchoecia/odp/issues/49
*    - https://github.com/conchoecia/odp/issues/62
*
*  The protein fasta with the problem is: data/Ad1_translated_cds.fasta
*  There are 14 duplicate sequences.
*  Here are the first 1 to 3:
*    - prot1
*    - prot1
*    - prot1
*
*  The reason this is problematic is that duplicate protein seqs
*   may interfere with proper reciprocal blastp match detection.
*
*  Please remove the identical sequences from the protein fasta
*   file, regenerate the chrom files, and try again.
*                          -- OR --
*  !!! IF YOU WANT THIS ERROR TO GO AWAY without modifying your data,
*   set the 'duplicate_proteins' line in your 'config.yaml'
*   to "pass". The default is "fail"
*  !!! In other words, add this line to your 'config.yaml' file:
*   ```
*   duplicate_proteins: "pass"
*   ```
*
*  Your final file would look something like this:
*   ```
*   ignore_autobreaks: True
*   diamond_or_blastp: "diamond"
*   duplicate_proteins: "pass"
*   plot_LGs: True
*   plot_sp_sp: True
*   species:
*     Celegans:
*       proteins: /path/to/proteins_in_Cel_genome.fasta
*       chrom: /path/to/Cel_genome_annotation.chrom
*       genome: /path/to/Cel_genome_assembly.fasta
*       minscafsize: 1000000  # Only plots scaffolds that are 1 Mbp or longer
*     Homosapiens:
*       proteins: /path/to/Human_prots.fasta
*       chrom: /path/to/Human_annotation.chrom
*       genome: /path/to/Human_genome_assembly.fasta
*       minscafsize: 8000000  # Only plots scaffolds that are 8 Mbp or larger
*   ```
*********************************************************************

I then reran the program with duplicate_proteins: "pass" set and reproduced the problem you had.

I realized that the protein IDs in your protein fasta headers do not match the protein IDs specified in the chrom file. For the program to work, it needs to be able to match the coordinates of each protein in the genome with its protein sequence. In the sample that I checked, none of the protein IDs that appear in the chrom file appear in the protein fasta file. For example, one protein ID in the first column of the chrom file was an accession, WP_011419062.1, whereas the fasta headers look completely different, for example lcl|NC_007760.1_prot_1031.

I corrected the logic of the function that was supposed to catch this error (it wasn't working), and updated the error message to something more useful. You'll have to change your input data for the pipeline to finish.

*********************************************************************
* ERROR:
*  Some proteins in the .chrom file were not seen in the protein
*   .fasta file. This is problematic, because it could indicate
*   missing data, a problem generating the chrom file, or a problem
*   generating the protein fasta file.
*
*  The chrom file with the problem is: data/AC_annot.chrom
*  There are 4421 proteins in the .chrom not seen in the protein .fasta
*  Here are the first 1 to 3:
*    - prot1
*    - prot2
*    - prot3
*
*  The reason this is problematic is that we need to access every
*   protein specified in the .chrom file, but it is unavailable.
*
*  For example, in you sample one of the proteins we found in the first file of your chrom file was: prot1
*   This means that in the protein fasta file, there needs to be an entry with the same name.
*   In the protein fasta file, there should be one protein that looks like this, with a > character, the protein ID from the chrom file,
*   and then a newline or a space character ' '
*   ```
*   >prot1 Optional sequence description here. Only the name up to the first space character or newline matters.
*   MSNKKRN... (the protein's sequences on this and subsequent lines.)
*   >Next_protein_sequence
*   MNELSKENNIE....
*   ```
*  Please investigate whether there are too many entries in the .chrom
*   file, or if something is missing from the protein .fasta file.
*   Then, fix your files and re-run this pipeline.
*
* If you need more help, please visit:
*  https://github.com/conchoecia/odp?tab=readme-ov-file#chrom-file-specifications
*
*********************************************************************

conchoecia commented 9 months ago

@Utkarsha-Mahanta - closing for now. Please open the issue again if the pipeline does not complete after you are sure that the protein fasta file headers and chrom file column 1 values are the same.
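
One way to confirm that the two files agree before reopening is to compare column 1 of the chrom file with the FASTA header IDs directly. The sketch below is an illustration, not odp's own check. The helper names are hypothetical, the chrom file is assumed to be whitespace-delimited with the protein ID in the first column, and every value in the example other than the two IDs quoted in this thread is made up.

```python
def fasta_ids(lines):
    """Collect header IDs: the text after '>' up to the first whitespace."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def chrom_ids(lines):
    """Collect the first column of each non-empty chrom line, in file order."""
    ids = []
    for line in lines:
        fields = line.split()
        if fields:
            ids.append(fields[0])
    return ids


def missing_from_fasta(chrom_lines, fasta_lines):
    """Return chrom protein IDs that have no matching FASTA header ID."""
    known = fasta_ids(fasta_lines)
    return [pid for pid in chrom_ids(chrom_lines) if pid not in known]


# The two IDs are the ones quoted in this thread; the remaining columns
# and the sequence are placeholders for illustration only.
chrom_example = ["WP_011419062.1\tNC_007760.1\t+\t100\t400"]
fasta_example = [">lcl|NC_007760.1_prot_1031", "MSNKKRN"]

missing = missing_from_fasta(chrom_example, fasta_example)
print(f"{len(missing)} chrom protein(s) not found in the fasta: {missing}")
```

An empty list from `missing_from_fasta` means every protein in the chrom file has a FASTA record with the same ID. With real files, pass open file handles in place of the example lists.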

conchoecia / odp

IndexError: list index out of range #64