GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
125 stars 24 forks source link

IndexError: list index out of range #84

Closed DaveLutgen closed 1 year ago

DaveLutgen commented 1 year ago

Hi,

I'm trying to run the tama_orf_blastp_parser.py, but it seems to break down with blast output files bigger than 1G.

Input

python2 /data/oenanthe/scripts/dave_scripts/annotation_scripts/TAMA/tama_go/orf_nmd_predictions/tama_orf_blastp_parser.py -b /data/oenanthe/dave/Gemoma/reference_genome_annotation/isoseq/tama_orf_nmd/Brain_aa_new.outfmt6 -o /data/oenanthe/dave/Gemoma/reference_genome_annotation/isoseq/tama_orf_nmd/test.gff

Commannd line opening blastp file opening blastp file

Error message Traceback (most recent call last): File "/data/oenanthe/scripts/dave_scripts/annotation_scripts/TAMA/tama_go/orf_nmd_predictions/tama_orf_blastp_parser.py", line 381, in s_end = line.split()[3] IndexError: list index out of range

Any help would be appreciated! Thank you very much. All the best, Dave

GenomeRIK commented 1 year ago

Hi Dave,

Thank you for using TAMA.

Could you copy paste the command you used to create the blastp file? Could you also copy paste some lines from the input files?

This seems like it might be some unexpected formatting from one of the input files.

Thank you, Richard

DaveLutgen commented 1 year ago

Hi Richard,

Thank you for such a great set of tools!

Please find attached the command used to generate the output. I noticed that I mislabelled the output file with type 6 :

blastp \
-query $i \
-db /gpfs1/data/oenanthe/scripts/dave_scripts/annotation_scripts/GeMoMa/uniprot_sprot.fasta \
-max_target_seqs 1 \
-outfmt 0 \
-evalue 1e-5 \
-num_threads 16 > "/data/oenanthe/dave/Gemoma/reference_genome_annotation/isoseq/tama_orf_nmd/"$prefix"_10-5_new.outfmt6"

Attached is the first part of the output file. I have 7 tissues, one of which produces a file with less than 1G and for this one running tama_orf_blastp_parser.py works without any error message.

head_output.txt

Thanks again for your help

GenomeRIK commented 1 year ago

Hi Dave,

Ok it looks like maybe there is a formatting issue in the blastp file. Could you share the last 10 lines of that input file?

I am hoping we can find the issue at the end of file because if not it means the issue is somewhere internally which will be a bit harder to find.

Thank you, Richard

DaveLutgen commented 1 year ago

Hi Richard,

Please find attached the last 10 lines of the blastp output file:

Effective search space used: 5107146928

  Database: uniprot_sprot.fasta
    Posted date:  Jul 15, 2022  3:55 PM
  Number of letters in database: 204,940,973
  Number of sequences in database:  567,483

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40

Thanks again for your support. Cheers, Dave

GenomeRIK commented 1 year ago

Hi Dave,

It looks like there is something weird about how the spaces in your blastp file are represented.

Can you try running this on your blastp file?

cat Brain_aa_new.outfmt6 | grep "Sbjct" | awk -F "" '{print $NF}' | sort -n | uniq -c

Can you share the results here?

Thank you, Richard

DaveLutgen commented 1 year ago

Hi Richard,

The results look as follows:

cat Brain_aa_new.outfmt6 | grep "Sbjct" | awk -F "" '{print $NF}' | sort -n | uniq -c
      2
2122322 0
1464518 1
1474407 2
1448375 3
1426009 4
1416237 5
1427693 6
1400044 7
1447065 8
1483021 9

All the best, Dave

GenomeRIK commented 1 year ago

Hi Dave,

Ok something seems strange here with the formatting of that file. I also noticed that the name of the extension is "outfmt6" but in your command you specified out format 0. Do you know why these are not matched?

Were there different format files outputted from the run or from other runs that might have gotten mixed up? Did the output file get transferred from different computers? For example, was it downloaded from a server?

Thank you, Richard

DaveLutgen commented 1 year ago

Hi Richard,

For the first part, this is only a naming error. I left the outfmt6 in the name, even though it is output format 0.

All was done on the same compute cluster. I rerun the fastp command just to be sure that it was not transferred, mixed up or any of these. Nonetheless, the error still remains. Interestingly, if I run tama_orf_blastp_parser.py on a subset of the Brain_aa_new.outfmt6 file, first few thousand lines, the script does not produced the error message. I guess this means that at some point fastp generates output at a later stage that is not recognized by your python script.

Thank you! Dave

GenomeRIK commented 1 year ago

Hi Dave,

I just made a change to the tama_orf_blastp_parser.py script which should help us find the formatting issue here.

Can you clone/update your TAMA again and try running the full input file again? It should print out the line that is causing the problems and the line number. Could you copy and paste the offending line with 5 lines before and 5 lines after so I can see what is going on in the file?

Thank you, Richard

DaveLutgen commented 1 year ago

Hi Richard,

Please find below the ouput as requested

                        + G+ H++G+GS K AHS DEKWG + KSDQS+D         E++QQ
Sbjct  181  GSPRSEAGSGHSDPGTPHTDGEGSGKDAHSGDEKWGGDGKSDQSED---------EDKQQ  231

Query  177  NSDDEEKVQNSDEDERPQVSDDEERLQNSDEEKMQNSDDEERPQASDEEKMQNSDDDERA  236

Sbjct       ------------------------------------------------------------

Query  237  QRSDEEKMQNSDDEERAQHSDEEEQEHKSESARGSDSEDEVLRMKRKKPIA--SDSEAES  294
                     NSDDE   + SDEE +  KSES +GSDSED+  R K+KK IA  SDS++++
Sbjct  232  ---------NSDDER--ERSDEEGERQKSESIKGSDSEDDFTRKKKKK-IASDSDSDSDA  279

Thank you. All the best, Dave

GenomeRIK commented 1 year ago

Hi Dave,

Ok this makes sense now. the issue was with this line not having start and end coordinates for the Sbjct sequence.: Sbjct ------------------------------------------------------------

I have not seen this before and it looks like a dubious hit but I have now updated the script in the repo to handle this situation.

Could you reclone and try again and let me know if it runs ok now?

Thank you, Richard

DaveLutgen commented 1 year ago

Hi Richard,

It ran successfully with the following output opening blastp file opening blastp file Strange formatting of Sbjct line Sbjct ------------------------------------------------------------ Oddity on line: 44605147 Strange formatting of Sbjct line Sbjct ------------------------------------------------------------ Oddity on line: 44605659 Thanks a ton! All the best, Dave

GenomeRIK commented 1 year ago

That's great Dave!

Thank you again for using TAMA and please reach out again if you have any other questions or issues.

Best of luck, Richard