Closed DaveLutgen closed 1 year ago
Hi Dave,
Thank you for using TAMA.
Could you copy paste the command you used to create the blastp file? Could you also copy paste some lines from the input files?
This seems like it might be some unexpected formatting from one of the input files.
Thank you, Richard
Hi Richard,
Thank you for such a great set of tools!
Please find attached the command used to generate the output. I noticed that I mislabelled the output file with type 6 :
blastp \
-query $i \
-db /gpfs1/data/oenanthe/scripts/dave_scripts/annotation_scripts/GeMoMa/uniprot_sprot.fasta \
-max_target_seqs 1 \
-outfmt 0 \
-evalue 1e-5 \
-num_threads 16 > "/data/oenanthe/dave/Gemoma/reference_genome_annotation/isoseq/tama_orf_nmd/"$prefix"_10-5_new.outfmt6"
Attached is the first part of the output file. I have 7 tissues, one of which produces a file with less than 1G and for this one running tama_orf_blastp_parser.py works without any error message.
Thanks again for your help
Hi Dave,
Ok it looks like maybe there is a formatting issue in the blastp file. Could you share the last 10 lines of that input file?
I am hoping we can find the issue at the end of file because if not it means the issue is somewhere internally which will be a bit harder to find.
Thank you, Richard
Hi Richard,
Please find attached the last 10 lines of the blastp output file:
Effective search space used: 5107146928
Database: uniprot_sprot.fasta
Posted date: Jul 15, 2022 3:55 PM
Number of letters in database: 204,940,973
Number of sequences in database: 567,483
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40
Thanks again for your support. Cheers, Dave
Hi Dave,
It looks like there is something weird about how the spaces in your blastp file are represented.
Can you try running this on your blastp file?
cat Brain_aa_new.outfmt6 | grep "Sbjct" | awk -F "" '{print $NF}' | sort -n | uniq -c
Can you share the results here?
Thank you, Richard
Hi Richard,
The results look as follows:
cat Brain_aa_new.outfmt6 | grep "Sbjct" | awk -F "" '{print $NF}' | sort -n | uniq -c
2
2122322 0
1464518 1
1474407 2
1448375 3
1426009 4
1416237 5
1427693 6
1400044 7
1447065 8
1483021 9
All the best, Dave
Hi Dave,
Ok something seems strange here with the formatting of that file. I also noticed that the name of the extension is "outfmt6" but in your command you specified out format 0. Do you know why these are not matched?
Were there different format files outputted from the run or from other runs that might have gotten mixed up? Did the output file get transferred from different computers? For example, was it downloaded from a server?
Thank you, Richard
Hi Richard,
For the first part, this is only a naming error. I left the outfmt6 in the name, even though it is output format 0.
All was done on the same compute cluster. I rerun the fastp command just to be sure that it was not transferred, mixed up or any of these. Nonetheless, the error still remains. Interestingly, if I run tama_orf_blastp_parser.py on a subset of the Brain_aa_new.outfmt6 file, first few thousand lines, the script does not produced the error message. I guess this means that at some point fastp generates output at a later stage that is not recognized by your python script.
Thank you! Dave
Hi Dave,
I just made a change to the tama_orf_blastp_parser.py script which should help us find the formatting issue here.
Can you clone/update your TAMA again and try running the full input file again? It should print out the line that is causing the problems and the line number. Could you copy and paste the offending line with 5 lines before and 5 lines after so I can see what is going on in the file?
Thank you, Richard
Hi Richard,
Please find below the ouput as requested
+ G+ H++G+GS K AHS DEKWG + KSDQS+D E++QQ
Sbjct 181 GSPRSEAGSGHSDPGTPHTDGEGSGKDAHSGDEKWGGDGKSDQSED---------EDKQQ 231
Query 177 NSDDEEKVQNSDEDERPQVSDDEERLQNSDEEKMQNSDDEERPQASDEEKMQNSDDDERA 236
Sbjct ------------------------------------------------------------
Query 237 QRSDEEKMQNSDDEERAQHSDEEEQEHKSESARGSDSEDEVLRMKRKKPIA--SDSEAES 294
NSDDE + SDEE + KSES +GSDSED+ R K+KK IA SDS++++
Sbjct 232 ---------NSDDER--ERSDEEGERQKSESIKGSDSEDDFTRKKKKK-IASDSDSDSDA 279
Thank you. All the best, Dave
Hi Dave,
Ok this makes sense now. the issue was with this line not having start and end coordinates for the Sbjct sequence.: Sbjct ------------------------------------------------------------
I have not seen this before and it looks like a dubious hit but I have now updated the script in the repo to handle this situation.
Could you reclone and try again and let me know if it runs ok now?
Thank you, Richard
Hi Richard,
It ran successfully with the following output
opening blastp file opening blastp file Strange formatting of Sbjct line Sbjct ------------------------------------------------------------ Oddity on line: 44605147 Strange formatting of Sbjct line Sbjct ------------------------------------------------------------ Oddity on line: 44605659
Thanks a ton!
All the best,
Dave
That's great Dave!
Thank you again for using TAMA and please reach out again if you have any other questions or issues.
Best of luck, Richard
Hi,
I'm trying to run the tama_orf_blastp_parser.py, but it seems to break down with blast output files bigger than 1G.
Input
python2 /data/oenanthe/scripts/dave_scripts/annotation_scripts/TAMA/tama_go/orf_nmd_predictions/tama_orf_blastp_parser.py -b /data/oenanthe/dave/Gemoma/reference_genome_annotation/isoseq/tama_orf_nmd/Brain_aa_new.outfmt6 -o /data/oenanthe/dave/Gemoma/reference_genome_annotation/isoseq/tama_orf_nmd/test.gff
Commannd line opening blastp file opening blastp file
Error message Traceback (most recent call last): File "/data/oenanthe/scripts/dave_scripts/annotation_scripts/TAMA/tama_go/orf_nmd_predictions/tama_orf_blastp_parser.py", line 381, in
s_end = line.split()[3]
IndexError: list index out of range
Any help would be appreciated! Thank you very much. All the best, Dave