gatech-genemark / ProtHint

Protein hint generation pipeline for gene finding in eukaryotic genomes

Error: Error detecting input file format. First line seems to be blank. #45

Closed by qussai96 2 years ago

qussai96 commented 2 years ago

Hi Tomas, I am trying to run prothint with seeds generated by braker. I am getting the following error:

#CPU threads: 16
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory:
#Target sequences to report alignments for: 25
Opening the database...  [0.109s]
Database: diamond_db.dmnd (type: Diamond database, sequences: 24502072, letters: 9300670097)
Block size = 2000000000
Opening the input file...  [0s]
Error: Error detecting input file format. First line seems to be blank.
[Thu Aug 25 20:19:17 2022] error: ProtHint exited due to an error in command: /home/qabbas/Plants/Tools/ProtHint-2.6.0/bin/../dependencies/diamond blastp --query ../seed_proteins.faa --db diamond_db --outfmt 6 qseqid sseqid --out diamond.out --max-target-seqs 25 --max-hsps 1 --threads 16 --evalue 0.001

I tried replacing the diamond binary in the dependencies folder with the latest version, but this didn't help; I still get the same error.

The command I am running is: prothint.py --threads=16 --workdir=ProtHint_dir --geneSeeds=./braker_seeds/augustus.hints.gtf GCF_000001735.4_TAIR10.1_genomic.fna.masked orthodb_without_AT.fasta

Thanks,

qussai96 commented 2 years ago

It seems the problem is with the script "proteins_from_gtf.pl", as the seed_proteins.faa file is empty. After checking my seeds file "augustus.hints.gtf", it appears to have an unexpected format. Comparing it to your example/input/genemark.gtf file in [https://github.com/gatech-genemark/ProtHint/tree/master/example/input], I found that my seeds file contains gene and transcript predictions, not only exons, introns, and CDS. Any ideas on how to solve this issue?

Thanks,

tomasbruna commented 2 years ago

ProtHint should still work with the BRAKER seeds (anything other than the CDS lines is ignored).

Can you share your input files (by email to bruna.tomas@gmail.com if you don't want to share here)? I will try to run ProtHint myself and reproduce this error.

Best, Tomas

qussai96 commented 2 years ago

Thank you @tomasbruna for your reply. I have sent the input files by email.

best,

tomasbruna commented 2 years ago

Hi Qussai,

thanks for sending the files. The error is caused by a mismatch in contig names between augustus.hints.gtf and GCF_000001735.4_TAIR10.1_genomic.fna. One of them has spaces between words:

NC_003070.9 Arabidopsis thaliana chromosome 1 sequence

and the other one uses underscores:

NC_003070.9_Arabidopsis_thaliana_chromosome_1_sequence
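A mismatch like this can be spotted before running the pipeline by comparing the contig names in the two files. The following is a minimal sketch, not part of ProtHint: the function names are hypothetical, and it assumes that the full FASTA header line (everything after `>`) is the name that must equal the first tab-separated column of the GTF. How ProtHint matches names internally is not confirmed here.

```python
def fasta_contig_names(lines):
    """Collect full header lines (without the leading '>') from FASTA text lines."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def gtf_contig_names(lines):
    """Collect first-column values from GTF lines, skipping comments and blank lines."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_contigs(fasta_lines, gtf_lines):
    """Return GTF contig names that have no identical header in the FASTA."""
    return sorted(gtf_contig_names(gtf_lines) - fasta_contig_names(fasta_lines))


# Tiny in-memory example using the two names from this thread
# (the sequence and the GTF feature fields are made up for illustration).
fasta = [">NC_003070.9 Arabidopsis thaliana chromosome 1 sequence", "ACGT"]
gtf = [
    "NC_003070.9_Arabidopsis_thaliana_chromosome_1_sequence"
    "\tAUGUSTUS\tCDS\t1\t3\t.\t+\t0\tgene_id \"g1\";"
]
print(unmatched_contigs(fasta, gtf))
```

For real files, the same functions can be fed open file handles, for example `unmatched_contigs(open(genome_path), open(seeds_path))`. An empty list means every contig named in the seeds file has an identical header in the genome.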

It is possible that the underscores were automatically added over the course of a ProtHint run. In any case, the error can be fixed by renaming the contigs so that they match in both files.
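One way to do the renaming is to apply the same normalization to both files. The sketch below is only an illustration under stated assumptions: it replaces whitespace in contig names with underscores, which is one of the two conventions seen above, and the helper names (`normalize_name`, `rewrite`, and so on) are hypothetical rather than ProtHint utilities. Shortening the names to the accession alone would work equally well, as long as both files agree.

```python
import re


def normalize_name(name):
    """Replace each run of whitespace in a contig name with a single underscore."""
    return re.sub(r"\s+", "_", name.strip())


def normalize_fasta_line(line):
    """Normalize a FASTA header line; sequence lines are returned unchanged."""
    if line.startswith(">"):
        return ">" + normalize_name(line[1:])
    return line.rstrip("\n")


def normalize_gtf_line(line):
    """Normalize the first (contig) column of a GTF line; other columns are untouched."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return line
    seqname, sep, rest = line.partition("\t")
    return normalize_name(seqname) + sep + rest


def rewrite(path_in, path_out, fix_line):
    """Write a copy of path_in to path_out with fix_line applied to every line."""
    with open(path_in) as src, open(path_out, "w") as dst:
        for line in src:
            dst.write(fix_line(line) + "\n")


# In-memory check with the header from this thread.
print(normalize_fasta_line(">NC_003070.9 Arabidopsis thaliana chromosome 1 sequence"))
```

Calling `rewrite` once per file, with `normalize_fasta_line` for the genome and `normalize_gtf_line` for the seeds, writes renamed copies and leaves the originals untouched; the output paths are the caller's choice.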

Best, Tomas

qussai96 commented 2 years ago

Thank you, Tomas! I renamed the contigs and everything is working perfectly now.

cheers, Qussai