Error in jackhmmer when using .gz compressed file as target database

zhangnan0107 commented 3 years ago

Hi all,

I got an error when running jackhmmer by using a compressed file as <seqdb>, which should be fine according to the user guide. However, it just didn't work and threw an error. I also tried a small uncompressed file with seqs extracted from the original .gz file and that worked. So it seems to be related to the format. Would you please help me with this problem? By the way, gunzip is installed in my system.

Command line: jackhmmer -o result/test.output -A result/test.alignment --tblout result/test.tblout --domtblout result/test.domtblout --notextw test.fa uniref100.fasta.gz

Error:

Error: Target sequence file uniref100.fasta.gz isn't rewindable; jackhmmer requires that it is

test.fa:

>1433G_HUMAN MVDREQLVQKARLAEQAERYDDMAAAMKNVTELNEPLSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTSADGNEKKIEMVRAYREKIEKELEAVCQDVLSLLDNYLIKNCSETQYESKVFYLKMKGDYYRYLAEVATGEKRATVVESSEKAYSEAHEISKEHMQPTHPIRLGLALNYSVFYYEIQNAPEQACHLAKTAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDDDGGEGNN

head of uniref100.fasta.gz:

>UniRef100_Q6GZX4 Putative transcription factor 001R n=2 Tax=Frog virus 3 TaxID=10493 RepID=001R_FRG3G MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPS EKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLD AKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHL EKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDD SFRKIYTDLGWKFTPL >UniRef100_Q6GZX3 Uncharacterized protein 002L n=3 Tax=Frog virus 3 TaxID=10493 RepID=002L_FRG3G MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASGFCTSQPLCAR IKKTQVCGLRYSSKGKDPLVSAEWDSRGAPYVRCTYDADLIDTQAQVDQFVSMFGESPSL AERYCMRGVKNTAGELVSRVSSDADPAGGWCRKWYSAHRGPDQDAALGSFCIKNPGAADC KCINRASDPVYQKVKTLHAYPDQCWYVPCAADVGELKMGTQRDTPTNCPTQVCQIVFNML DDGSVTMDDVKNTINCDFSKYVPPPPPPKPTPPTPPTPPTPPTPPTPPTPPTPRPVHNRK VMFFVAGAVLVAILISTVRW

cryptogenomicon commented 3 years ago

Apologies, the user guide is wrong then. jackhmmer doesn't accept standard input or .gz files as the database. It needs to rewind the database file back to the beginning, to be able to iterate the search. Standard input can't be rewound. And the simple way we treat .gz files is to treat them as standard input streams, by opening a pipe from an external decompression program (specifically, gzip -dc). It'd be possible to just close and reopen the .gz file, but our code doesn't do that currently; we just treat both stdin and .gz as "nonrewindable", and programs that require rewindable files will report an error like the one you got.

In commit 105b959f8, I've just made two changes to the documentation. The jackhmmer man page now includes:

The
.I seqdb 
needs to be a 'normal' sequence file. It cannot be read from a stdin stream, because
.B jackhmmer
needs to do multiple passes over the database. It cannot be a
compressed (gzipped) file either, because we treat gzipped files
essentially as stdin streams, calling an external decompression
program.

The userguide now includes, in the .gz compressed files section in formats.tex:

A significant exception is \mono{jackhmmer}, which needs to do
multiple iterations over the same input sequence database. Because it
needs to rewind and read the same target database more than once,
\mono{jackhmmer} won't accept a standard input stream as the database
argument (unlike most other programs); and because we treat compressed
files as standard input streams, by opening a pipe from an external
decompression program, \mono{jackhmmer} won't accept a compressed seq
database file either. If a program can't accept stdin or a .gz file,
it'll tell you, by failing with an informative error message.

If you see another place in the documentation that should be improved on this, let me know. Thanks!

zhangnan0107 commented 3 years ago

I see, thanks a lot for the information and the update on docs. If you don't mind, can I ask will the compressed format be considered in jackhmmer in future releases?

The user guide is quite informative and easy-to-read, I found two places in jackhmmer's Options part, which I guess might just be writing errors? But this is not a big problem.

Thanks!

cryptogenomicon commented 3 years ago

Thanks, yes - those are both typos. First should say --domE, second should say --incdomE. I just checked in a fix.

I'll try to add support for compressed format reading in jackhmmer but honestly, it won't be a high priority. The performance hit for uncompressing the data is large. You want to avoid running searches on compressed files if you can. The capability would really just be there for an occasional convenience, not something you'd want in routine use.

zhangnan0107 commented 3 years ago

Yes, you are right, that is reasonable. I'll try with uncompressed files or use phmmer, hmmbuild and hmmsearch instead. Thanks very much for the reply!

EddyRivasLab / hmmer

Error in jackhmmer when using .gz compressed file as target database #241