ebi-pf-team / interproscan

Genome-scale protein function classification
Apache License 2.0

stepPantherRunHmmer3: File format problem in trying to open HMM file #363

Closed EmilieSmeets22 closed 6 months ago

EmilieSmeets22 commented 6 months ago

Hi,

I get an error when running InterProScan: it fails at the step "stepPantherRunHmmer3", and the problem seems to be with the HMM file.

[doutree@plop] input $ ml InterProScan
[doutree@plop] input $ interproscan.sh -cpu 20 -i Daucus_carota.gene_chr_AGAT_chr01_proteins.fasta -f TSV -dp --iprlookup --goterms -t p -dra -appl TIGRFAM,FunFam,SFLD,PANTHER,Gene3D,Hamap,Coils,SMART,CDD,PRINTS,PIRSR,AntiFam,Pfam -b ../output/Daucus_carota.gene_chr_prot.fasta_interpro_updateParam
2024-04-29 10:52:12,186 [amqEmbeddedWorkerJmsContainer-5] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:222] ERROR - StepExecution with errors - stepName: stepPantherRunHmmer3
2024-04-29 10:52:12,269 [main] [uk.ac.ebi.interpro.scan.jms.master.StandaloneBlackBoxMaster:190] WARN - StepInstance 117 is being re-run following a failure.
2024-04-29 10:52:12,662 [amqEmbeddedWorkerJmsContainer-10] [uk.ac.ebi.interpro.scan.management.model.implementations.RunBinaryStep:199] ERROR - Command line failed with exit code: 1
Command: bin/hmmer/hmmer3/3.3/hmmsearch -Z 65000000 -E 0.001 --domE 0.00000001 --incdomE 0.00000001 --notextw --cpu 1 -o /data/run/Tools/InterProScan/5.67-99.0/tmp/plop.eu.seeds.basf.net_20240429_105141500_hjfn//jobPanther/000000002001_000000002500.raw.out --domtblout /data/run/Tools/InterProScan/5.67-99.0/tmp/plop.eu.seeds.basf.net_20240429_105141500_hjfn//jobPanther/000000002001_000000002500.raw.domtblout.out /data/prod/Tools/InterProScan/5.67-99.0/data/panther/18.0/famhmm/binHmm /data/run/Tools/InterProScan/5.67-99.0/tmp/plop.eu.seeds.basf.net_20240429_105141500_hjfn//jobPanther/000000002001_000000002500.fasta 
Error output from binary:

Error: File format problem in trying to open HMM file /data/prod/Tools/InterProScan/5.67-99.0/data/panther/18.0/famhmm/binHmm.
Opened /data/prod/Tools/InterProScan/5.67-99.0/data/panther/18.0/famhmm/binHmm.h3m, a pressed HMM file; but forma

2024-04-29 10:52:12,662 [amqEmbeddedWorkerJmsContainer-10] [uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:216] ERROR - Execution thrown when attempting to executeInTransaction the StepExecution.  All database activity rolled back.
java.lang.IllegalStateException: Command line failed with exit code: 1
Command: bin/hmmer/hmmer3/3.3/hmmsearch -Z 65000000 -E 0.001 --domE 0.00000001 --incdomE 0.00000001 --notextw --cpu 1 -o /data/run/Tools/InterProScan/5.67-99.0/tmp/plop.eu.seeds.basf.net_20240429_105141500_hjfn//jobPanther/000000002001_000000002500.raw.out --domtblout /data/run/Tools/InterProScan/5.67-99.0/tmp/plop.eu.seeds.basf.net_20240429_105141500_hjfn//jobPanther/000000002001_000000002500.raw.domtblout.out /data/prod/Tools/InterProScan/5.67-99.0/data/panther/18.0/famhmm/binHmm /data/run/Tools/InterProScan/5.67-99.0/tmp/plop.eu.seeds.basf.net_20240429_105141500_hjfn//jobPanther/000000002001_000000002500.fasta 
Error output from binary:

Error: File format problem in trying to open HMM file /data/prod/Tools/InterProScan/5.67-99.0/data/panther/18.0/famhmm/binHmm.
Opened /data/prod/Tools/InterProScan/5.67-99.0/data/panther/18.0/famhmm/binHmm.h3m, a pressed HMM file; but forma
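When a run produces a long log, it can be useful to see which steps failed and how often. The following is a minimal sketch that assumes the log lines keep the "StepExecution with errors - stepName: ..." wording shown above; the function name failing_steps is hypothetical and not part of InterProScan.

```python
import re
from collections import Counter

# Matches the ERROR line format seen in the log above (assumed to be stable).
STEP_ERROR = re.compile(r"ERROR - StepExecution with errors - stepName: (\S+)")


def failing_steps(log_text):
    """Count how many times each step name is reported as failed."""
    return Counter(STEP_ERROR.findall(log_text))


if __name__ == "__main__":
    sample = (
        "2024-04-29 10:52:12,186 [amqEmbeddedWorkerJmsContainer-5] "
        "[uk.ac.ebi.interpro.scan.jms.worker.LocalJobQueueListener:222] "
        "ERROR - StepExecution with errors - stepName: stepPantherRunHmmer3\n"
    )
    for step, count in failing_steps(sample).items():
        print(step, count)
```

If only one step name shows up, as here, the problem is likely limited to the data or binaries of that one analysis rather than the whole installation.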

Installation:
We download the following tarballs from http://ftp.ebi.ac.uk/pub/software/unix/iprscan:
interproscan-core-5.67-99.0.tar.gz
interproscan-data-5.67-99.0.tar.gz
We unpack interproscan-core-5.67-99.0.tar.gz to $INSTALLDIR.
We unpack interproscan-data-5.67-99.0.tar.gz to another location: /data/prod/Tools/InterProScan/5.67-99.0/data

Then we run the following commands:

sed -i "s@EASEL_DIR=@EASEL_DIR=$INSTALLDIRHMMER_interproscan/3.1b2/easel@" $INSTALLDIR/src/sfld/1.1/Makefile
cd $INSTALLDIR/src/sfld/1.1/ && make
cp -f $INSTALLDIR/src/sfld/1.1/sfld_postprocess $INSTALLDIR/bin/sfld/
cp -f $INSTALLDIR/src/sfld/1.1/sfld_preprocess.py $INSTALLDIR/bin/sfld/

We then edit the following line in $INSTALLDIR/interproscan.properties so that it points to the data unpacked from interproscan-data-5.67-99.0.tar.gz:
data.directory=/data/prod/Tools/InterProScan/5.67-99.0/data

I thought the problem could be related to the Known Issue:

HMMER errors: The HMM libraries provided by some member databases (SUPERFAMILY and SFLD) are not compatible with newer HMMER versions and an error will occur when those libraries are being indexed by hmmpress version greater than ‘3.1b1’. To avoid this issue we recommend using the HMMER binaries bundled with interproscan.
Source: https://interproscan-docs.readthedocs.io/en/latest/KnownIssues.html

However, our InterProScan module already uses the bundled HMMER 3.3 binaries that ship with the InterProScan installation:
binary.hmmer3.path=${bin.directory}/hmmer/hmmer3/3.3
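To double-check which paths these settings resolve to, the properties file can be read and its ${...} placeholders expanded. This is a rough sketch under a few assumptions: the file uses plain key=value lines, and bin.directory=bin is inferred from the relative command path in the log rather than quoted from the real file. The helpers parse_properties and resolve are hypothetical names, not InterProScan code.

```python
import re

_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def resolve(props, key, max_depth=10):
    """Expand ${other.key} references in props[key]; unknown keys are left as-is."""
    value = props[key]
    for _ in range(max_depth):
        expanded = _PLACEHOLDER.sub(
            lambda m: props.get(m.group(1), m.group(0)), value
        )
        if expanded == value:
            break
        value = expanded
    return value


# Illustrative content only; bin.directory=bin is an assumption.
example = """\
bin.directory=bin
binary.hmmer3.path=${bin.directory}/hmmer/hmmer3/3.3
data.directory=/data/prod/Tools/InterProScan/5.67-99.0/data
"""
props = parse_properties(example)
print(resolve(props, "binary.hmmer3.path"))
print(resolve(props, "data.directory"))
```

On a real installation, the text would come from reading $INSTALLDIR/interproscan.properties instead of the inline example string.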

Any tips on how to solve this issue? Thanks already!

tgrego commented 6 months ago

It looks like something is wrong with the pressed HMM files. Please try running python3 setup.py -f interproscan.properties to recreate the pressed files, then give it another try.
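As a quick sanity check around that step, one can confirm that the pressed index files sit next to the HMM library. hmmpress writes four companion files with the suffixes .h3m, .h3i, .h3f and .h3p. The sketch below only tests that they exist and are non-empty; it does not validate their format, and the function name missing_pressed_files is hypothetical.

```python
from pathlib import Path

# Suffixes of the four binary files that hmmpress writes next to an HMM library.
PRESSED_SUFFIXES = (".h3m", ".h3i", ".h3f", ".h3p")


def missing_pressed_files(hmm_path):
    """Return names of pressed-index files that are absent or empty.

    Hypothetical helper: checks presence and size only, not file format.
    """
    missing = []
    for suffix in PRESSED_SUFFIXES:
        candidate = Path(str(hmm_path) + suffix)  # e.g. binHmm -> binHmm.h3m
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(candidate.name)
    return missing


if __name__ == "__main__":
    # Path taken from the error message in this issue; adjust for your install.
    hmm = "/data/prod/Tools/InterProScan/5.67-99.0/data/panther/18.0/famhmm/binHmm"
    print(missing_pressed_files(hmm))
```

An empty result does not prove the files are usable: in the report above the .h3m file existed but could not be read, so regenerating the pressed files with setup.py is still the actual fix.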

EmilieSmeets22 commented 6 months ago

It worked, thank you!

[doutree@plop] input $ pwd
/data/run/Projects/Vegetables/XXX/GFFtidy/input
[doutree@plop] input $ ml InterProScan
[doutree@plop] input $ interproscan.sh \
 -cpu 20 \
 -i Daucus_carota.gene_chr_AGAT_chr01_proteins.fasta \
 -f TSV \
 -dp \
 --iprlookup --goterms -t p -dra -appl TIGRFAM,FunFam,SFLD,PANTHER,Gene3D,Hamap,Coils,SMART,CDD,PRINTS,PIRSR,AntiFam,Pfam \
13/05/2024 10:12:04:119 Welcome to InterProScan-5.67-99.0
13/05/2024 10:12:04:120 Running InterProScan v5 in STANDALONE mode... on Linux
13/05/2024 10:12:07:966 RunID: plop.eu.seeds.basf.net_20240513_101207734_3fwj
13/05/2024 10:12:15:998 Loading file /data/run/Projects/Vegetables/XXX/GFFtidy/input/Daucus_carota.gene_chr_AGAT_chr01_proteins.fasta
13/05/2024 10:12:15:999 Running the following analyses:
[AntiFam-7.0,CDD-3.20,Coils-2.2.1,FunFam-4.3.0,Gene3D-4.3.0,Hamap-2023_05,NCBIfam-14.0,PANTHER-18.0,Pfam-36.0,PIRSR-2023_05,PRINTS-42.0,SFLD-4,SMART-9.0]
Pre-calculated match lookup service DISABLED.  Please wait for match calculations to complete...
13/05/2024 10:17:21:220 25% completed
13/05/2024 10:28:33:534 50% completed
13/05/2024 10:42:45:497 75% completed
13/05/2024 10:55:56:260 90% completed
13/05/2024 11:55:56:340 95% completed
13/05/2024 12:17:55:732 100% done:  InterProScan analyses completed
[doutree@plop] input $ head Daucus_carota.gene_chr_AGAT_chr01_proteins.fasta.tsv
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     Gene3D  G3DSA:3.40.50.300       -       360     561     1.9E-20 T       13-05-2024      IPR027417       P-loop containing nucleoside triphosphate hydrolase -       -
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     FunFam  G3DSA:1.10.8.60:FF:000077       Peroxisome biogenesis protein 6 832     918     2.2E-43 T       13-05-2024      -       -       -  -
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     FunFam  G3DSA:3.40.50.300:FF:000109     Peroxisomal biogenesis factor 6 656     846     3.5E-100        T       13-05-2024      -       -  --
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     Gene3D  G3DSA:3.40.50.300       -       656     933     5.2E-99 T       13-05-2024      IPR027417       P-loop containing nucleoside triphosphate hydrolase -       -
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     Pfam    PF00004 ATPase family associated with various cellular activities (AAA) 389     539     4.3E-9  T       13-05-2024      IPR003959  ATPase, AAA-type, core   GO:0005524(InterPro)|GO:0016887(InterPro)       -
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     Pfam    PF00004 ATPase family associated with various cellular activities (AAA) 698     830     1.4E-41 T       13-05-2024      IPR003959  ATPase, AAA-type, core   GO:0005524(InterPro)|GO:0016887(InterPro)       -
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     PANTHER PTHR23077       AAA-FAMILY ATPASE       48      938     1.7E-204        T       13-05-2024      IPR050168       AAA ATPase domain-containing protein        GO:0005778(PANTHER)|GO:0005829(PANTHER)|GO:0016558(PANTHER)|GO:0016887(PANTHER) -
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     CDD     cd19527 RecA-like_PEX6_r2       670     829     6.16661E-102    T       13-05-2024      IPR047533       Peroxisomal biogenesis factor 6, second ATPase domain       -       -
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     FunFam  G3DSA:3.40.50.300:FF:001716     Peroxisome biogenesis protein 6 358     541     1.2E-75 T       13-05-2024      -       -       -  -
DcarChr1G00028240.1     a69948650c6412904753c3e250294475        945     Gene3D  G3DSA:1.10.8.60 -       832     918     5.2E-99 T       13-05-2024      -       -       -       -
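To confirm that PANTHER hits are now being reported, the TSV output can be summarised per analysis. The sketch below assumes the standard tab-separated InterProScan 5 column order with 15 columns when --goterms and pathway annotations are enabled; the column labels and the function names read_hits and hits_per_analysis are illustrative choices, and the two sample rows are copied from the output above.

```python
import csv
import io
from collections import Counter

# Assumed column order of the InterProScan 5 TSV output (labels are illustrative).
COLUMNS = [
    "protein_accession", "md5", "length", "analysis", "signature_accession",
    "signature_description", "start", "stop", "score", "status", "date",
    "interpro_accession", "interpro_description", "go_terms", "pathways",
]


def read_hits(handle):
    """Yield one dict per TSV row, padding missing trailing columns with '-'."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        row = row + ["-"] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))


def hits_per_analysis(handle):
    """Count hits for each member database (the 'analysis' column)."""
    return Counter(hit["analysis"] for hit in read_hits(handle))


# Two rows from the output above, rebuilt with explicit tab separators.
SAMPLE_ROWS = [
    ["DcarChr1G00028240.1", "a69948650c6412904753c3e250294475", "945",
     "PANTHER", "PTHR23077", "AAA-FAMILY ATPASE", "48", "938", "1.7E-204",
     "T", "13-05-2024", "IPR050168", "AAA ATPase domain-containing protein",
     "GO:0005778(PANTHER)|GO:0005829(PANTHER)|GO:0016558(PANTHER)|GO:0016887(PANTHER)",
     "-"],
    ["DcarChr1G00028240.1", "a69948650c6412904753c3e250294475", "945",
     "Gene3D", "G3DSA:1.10.8.60", "-", "832", "918", "5.2E-99",
     "T", "13-05-2024", "-", "-", "-", "-"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n"

if __name__ == "__main__":
    for analysis, count in sorted(hits_per_analysis(io.StringIO(SAMPLE_TSV)).items()):
        print(analysis, count)
```

For a real run, an open file handle on the .tsv output would be passed in place of the in-memory sample.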