UCLOrengoGroup / cath-tools

Protein structure comparison tools such as SSAP and SNAP
http://cath-tools.readthedocs.io
GNU General Public License v3.0

Parsing error when using hmmscan results file #74

Closed mrztm closed 3 years ago

mrztm commented 3 years ago

Hi,

I am trying to analyse the attached file using cath-resolve-hits. It is a result from hmmscan version 3.3.2 (without the alignments). Invoking the command

cath-resolve-hits --input-format hmmscan_out hmmscan_res.txt

results in the following error message:

2021-03-09 21:15:21.306957 [cath-resolve-hits|error ] Unable to parse/process resolve-hits input data file "HMMER_Full.txt" of format hmmscan_out. Error was: Failed to parse a number (of type unsigned int) from

Could it be that cath-resolve-hits is incompatible with hmmscan version 3.3.2?

Many thanks for your help,
Thomas

hmmscan_res.txt

tonyelewis commented 3 years ago

Thank you for using cath-resolve-hits and for reporting this issue.

From an initial look, I think it's failing because it's expecting to see alignments in the output. Note that the header of that file includes # show alignments in output: no. Please can you try regenerating the hmmscan output with alignments included and report back? Thanks.

I will look to make the error message more helpful in that situation.
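A quick pre-flight check for the situation described above could be sketched as follows. This is only an illustration and not part of cath-tools: the function name is hypothetical, and it assumes the hmmscan header line reads as quoted above (# show alignments in output: no).

```python
import io


def alignments_disabled(lines):
    """Return True if an hmmscan header reports that alignments were switched off."""
    for line in lines:
        if not line.startswith("#"):
            if line.strip():
                break  # first non-blank, non-header line: the results have begun
            continue
        # Header lines look like "# key:   value"; split on the first colon
        key, _, value = line.lstrip("# ").partition(":")
        if key.strip() == "show alignments in output":
            return value.strip() == "no"
    return False


# Hypothetical header excerpt, modelled on the line quoted in this thread
sample = io.StringIO(
    "# query sequence file:             CLAU_0004.faa\n"
    "# show alignments in output:       no\n"
    "\n"
    "Query:       example  [L=324]\n"
)
print(alignments_disabled(sample))
```

Running a check like this before cath-resolve-hits would flag files that need regenerating with alignments included.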

mrztm commented 3 years ago

Hi Tony,

I have generated the result file including the alignments, see attachment. Unfortunately, it still causes an error.

cath-resolve-hits --input-format hmmscan_out --summarise-to-file test_summary.txt hmmscan_wali.txt

2021-03-10 16:12:33.145799 [cath-resolve-hits|error ] Unable to parse/process resolve-hits input data file "test3.txt" of format hmmscan_out. Error was: Failed to parse a number (of type unsigned int) from CS

I am going to try older versions of hmmscan. I hope that helps to narrow down the issue.

Many thanks for the quick response,
Thomas

Dr Thomas Millat
Senior Research Fellow
Nottingham BBSRC/EPSRC Synthetic Biology Research Centre (SBRC)
Room B18, University of Nottingham Biodiscovery Institute
The University of Nottingham, University Park, Nottingham NG7 2RD

t: +44(0)115 95 16827 e: thomas.millat@nottingham.ac.uk w: www.sbrc-nottingham.ac.uk @SbrcNottingham


# hmmscan :: search sequence(s) against a profile database
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query sequence file:             CLAU_0004.faa
# target HMM database:             /store/HMM_data/PFAM/Pfam-A
# per-seq hits tabular output:     HMMER_Table.txt
# per-dom hits tabular output:     HMMER_Domain.txt
# pfam-style tabular hit output:   HMMER_Pfam.txt
# profile reporting threshold:     E-value <= 0.01
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       lcl|CLAU_CLAU_0004  [L=324]
Description: [product=Glyoxylate reductase] [location=1365..2339] [locus_tag=CLAU_0004] [protein_id=ALU34433.1]
Scores for complete sequence (score includes all domains):
--- full sequence ---   --- best 1 domain ---   -#dom-
E-value  score  bias    E-value  score  bias    exp  N  Model          Description
6.7e-59  198.2   0.0    9.1e-59  197.8   0.0    1.2  1  2-Hacid_dh_C   D-isomer specific 2-hydroxyacid dehydrogenase,
1.3e-25   89.6   0.1    1.6e-25   89.3   0.1    1.1  1  2-Hacid_dh     D-isomer specific 2-hydroxyacid dehydrogenase,
3.1e-06   26.9   1.0    1.1e-05   25.1   0.3    2.0  2  IlvN           Acetohydroxy acid isomeroreductase, NADPH-bind
4.6e-06   27.0   0.4      1e-05   25.8   0.1    1.7  2  NAD_binding_2  NAD binding domain of 6-phosphogluconate dehyd
9.6e-06   25.8   0.2    2.2e-05   24.7   0.2    1.6  1  AdoHcyase_NAD  S-adenosyl-L-homocysteine hydrolase, NAD bindi
6.4e-05   23.6   0.1    0.00017   22.2   0.1    1.7  1  F420_oxidored  NADP oxidoreductase coenzyme F420-dependent
0.00053   20.1   0.1     0.0014   18.7   0.1    1.8  1  Shikimate_DH   Shikimate / quinate 5-dehydrogenase

Domain annotation for each model (and alignments):

>> 2-Hacid_dh_C  D-isomer specific 2-hydroxyacid dehydrogenase, NAD binding domain
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
   1 !  197.8   0.0   3.5e-62   9.1e-59       2     178 .]     113     292 ..     112     292 .. 0.94

Alignments for each domain:
== domain 1 score: 197.8 bits; conditional E-value: 3.5e-62
                       HHHHHHHTTHHHHHHHHHTTCC.....HTTCCGSBS-GTTSEEEEES-SHHHHHHHHHHHHTT-EEEEESSSHHHHHCHHHHTEEECSHHH CS
      2-Hacid_dh_C   2 alllallrrlaeadeevregew.....ssekallgkelsgktvGiiGlGrIGqavakrlkafgmkviaydrskkeeeeeeelgveyvslee 87
                       +lll+l+ +++ ++++v+eg+w + + + el kt GiiG+G+IGqa+ k+++a+gmkv+ay+r+k++ e+++v+y++l++
lcl|CLAU_CLAU_0004 113 SLLLELTNHVSIHNKAVKEGQWneigeWCFWKKPLMELGSKTAGIIGYGKIGQATSKIVQAMGMKVLAYNRHKNKV--LESENVKYAELND 201
                       799*****76554334466777****8854..677789* PP

                       HHHH-SEEEE-S--STTTTTSBSHHHHHCSTTTEEEEESS-GGGB-HHHHHHHHHTTSCCEEEES--SSSSSGTCHHHHHSTTEEE-SS-T CS
      2-Hacid_dh_C  88 llaesDivslhlpltketrhlinaeelakmkkgavliNtaRGglvdeeaLleaLksgkiagaalDvfeeeplpedspllelpnviltPHia 178
                       ++++sD+++lh+plt+et+++in++++++mk+g++++N aRG+l+ ee+L+ aL+ gk+ gaalDv++ ep++++spll+++n+i+tPHi+
lcl|CLAU_CLAU_0004 202 VFEKSDVIFLHCPLTEETKGIINSKSIEHMKDGVIIVNNARGPLIVEEDLAHALNIGKVYGAALDVTSREPIEKESPLLKAENCIITPHIS 292
                       *****96 PP

>> 2-Hacid_dh  D-isomer specific 2-hydroxyacid dehydrogenase, catalytic domain
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
   1 !   89.3   0.1   6.1e-29   1.6e-25      15     132 ..      21     321 ..       4     323 .. 0.94

Alignments for each domain:
== domain 1 score: 89.3 bits; conditional E-value: 6.1e-29
                       HHHTTE.EEEEES.....SSSHTCHHHGGTTESEEEE-TTS-BSHHHHHHHTT--EEEESSSSCTTB-HHHHHHTT-EEEE-TTTTHHHHH CS
                       xxxxxx.xxxxxx.....xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx RF
        2-Hacid_dh  15 lkekgv.evevkd.....ellteelaekakdadalivrsntkvtaevLeklpkLkviaragvGvDnvDldaakerGilVtnvpgystesvA 99
                       l++ + +++v+d ++ + ++e+++da++++++++ ++++v++ +pkLk+ + +++G++ vD++a k+ G++Vtn+p+yst++vA
lcl|CLAU_CLAU_0004  21 LEK--LgDLTVYDktifdNSNDDLIIERIRDAEVVFTNKTP-ISENVFKSCPKLKYLGVFATGYNVVDIKASKKFGVVVTNIPSYSTDAVA 108
                       455..335555555555377788999*9998.*** PP

                       HHH........................................................................................ CS
                       xxx........................................................................................ RF
        2-Hacid_dh 100 Elt........................................................................................ 102
                       +++
lcl|CLAU_CLAU_0004 109 QMAvsllleltnhvsihnkavkegqwneigewcfwkkplmelgsktagiigygkigqatskivqamgmkvlaynrhknkvlesenvkyael 199
                       *** PP

                       ........................................................................................... CS
                       ........................................................................................... RF
        2-Hacid_dh 103 ........................................................................................... 102
lcl|CLAU_CLAU_0004 200 ndvfeksdviflhcplteetkgiinsksiehmkdgviivnnargpliveedlahalnigkvygaaldvtsrepiekespllkaenciitph 290
                       *** PP

                       ..T-BHHHHHHHHHHHHHHHHHHHTTCCGTTB CS
                       ..xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx RF
        2-Hacid_dh 103 ..faTeeaqeriaeeaaenllkalkgespana 132
                         +a +e++er+  +a+enl+++l+g +p n+
lcl|CLAU_CLAU_0004 291 isWAAKETRERLLNIAVENLKNFLEG-NPINV 321
                       **.55555 PP

>> IlvN  Acetohydroxy acid isomeroreductase, NADPH-binding domain
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
   1 ?    0.1   0.0       0.2   5.3e+02      47      80 ..      79     111 ..      43     127 .. 0.67
   2 !   25.1   0.3   4.3e-09   1.1e-05       2      92 ..     150     238 ..     149     260 .. 0.83

Alignments for each domain:
== domain 1 score: 0.1 bits; conditional E-value: 0.2
                       TT-EEEEHHHHHHT-SEEEE-S-HHHHHHHHHHH CS
              IlvN  47 egfevltvaeavkkadvvmiliPDelqkevyeee 80
                       +g++v++++ a+kk vv++ iP ++v + +
lcl|CLAU_CLAU_0004  79 TGYNVVDIK-ASKKFGVVVTNIPSYSTDAVAQMA 111
                       445555543.445555555555555555555555 PP

== domain 2 score: 25.1 bits; conditional E-value: 4.3e-09
                       HHTS-EEEES-SHHHHHHHHHHHHTT--EEEEE-TT-HHHHHHHHTT-EEEEHHHHHHT-SEEEE-S-.HHHHHHHHHHHTGGG--TT-EE CS
              IlvN   2 lkgkkiaviGyGsqGhaqalnlrdsgldvvvglregsksvkkAkeegfevltvaeavkkadvvmiliP.Delqkevyeeeiepnlkegkal 91
                       l +k+ +iGyG G+a ++ ++ g++v +r+++ k ++e+++ ++++++ +k+dv+++ P e k +++++ +++k+g ++
lcl|CLAU_CLAU_0004 150 LGSKTAGIIGYGKIGQATSKIVQAMGMKVLAYNRHKN---KVLESENVKYAELNDVFEKSDVIFLHCPlTEETKGIINSKSIEHMKDGVII 237
                       568999****98888877...56678999943556678888888899999865 PP

                       E CS
              IlvN  92 a 92
                       +
lcl|CLAU_CLAU_0004 238 V 238
                       4 PP

>> NAD_binding_2  NAD binding domain of 6-phosphogluconate dehydrogenase
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
   1 ?   -2.2   0.0       1.6   4.2e+03      11      45 ..      80     114 ..      73     126 .. 0.69
   2 !   25.8   0.1     4e-09     1e-05       3      89 ..     156     240 ..     154     256 .. 0.89

Alignments for each domain:
== domain 1 score: -2.2 bits; conditional E-value: 1.6
                       HHHHHHHHHHTT-EEEEE-SSHHHHHHHHHTTEEE CS
     NAD_binding_2  11 Gsnmarnllkagykvavydrtkekveelvaegaka 45
                       G n + +++++ v+v + + +++++++ +++
lcl|CLAU_CLAU_0004  80 GYNVVDIKASKKFGVVVTNIPSYSTDAVAQMAVSL 114
                       56666666778888888888877777777766544 PP

== domain 2 score: 25.8 bits; conditional E-value: 4e-09
                       EEEE-SHHHHHHHHHHHHTT-EEEEE-SSHHHHHHHHHTTEEEESSHHHHHHCBSEEEE-SSSHHHHHHHHHC.HCCC--TT-EEEE- CS
     NAD_binding_2   3 gfiGlGvMGsnmarnllkagykvavydrtkekveelvaegakaaesieelvasldvvilmvkagkavdevieg.llealekgdilidg 89
                       g+iG G+ G+ + +++ g+kv +y+r+k+kv +++ + ++++++++++ dv++l + +++++ +i++ +e+++ g i+++
lcl|CLAU_CLAU_0004 156 GIIGYGKIGQATSKIVQAMGMKVLAYNRHKNKVL---ESENVKYAELNDVFEKSDVIFLHCPLTEETKGIINSkSIEHMKDGVIIVNN 240
                       99***87764...45556778999****9999**999975 PP

>> AdoHcyase_NAD  S-adenosyl-L-homocysteine hydrolase, NAD binding domain
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
   1 !   24.7   0.2   8.3e-09   2.2e-05      21     108 ..     150     239 ..     146     260 .. 0.85

Alignments for each domain:
== domain 1 score: 24.7 bits; conditional E-value: 8.3e-09
                       -TTSEEEEE--SHHHHHHHHHHHHCT-EEEEE-S-HHHHHHHHCTT-EE--HHHCTTT-SEEE.E...-SSSSSSB-HHHHCCS-TTEEEE CS
     AdoHcyase_NAD  21 iaGkvavvaGyGdvGkGcaaslkglGarvivteidPinalqaameGfevvtleevvkkadifv.t...ttGnkdiitvehlkkmkedaivc 107
                       ++ k+a ++GyG +G+ +++ ++++G +v+ + l + e + ++l++v +k+d++ t +k ii+++ +++mk+ i+
lcl|CLAU_CLAU_0004 150 LGSKTAGIIGYGKIGQATSKIVQAMGMKVLAYNRHKNKVLES--ENVKYAELNDVFEKSDVIFlHcplTEETKGIINSKSIEHMKDGVIIV 238
                       6889***99998888876..778889**86523222456899**998 PP

                       E CS
     AdoHcyase_NAD 108 n 108
                       n
lcl|CLAU_CLAU_0004 239 N 239
                       8 PP

>> F420_oxidored  NADP oxidoreductase coenzyme F420-dependent
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
   1 !   22.2   0.1   6.4e-08   0.00017       3      73 ..     156     217 ..     154     242 .. 0.77

Alignments for each domain:
== domain 1 score: 22.2 bits; conditional E-value: 6.4e-08
                       EEETTSHHHHHHHHHHHHTTS.GGCEEEEEESSCCCHHHHHHHHC-EEECECHHHHHHH-SEEEE-S-HHH CS
     F420_oxidored   3 aiiGaGnmgealasgllaagaqpheivvansrnpekaeelaeelgvkvtavsneeaaeeadvvvlavkpea 73
                       +iiG+G++g+a + ++a g +++ a +r+++k+ e ++ + + ++++e++dv++l ++ ++
lcl|CLAU_CLAU_0004 156 GIIGYGKIGQATSKIVQAMG---MKVL-AYNRHKNKVLESENV-----KYAELNDVFEKSDVIFLHCPLTE 217
                       9***....989999988665433.....3345899****9854 PP

>> Shikimate_DH  Shikimate / quinate 5-dehydrogenase
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
   1 !   18.7   0.1   5.4e-07    0.0014       8      99 ..     148     232 ..     143     266 .. 0.83

Alignments for each domain:
== domain 1 score: 18.7 bits; conditional E-value: 5.4e-07
                       S..TT-EEEEES-SHHHHHHHHHHHCCT-CEEEEEESSHHHHHHHHHHHT.ET-EEEEGCGHHHHHHT-SEEEE-...SSSSS-SB-HHHH CS
      Shikimate_DH   8 eslkekkvlliGaGemaelvakhLlakgvkkvvvaNRtlerakelaeelkgeeiealkleelkellaeadvvisa...taseepilekeev 95
                       +l +k++ +iG G++++ + k a g+k v+ NR +++ e +e +k+ el+++ +++dv++ t ++ i++++++
lcl|CLAU_CLAU_0004 148 MELGSKTAGIIGYGKIGQATSKIVQAMGMK-VLAYNRHKNKVLE------SE---NVKYAELNDVFEKSDVIFLHcplTEETKGIINSKSI 228
                       578899*****9.999*9988754......22...45778888889**9743447788889999888 PP

                       HCCH CS
      Shikimate_DH  96 Eeal 99
                       e+++
lcl|CLAU_CLAU_0004 229 EHMK 232
                       8776 PP

Internal pipeline statistics summary:

Query sequence(s):                1  (324 residues searched)
Target model(s):              18259  (3090017 nodes)
Passed MSV filter:              626  (0.0342845); expected 365.2 (0.02)
Passed bias filter:             489  (0.0267813); expected 365.2 (0.02)
Passed Vit filter:               49  (0.00268361); expected 18.3 (0.001)
Passed Fwd filter:               10  (0.000547675); expected 0.2 (1e-05)
Initial search space (Z):     18259  [actual number of targets]
Domain search space (domZ):       7  [number of targets reported over threshold]
# CPU time: 0.50u 0.20s 00:00:00.70 Elapsed: 00:00:00.40
# Mc/sec: 2467.14
//
[ok]

tonyelewis commented 3 years ago

Thanks for your reply.

This appears to be an issue with the file containing 'CS' records (which look like they're probably secondary structure predictions) that cath-resolve-hits wasn't expecting.

I've just pushed a fix for that.
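For illustration only, here is a minimal Python sketch of how a reader of hmmscan alignment blocks might recognise annotation rows such as 'CS' instead of trying to parse a coordinate from them. It is not the fix that was pushed to cath-tools; the function name and the heuristic are assumptions based on the trailing CS, RF and PP tags visible in the file above.

```python
def classify_alignment_line(line):
    """Label one line of an hmmscan alignment block (heuristic sketch)."""
    tokens = line.split()
    if not tokens:
        return "blank"
    # Annotation rows end in a tag and carry no start/stop coordinates
    if tokens[-1] in ("CS", "RF", "PP"):
        return tokens[-1]
    # Model and query rows look like: name, start, aligned residues, stop
    if len(tokens) >= 4 and tokens[-3].isdigit() and tokens[-1].isdigit():
        return "sequence"
    # Anything else is treated as the match row between model and query
    return "match"


# The final alignment block of the IlvN hit from the file above
block = [
    "E CS",
    "IlvN  92 a 92",
    "+",
    "lcl|CLAU_CLAU_0004 238 V 238",
    "4 PP",
]
print([classify_alignment_line(line) for line in block])
```

A parser that only expects the model, match, query and PP rows would reach the CS row and try to read a number from it, which is consistent with the "Failed to parse a number ... from CS" message reported earlier.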

mrztm commented 3 years ago

Hi Tony,

Many thanks for the quick response. The new version runs smoothly over the file.

Kind regards,
Thomas


tonyelewis commented 3 years ago

That's good to hear. Thanks for taking the time to report.

tonyelewis commented 3 years ago

@mrztm In case it's of relevance to you, we've brought the build back up and you can now download the latest executables from: here.

(The Mac executables currently depend on a brew install of icu4c but we'll look to address that.)