biocore / microprot

structural annotation pipeline for microbial genomes and metagenomes
BSD 3-Clause "New" or "Revised" License
1 stars 6 forks source link

BUG: split_search.mask_sequence does not pick optimal fragments #51

Open tkosciol opened 7 years ago

tkosciol commented 7 years ago

example HHsearch output:

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 1W36_D EXODEOXYRIBONUCLEASE V   97.3 1.1E-06 2.1E-11   82.9   0.0   31  149-181   152-182 (608)
  2 2P65_A Hypothetical protein PF  97.3 1.5E-06 2.7E-11   63.7   0.0   21  162-182    42-62  (187)
  3 1W4R_A THYMIDINE KINASE (E.C.2  97.2 1.6E-06   3E-11   71.1   0.0   22  160-181    17-38  (195)
  4 3B85_B Phosphate starvation-in  97.2 1.8E-06 3.3E-11   63.6   0.0   20  163-182    22-41  (208)
  5 2ZPA_B Uncharacterized protein  97.2 2.3E-06 4.1E-11   87.2   0.0  138  162-343   191-341 (671)
  6 3E1S_A Exodeoxyribonuclease V,  97.1 2.8E-06 5.1E-11   80.4   0.0   31  150-182   193-223 (574)
  7 3GP8_A Exodeoxyribonuclease V,  97.1 2.8E-06 5.1E-11   80.4   0.0   31  150-182   193-223 (574)
  8 5FTF_A TPR DOMAIN PROTEIN (E.C  97.1 3.8E-06 6.9E-11   73.1   0.0   21  162-182    20-40  (433)
  9 3WHK_A Proteasome-activating n  97.1   4E-06 7.2E-11   64.0   0.0   23  161-183    49-71  (270)
 10 3WHK_H Proteasome-activating n  97.1   4E-06 7.2E-11   64.0   0.0   23  161-183    49-71  (270)
 11 5FTB_A TPR DOMAIN PROTEIN (E.C  97.1 4.2E-06 7.6E-11   72.8   0.0   21  162-182    20-40  (433)
 12 2B8T_D Thymidine kinase (E.C.2  97.0 4.6E-06 8.3E-11   70.8   0.0   23  159-181     8-30  (223)
 13 2B8T_A Thymidine kinase (E.C.2  97.0 4.6E-06 8.3E-11   70.8   0.0   23  159-181     8-30  (223)
 14 4UXH_B THYMIDINE KINASE (E.C.2  97.0   5E-06   9E-11   66.0   0.0   22  161-182     3-24  (184)
 15 4UXH_A THYMIDINE KINASE (E.C.2  97.0   5E-06   9E-11   66.0   0.0   22  161-182     3-24  (184)
 16 5FHH_B ATP-dependent DNA helic  97.0 5.2E-06 9.3E-11   73.6   0.0   19  163-181    22-40  (442)
 17 5FHH_A ATP-dependent DNA helic  97.0 5.2E-06 9.3E-11   73.6   0.0   19  163-181    22-40  (442)
 18 5FUV_A THYMDINE KINASE (E.C.2.  97.0 5.2E-06 9.4E-11   64.9   0.0   21  162-182     2-22  (182)
 19 5FUW_A THYMDINE KINASE (E.C.2.  97.0 5.2E-06 9.4E-11   64.9   0.0   21  162-182     2-22  (182)
 20 3VKW_A Replicase large subunit  96.9 7.7E-06 1.4E-10   77.3   0.0   95  161-262   159-275 (446)
 21 3UPU_A ATP-dependent DNA helic  96.9 8.4E-06 1.5E-10   73.4   0.0   46  118-181    18-63  (459)
 22 5EQT_A Proteasome-activating n  96.8 1.6E-05 2.9E-10   60.3   0.0   24  161-184    37-60  (257)
 23 5LD2_D RecBCD enzyme subunit R  96.8 1.7E-05 3.1E-10   76.5   0.0   31  149-181   153-183 (609)
 24 2J87_B THYMIDINE KINASE (E.C.2  96.7 1.7E-05 3.1E-10   62.0   0.0   21  162-182     3-23  (177)
 25 2J87_D THYMIDINE KINASE (E.C.2  96.7 1.7E-05 3.1E-10   62.0   0.0   21  162-182     3-23  (177)
 26 1JBK_A CLPB PROTEIN; Beta barr  96.7 1.9E-05 3.4E-10   58.2   0.0   22  161-182    41-62  (195)
 27 4N0N_A Replicase polyprotein 1  96.7   2E-05 3.6E-10   74.7   0.0   37  214-260   254-290 (423)

from this, mask_sequence picks:

>NZ_stefan.1_22_149-181 # PDB # 1 # 2017-07-01 18:20:24
>NZ_stefan.1_22_214-260 # PDB # 1 # 2017-07-01 18:20:24

with parameters:

 min_prob: 95
 min_fragment_length: 10

However, it should pick 2ZPA_B 162-343 (hit 5), as it is the longest fragment overlapping with shorter picks. It looks like HHsearch error, but it's something we could account for.

sjanssen2 commented 7 years ago

Hi Tomasz, could you send me the fill HHsearch output file to be able to run it through my script.

tkosciol commented 7 years ago

@sjanssen2 is that something you could look into again in the next week or 2?

sjanssen2 commented 7 years ago

As far as I remember the problem is on the conceptual side: we ourselves do not fully understand and thus are not able to define what we mean by "optimal" fragments to choose.