Closed jeanrjc closed 3 years ago
Could you provide data and conditions to reproduce the bug?
Hello @bneron, attached is a multifasta file containing both sequences above: KLPN_test.zip
running: integron_finder --cpu 4 --local-max KLPN_test.fst --promoter-attI
gives:
ID_integron ID_replicon element pos_beg pos_end strand evalue type_elt annotation model type default distance_2attC considered_topology
integron_01 KLPN.1018.02497.0043 KLPN.1018.03231.0064_1 1 240 -1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_1 2 352 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_6 5205 6218 -1 1.3e-25 protein intI intersection_tyr_intI complete No NA lin
integron_01 KLPN.1018.02497.0043 Pc_int1 6221 6247 1 NA Promoter Pc_1 NA complete No NA lin
integron_01 KLPN.1018.02497.0043 P_intI1 6239 6273 -1 NA Promoter Pint_1 NA complete No NA lin
integron_01 KLPN.1018.02497.0043 attI1 6298 6356 1 NA attI attI_1 NA complete No NA lin
integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_7 6363 6860 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.02497.0043 attc_001 6855 6944 1 5e-06 attC attC attc_4 complete No NA lin
integron_01 KLPN.1018.02497.0043 attc_002 7205 7264 1 1.8e-07 attC attC attc_4 complete No 261.0 lin
integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_8 7280 8059 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.02497.0043 attc_003 8061 8120 1 2.6000000000000003e-10 attC attC attc_4 complete No 797.0 lin
integron_02 KLPN.1018.02497.0043 KLPN.1018.03231.0064_3 1125 2138 -1 1.3e-25 protein intI intersection_tyr_intI In0 No NA lin
integron_01 KLPN.1018.03231.0064 attc_001 72 123 1 0.055 attC attC attc_4 complete No NA lin
integron_01 KLPN.1018.03231.0064 KLPN.1018.03231.0064_3 1125 2138 -1 1.3e-25 protein intI intersection_tyr_intI complete No NA lin
integron_01 KLPN.1018.03231.0064 Pc_int1 2141 2167 1 NA Promoter Pc_1 NA complete No NA lin
integron_01 KLPN.1018.03231.0064 P_intI1 2159 2193 -1 NA Promoter Pint_1 NA complete No NA lin
integron_01 KLPN.1018.03231.0064 attI1 2218 2276 1 NA attI attI_1 NA complete No NA lin
integron_01 KLPN.1018.03231.0064 KLPN.1018.03231.0064_4 2285 2767 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.03231.0064 attc_002 2762 2847 1 0.0027 attC attC attc_4 complete No 2639.0 lin
integron_01 KLPN.1018.03231.0064 KLPN.1018.03231.0064_5 2988 3254 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.03231.0064 KLPN.1018.02497.0043_6 5205 6218 -1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.03231.0064 KLPN.1018.02497.0043_7 6363 6860 1 NA protein protein NA complete No NA lin
integron_02 KLPN.1018.03231.0064 KLPN.1018.02497.0043_6 5205 6218 -1 1.3e-25 protein intI intersection_tyr_intI In0 No NA lin
whereas it should give:
ID_integron ID_replicon element pos_beg pos_end strand evalue type_elt annotation model type default distance_2attC considered_topology
integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_6 5205 6218 -1 1.3e-25 protein intI intersection_tyr_intI complete No NA lin
integron_01 KLPN.1018.02497.0043 Pc_int1 6221 6247 1 NA Promoter Pc_1 NA complete No NA lin
integron_01 KLPN.1018.02497.0043 P_intI1 6239 6273 -1 NA Promoter Pint_1 NA complete No NA lin
integron_01 KLPN.1018.02497.0043 attI1 6298 6356 1 NA attI attI_1 NA complete No NA lin
integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_7 6363 6860 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.02497.0043 attc_001 6855 6944 1 5e-06 attC attC attc_4 complete No NA lin
integron_01 KLPN.1018.02497.0043 attc_002 7205 7264 1 1.8e-07 attC attC attc_4 complete No 261.0 lin
integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_8 7280 8059 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.02497.0043 attc_003 8061 8120 1 2.6000000000000003e-10 attC attC attc_4 complete No 797.0 lin
integron_01 KLPN.1018.03231.0064 KLPN.1018.03231.0064_3 1125 2138 -1 1.3e-25 protein intI intersection_tyr_intI complete No NA lin
integron_01 KLPN.1018.03231.0064 Pc_int1 2141 2167 1 NA Promoter Pc_1 NA complete No NA lin
integron_01 KLPN.1018.03231.0064 P_intI1 2159 2193 -1 NA Promoter Pint_1 NA complete No NA lin
integron_01 KLPN.1018.03231.0064 attI1 2218 2276 1 NA attI attI_1 NA complete No NA lin
integron_01 KLPN.1018.03231.0064 KLPN.1018.03231.0064_4 2285 2767 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.03231.0064 attc_001 2762 2847 1 0.0027 attC attC attc_4 complete No 2639.0 lin
integron_01 KLPN.1018.03231.0064 KLPN.1018.03231.0064_5 2988 3254 1 NA protein protein NA complete No NA lin
The following lines were removed because of the topology issue explained above:
> integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_1 2 352 1 NA protein protein NA complete No NA lin
> integron_01 KLPN.1018.03231.0064 attc_001 72 123 1 0.055 attC attC attc_4 complete No NA lin
And these lines were removed because of another bug I told you about by mail, where proteins are not assigned to their own contig (the IDs in columns 2 and 3 differ):
> integron_01 KLPN.1018.02497.0043 KLPN.1018.03231.0064_1 1 240 -1 NA protein protein NA complete No NA lin
> integron_02 KLPN.1018.02497.0043 KLPN.1018.03231.0064_3 1125 2138 -1 1.3e-25 protein intI intersection_tyr_intI In0 No NA lin
> integron_01 KLPN.1018.03231.0064 KLPN.1018.02497.0043_6 5205 6218 -1 NA protein protein NA complete No NA lin
> integron_01 KLPN.1018.03231.0064 KLPN.1018.02497.0043_7 6363 6860 1 NA protein protein NA complete No NA lin
> integron_02 KLPN.1018.03231.0064 KLPN.1018.02497.0043_6 5205 6218 -1 1.3e-25 protein intI intersection_tyr_intI In0 No NA lin
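
To spot this second bug quickly in a report, here is a minimal sketch that flags protein rows whose element ID does not start with the replicon ID of the same row. It assumes whitespace-separated columns in the order shown in the header above, and that protein IDs are built as the replicon ID followed by `_<index>`, as they appear in this output. The function name `find_misassigned_proteins` is hypothetical and not part of integron_finder.

```python
def find_misassigned_proteins(report_text):
    """Return (ID_integron, ID_replicon, element) for suspicious protein rows.

    Assumption: a protein ID is '<ID_replicon>_<index>', as seen in the
    output pasted in this issue. Promoters, attI and attC rows are skipped
    because their names (Pc_int1, attc_001, ...) carry no replicon prefix.
    """
    flagged = []
    for line in report_text.strip().splitlines():
        # Tolerate rows pasted as markdown quotes ("> ...").
        fields = line.lstrip("> ").split()
        if len(fields) < 8 or fields[0] == "ID_integron":
            continue  # skip the header and anything that is not a data row
        replicon, element, type_elt = fields[1], fields[2], fields[7]
        if type_elt == "protein" and not element.startswith(replicon + "_"):
            flagged.append((fields[0], replicon, element))
    return flagged


sample = """
integron_01 KLPN.1018.02497.0043 KLPN.1018.03231.0064_1 1 240 -1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_1 2 352 1 NA protein protein NA complete No NA lin
integron_01 KLPN.1018.02497.0043 Pc_int1 6221 6247 1 NA Promoter Pc_1 NA complete No NA lin
"""

for row in find_misassigned_proteins(sample):
    print(row)
```

Only the first sample row is reported, since its protein ID belongs to the other contig.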
Topology not used to aggregate proteins, promoter and attI
Observation:
This is an integron on a ~8 kb contig. Clearly, the protein at the beginning should not be part of the integron: the linear topology is not taken into account. Consequences: some integrons appear bigger than they should when the tool is run on contigs. This should be changed somewhere in integron.py (L.575) so that `add_proteins()` takes the linear option into consideration. Same for `add_attI()` and `add_promoter()`.
impact: Moderate
Topology not used to aggregate attC sites before local max
Observation:
This is an integron on a ~3 kb contig. Clearly, the attC at the beginning should not be part of the integron. Consequences: some integrons appear bigger than they should in contig sequences, and they have a weird organization that is misleading. This should be changed in attc.py (L89 and L100) to take the linear option into consideration.
impact: Moderate
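
To illustrate why ignoring the topology pulls in elements from the other end of a contig, here is a minimal sketch of a topology-aware distance. It is not the code in integron.py or attc.py. The function names, the `"circ"` label, the replicon size of 8200 bp and the 4000 bp threshold are assumptions chosen for the example; only the positions 2 and 8120 come from the first report above.

```python
def distance(pos_a, pos_b, replicon_size, topology="lin"):
    """Distance in bp between two positions on a replicon.

    On a circular replicon the shorter way around may cross the origin;
    on a linear one (e.g. a contig) only the direct path exists.
    """
    direct = abs(pos_a - pos_b)
    if topology == "circ":
        return min(direct, replicon_size - direct)
    return direct


def is_neighbor(pos_a, pos_b, replicon_size, threshold, topology="lin"):
    """True if the two positions are close enough to be aggregated."""
    return distance(pos_a, pos_b, replicon_size, topology) <= threshold


# Hypothetical values: contig size and threshold are assumed.
size, threshold = 8200, 4000
protein_start, last_attc_end = 2, 8120

# Treated as circular, the protein looks adjacent to the last attC site.
print(is_neighbor(protein_start, last_attc_end, size, threshold, "circ"))
# Treated as linear, it is far beyond the threshold.
print(is_neighbor(protein_start, last_attc_end, size, threshold, "lin"))
```

Under these assumptions, the wrap-around distance is 82 bp while the linear distance is 8118 bp, which matches the kind of spurious aggregation described in both observations.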