gem-pasteur / Integron_Finder

Bioinformatics tool to find integrons in bacterial genomes
GNU General Public License v3.0

Topology issues #70

Closed jeanrjc closed 3 years ago

jeanrjc commented 4 years ago

Topology not used to aggregate proteins, promoter and attI

Observation:

This is an integron on a ~8kb contig. Clearly the protein at the beginning shouldn't be part of the integron: the linear topology is not taken into account. Consequences: Some integrons appear to be bigger than they should when running on contigs. This should be changed somewhere in integron.py, L.575, so that add_proteins() takes the linear option into consideration. The same applies to add_attI() and add_promoter().

impact: Moderate

Topology not used to aggregate attC sites before local max

Observation:

This is an integron on a ~3kb contig. Clearly the attC at the beginning shouldn't be part of the integron. Consequences: Some integrons appear to be bigger than they should in contig sequences, and they have a weird organization that is misleading. This should be changed in attc.py, L89 and L100, to take the linear option into consideration.

impact: Moderate
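Both observations come down to how the distance between two elements is measured. The sketch below is not Integron_Finder's code; it only illustrates the idea, assuming a hypothetical `distance()` helper, a made-up contig size of 8200 bp (the report only says "~8kb") and an assumed aggregation threshold of 4000 bp. On a circular replicon the shortest path may cross the origin, so an element at the very start can look close to one at the very end; on a linear contig it must not.

```python
def distance(pos_a, pos_b, replicon_size, topology="lin"):
    """Distance in bp between two coordinates on a replicon.

    topology: "lin" for a linear contig, "circ" for a circular replicon.
    This helper is hypothetical and only illustrates the issue.
    """
    if topology not in ("lin", "circ"):
        raise ValueError(f"unknown topology: {topology!r}")
    direct = abs(pos_b - pos_a)
    if topology == "circ":
        # On a circle the shorter path may wrap around the origin.
        return min(direct, replicon_size - direct)
    return direct


def within_reach(pos_a, pos_b, replicon_size, topology, threshold):
    """True if the two coordinates are close enough to be aggregated."""
    return distance(pos_a, pos_b, replicon_size, topology) <= threshold


# Assumed values: the contig size and the threshold are NOT taken from the tool.
CONTIG_SIZE = 8200
THRESHOLD = 4000
protein_end = 352        # end of the protein at the start of the contig
last_attc_begin = 8061   # start of the last attC site in the report

print(within_reach(protein_end, last_attc_begin, CONTIG_SIZE, "circ", THRESHOLD))  # True
print(within_reach(protein_end, last_attc_begin, CONTIG_SIZE, "lin", THRESHOLD))   # False
```

With the circular assumption the protein at the start is only 491 bp away from the last attC site, so it gets pulled into the integron; with the linear topology the distance is 7709 bp and it is left out.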

bneron commented 4 years ago

Could you provide the data and conditions needed to reproduce the bug?

jeanrjc commented 4 years ago

Hello @bneron, you can find attached a multifasta file with both sequences above: KLPN_test.zip

running: integron_finder --cpu 4 --local-max KLPN_test.fst --promoter-attI gives:

ID_integron ID_replicon element pos_beg pos_end strand  evalue  type_elt    annotation  model   type    default distance_2attC  considered_topology
integron_01 KLPN.1018.02497.0043    KLPN.1018.03231.0064_1  1   240 -1  NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    KLPN.1018.02497.0043_1  2   352 1   NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    KLPN.1018.02497.0043_6  5205    6218    -1  1.3e-25 protein intI    intersection_tyr_intI   complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    Pc_int1 6221    6247    1   NA  Promoter    Pc_1    NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    P_intI1 6239    6273    -1  NA  Promoter    Pint_1  NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    attI1   6298    6356    1   NA  attI    attI_1  NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    KLPN.1018.02497.0043_7  6363    6860    1   NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    attc_001    6855    6944    1   5e-06   attC    attC    attc_4  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    attc_002    7205    7264    1   1.8e-07 attC    attC    attc_4  complete    No  261.0   lin
integron_01 KLPN.1018.02497.0043    KLPN.1018.02497.0043_8  7280    8059    1   NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    attc_003    8061    8120    1   2.6000000000000003e-10  attC    attC    attc_4  complete    No  797.0   lin
integron_02 KLPN.1018.02497.0043    KLPN.1018.03231.0064_3  1125    2138    -1  1.3e-25 protein intI    intersection_tyr_intI   In0 No  NA  lin
integron_01 KLPN.1018.03231.0064    attc_001    72  123 1   0.055   attC    attC    attc_4  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    KLPN.1018.03231.0064_3  1125    2138    -1  1.3e-25 protein intI    intersection_tyr_intI   complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    Pc_int1 2141    2167    1   NA  Promoter    Pc_1    NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    P_intI1 2159    2193    -1  NA  Promoter    Pint_1  NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    attI1   2218    2276    1   NA  attI    attI_1  NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    KLPN.1018.03231.0064_4  2285    2767    1   NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    attc_002    2762    2847    1   0.0027  attC    attC    attc_4  complete    No  2639.0  lin
integron_01 KLPN.1018.03231.0064    KLPN.1018.03231.0064_5  2988    3254    1   NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    KLPN.1018.02497.0043_6  5205    6218    -1  NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    KLPN.1018.02497.0043_7  6363    6860    1   NA  protein protein NA  complete    No  NA  lin
integron_02 KLPN.1018.03231.0064    KLPN.1018.02497.0043_6  5205    6218    -1  1.3e-25 protein intI    intersection_tyr_intI   In0 No  NA  lin

whereas it should give:

ID_integron ID_replicon element pos_beg pos_end strand  evalue  type_elt    annotation  model   type    default distance_2attC  considered_topology
integron_01 KLPN.1018.02497.0043    KLPN.1018.02497.0043_6  5205    6218    -1  1.3e-25 protein intI    intersection_tyr_intI   complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    Pc_int1 6221    6247    1   NA  Promoter    Pc_1    NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    P_intI1 6239    6273    -1  NA  Promoter    Pint_1  NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    attI1   6298    6356    1   NA  attI    attI_1  NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    KLPN.1018.02497.0043_7  6363    6860    1   NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    attc_001    6855    6944    1   5e-06   attC    attC    attc_4  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    attc_002    7205    7264    1   1.8e-07 attC    attC    attc_4  complete    No  261.0   lin
integron_01 KLPN.1018.02497.0043    KLPN.1018.02497.0043_8  7280    8059    1   NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.02497.0043    attc_003    8061    8120    1   2.6000000000000003e-10  attC    attC    attc_4  complete    No  797.0   lin
integron_01 KLPN.1018.03231.0064    KLPN.1018.03231.0064_3  1125    2138    -1  1.3e-25 protein intI    intersection_tyr_intI   complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    Pc_int1 2141    2167    1   NA  Promoter    Pc_1    NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    P_intI1 2159    2193    -1  NA  Promoter    Pint_1  NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    attI1   2218    2276    1   NA  attI    attI_1  NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    KLPN.1018.03231.0064_4  2285    2767    1   NA  protein protein NA  complete    No  NA  lin
integron_01 KLPN.1018.03231.0064    attc_001    2762    2847    1   0.0027  attC    attC    attc_4  complete    No  2639.0  lin
integron_01 KLPN.1018.03231.0064    KLPN.1018.03231.0064_5  2988    3254    1   NA  protein protein NA  complete    No  NA  lin

The following lines were removed because of the topology issue explained above:

> integron_01   KLPN.1018.02497.0043    KLPN.1018.02497.0043_1  2   352 1   NA  protein protein NA  complete    No  NA  lin
> integron_01   KLPN.1018.03231.0064    attc_001    72  123 1   0.055   attC    attC    attc_4  complete    No  NA  lin

And these lines were removed because of another bug I told you about by mail, where proteins are not assigned to their own contig (different IDs in columns 2 and 3):

> integron_01   KLPN.1018.02497.0043    KLPN.1018.03231.0064_1  1   240 -1  NA  protein protein NA  complete    No  NA  lin
> integron_02   KLPN.1018.02497.0043    KLPN.1018.03231.0064_3  1125    2138    -1  1.3e-25 protein intI    intersection_tyr_intI   In0 No  NA  lin
> integron_01   KLPN.1018.03231.0064    KLPN.1018.02497.0043_6  5205    6218    -1  NA  protein protein NA  complete    No  NA  lin
> integron_01   KLPN.1018.03231.0064    KLPN.1018.02497.0043_7  6363    6860    1   NA  protein protein NA  complete    No  NA  lin
> integron_02   KLPN.1018.03231.0064    KLPN.1018.02497.0043_6  5205    6218    -1  1.3e-25 protein intI    intersection_tyr_intI   In0 No  NA  lin
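Rows affected by this second bug can be spotted mechanically. The following is a minimal sketch, not part of Integron_Finder: the `mismatched_proteins()` function is hypothetical, and it assumes that protein IDs are built as the replicon ID followed by `_` and a number, as in the output above. It only looks at rows whose `type_elt` column is `protein`, since attC, attI and promoter names do not carry the replicon ID.

```python
def mismatched_proteins(lines):
    """Return (ID_replicon, element) pairs for protein rows whose element ID
    does not start with the ID of the replicon they are reported on.

    Hypothetical helper; assumes whitespace-separated columns in the order
    shown in the report (ID_integron, ID_replicon, element, ..., type_elt).
    """
    bad = []
    for line in lines:
        fields = line.split()
        # Skip the header, blank lines and truncated rows.
        if len(fields) < 8 or fields[0] == "ID_integron":
            continue
        replicon, element, type_elt = fields[1], fields[2], fields[7]
        if type_elt != "protein":
            continue
        # Assumption: protein ID = "<replicon ID>_<number>".
        prefix = element.rsplit("_", 1)[0]
        if prefix != replicon:
            bad.append((replicon, element))
    return bad


rows = [
    "integron_01 KLPN.1018.02497.0043 KLPN.1018.03231.0064_1 1 240 -1 NA protein protein NA complete No NA lin",
    "integron_01 KLPN.1018.02497.0043 KLPN.1018.02497.0043_6 5205 6218 -1 1.3e-25 protein intI intersection_tyr_intI complete No NA lin",
    "integron_01 KLPN.1018.02497.0043 attc_001 6855 6944 1 5e-06 attC attC attc_4 complete No NA lin",
]
for replicon, element in mismatched_proteins(rows):
    print(f"{element} reported on {replicon}")
```

Run on the three sample rows, this flags only the first one, where `KLPN.1018.03231.0064_1` is reported on `KLPN.1018.02497.0043`.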