gem-pasteur / Integron_Finder

Bioinformatics tool to find integrons in bacterial genomes
GNU General Public License v3.0

problem with overlapping/near identical attC sites #69

Closed jeanrjc closed 2 years ago

jeanrjc commented 4 years ago

Observation: two attC sites may be detected where the second one starts before the first one ends:

ID_integron  ID_replicon           element   pos_beg  pos_end  strand  evalue  distance_2attC
integron_01  VIVU001.C.00013.C001  attc_014  1693083  1693209  -1.0    0.099   339
integron_01  VIVU001.C.00013.C001  attc_015  1693084  1693209  -1.0    0.017   3310967

Consequence: the 2nd attC is one too many, so the reported distance is wrong.

Impact: minimal

We should remove duplicates based on pos_beg or pos_end here, not just pos_beg.
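To make the proposal concrete, here is a minimal sketch of deduplicating on either boundary. It is not the Integron_Finder code: the `AttC` tuple and the `drop_shared_boundary` helper are hypothetical names, and keeping the site with the lowest evalue is an assumption about which duplicate should win.

```python
from typing import NamedTuple


class AttC(NamedTuple):
    """Hypothetical record holding only the columns needed here."""
    element: str
    pos_beg: int
    pos_end: int
    evalue: float


def drop_shared_boundary(sites):
    """Drop sites sharing pos_beg OR pos_end with a better-scoring site.

    Assumption: among duplicates, the site with the lowest evalue is kept.
    """
    seen_beg, seen_end = set(), set()
    kept = []
    # Visit the best-scoring sites first so they claim their boundaries.
    for site in sorted(sites, key=lambda s: s.evalue):
        if site.pos_beg in seen_beg or site.pos_end in seen_end:
            continue  # same start or same end as a site already kept
        seen_beg.add(site.pos_beg)
        seen_end.add(site.pos_end)
        kept.append(site)
    # Return the survivors in genomic order.
    return sorted(kept, key=lambda s: s.pos_beg)


# The two sites reported above: same pos_end, pos_beg differs by one base.
sites = [
    AttC("attc_014", 1693083, 1693209, 0.099),
    AttC("attc_015", 1693084, 1693209, 0.017),
]
kept = drop_shared_boundary(sites)
print([s.element for s in kept])
```

A filter on pos_beg alone would keep both sites, because their start positions differ by one base. Checking pos_end as well leaves only one of them.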

bneron commented 4 years ago

Can you provide data so that we can integrate it into the tests?

jeanrjc commented 4 years ago

Hi @bneron, I think you have access to VIVU001.C.00013.C001, let me know if you can't reproduce the issue.

bneron commented 4 years ago

I don't have access to VIVU001.C.00013.C001.

bneron commented 3 years ago

Is the strand important for determining duplicates?

bneron commented 3 years ago

If we remove attC sites whose pos_beg and pos_end are close (less than 3 bases apart), what should the behavior be when we have 3 attC sites with the following pos_beg/pos_end?

Accession_number  cm_attC  cm_debut  cm_fin  pos_beg  pos_end  sens  evalue
ACBA.007.P01_13   attc_4   1         47      5500     7000     +     1.100000e-07
ACBA.007.P01_13   attc_4   1         47      5502     7002     +     1.100000e-04
ACBA.007.P01_13   attc_4   1         47      5504     7004     +     1.100000e-03

cachapuz2001 commented 3 years ago

> Is the strand important for determining duplicates?

I would say it is not. They are partly palindromic. I would suggest just checking the overlap.

  • The second attC is less than 3 bases from the first one and should be removed?
  • The third attC is less than 3 bases from the second one but more than 3 bases from the first one, so should we remove it or keep it?

My opinion: one should remove all sites that overlap by more than 50% of their size, i.e. take the best one and remove those that overlap it by more than 50%.
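The overlap rule suggested above can be sketched as a greedy filter. This is an illustration only, not the project's implementation: `overlap_fraction` and `filter_overlapping` are hypothetical names, "best" is assumed to mean the lowest evalue, and "50% of the size" is assumed to mean 50% of the shorter of the two sites. Sites are plain `(pos_beg, pos_end, evalue)` tuples with inclusive coordinates.

```python
def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter site.

    Assumption: coordinates are inclusive, and "size" means the shorter site.
    """
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0  # the two sites do not overlap
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return inter / shorter


def filter_overlapping(sites, max_overlap=0.5):
    """Keep the best site, then drop any site overlapping a kept one too much."""
    kept = []
    # Assumption: the "best" site is the one with the lowest evalue.
    for site in sorted(sites, key=lambda s: s[2]):
        if all(overlap_fraction(site, k) <= max_overlap for k in kept):
            kept.append(site)
    return sorted(kept)  # back to genomic order (by pos_beg)


# The three sites from the question above: (pos_beg, pos_end, evalue).
sites = [
    (5500, 7000, 1.1e-07),
    (5502, 7002, 1.1e-04),
    (5504, 7004, 1.1e-03),
]
kept = filter_overlapping(sites)
print(kept)
```

Because each candidate is compared against the sites already kept rather than against its immediate neighbour, the third site is judged against the first one directly. The chaining question (close to the second but not to the first) does not arise: both weaker sites overlap the best one almost entirely and are dropped.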