jeanrjc closed this issue 2 years ago
Can you provide data so that I can integrate it into the tests?
Hi @bneron, I think you have access to VIVU001.C.00013.C001. Let me know if you can't reproduce the issue.
I don't have access to VIVU001.C.00013.C001.
Is the strand important for determining duplicates?
If we remove attC sites whose pos_beg and pos_end are close (less than 3 bases apart), what should the behavior be when we have 3 attC sites with the following pos_beg/pos_end?
Accession_number | cm_attC | cm_debut | cm_fin | pos_beg | pos_end | sens | evalue |
---|---|---|---|---|---|---|---|
ACBA.007.P01_13 | attc_4 | 1 | 47 | 5500 | 7000 | + | 1.100000e-07 |
ACBA.007.P01_13 | attc_4 | 1 | 47 | 5502 | 7002 | + | 1.100000e-04 |
ACBA.007.P01_13 | attc_4 | 1 | 47 | 5504 | 7004 | + | 1.100000e-03 |
> Is the strand important for determining duplicates?

I would say it is not. The sites are partly palindromic. I would suggest just checking the overlap.
- The second attC is less than 3 bases away from the first one, so it should be removed?
- The third attC is less than 3 bases away from the second one but more than 3 bases away from the first one, so should we remove it or keep it?

My opinion: remove every site that overlaps another by more than 50% of its size, i.e. take the best one and remove those that overlap it by more than 50%.
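The 50% overlap rule suggested above could be sketched as follows. This is only an illustration, not the project's implementation: the `AttC` tuple, the function names, the choice of "best" as the lowest e-value, and the definition of overlap as a fraction of the shorter site (with inclusive coordinates) are all assumptions.

```python
from typing import NamedTuple


class AttC(NamedTuple):
    # Hypothetical minimal record; field names follow the table columns.
    pos_beg: int
    pos_end: int
    evalue: float


def overlap_fraction(a: AttC, b: AttC) -> float:
    """Overlap length divided by the length of the shorter site (assumed definition)."""
    overlap = min(a.pos_end, b.pos_end) - max(a.pos_beg, b.pos_beg) + 1
    shorter = min(a.pos_end - a.pos_beg, b.pos_end - b.pos_beg) + 1
    return max(overlap, 0) / shorter


def remove_overlapping(sites, max_overlap=0.5):
    """Keep the best-scoring site of each group of heavily overlapping sites."""
    kept = []
    # Visit sites from best (lowest) to worst e-value, so a kept site is
    # always at least as good as any later site it causes to be dropped.
    for site in sorted(sites, key=lambda s: s.evalue):
        if all(overlap_fraction(site, k) <= max_overlap for k in kept):
            kept.append(site)
    return sorted(kept, key=lambda s: s.pos_beg)


# The three sites from the table above.
sites = [
    AttC(5500, 7000, 1.1e-07),
    AttC(5502, 7002, 1.1e-04),
    AttC(5504, 7004, 1.1e-03),
]
print(remove_overlapping(sites))
```

Because every candidate is compared with the sites already kept rather than with its immediate neighbour, the third site is dropped for overlapping the first one, which sidesteps the chaining ambiguity of a fixed 3-base threshold.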
Observation: we may detect 2 attC sites where the 2nd attC starts before the first attC ends.
Consequence: the 2nd attC is one too many, so the reported distance is wrong.
Impact: minimal.
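A small check for this situation might look like the sketch below. It assumes that sites are given as `(pos_beg, pos_end)` tuples and that the distance between two consecutive sites is the next `pos_beg` minus the previous `pos_end`; both the function name and the example coordinates are made up for illustration.

```python
def overlapping_pairs(sites):
    """Return (previous, current) pairs where current starts before previous ends."""
    ordered = sorted(sites)  # sort by pos_beg, then pos_end
    pairs = []
    # Only consecutive sites (in genomic order) are compared.
    for prev, cur in zip(ordered, ordered[1:]):
        distance = cur[0] - prev[1]  # assumed definition of the inter-attC distance
        if distance < 0:
            pairs.append((prev, cur))
    return pairs


# Hypothetical coordinates: the second site starts inside the first one.
example = [(100, 160), (150, 210), (400, 460)]
print(overlapping_pairs(example))
```

A negative distance is the symptom described above: the second site of each returned pair is a candidate for removal.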
We should remove duplicates on `pos_beg` or `pos_end` here, not just on `pos_beg`.
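As a rough sketch of that idea, the function below drops any row that shares either `pos_beg` or `pos_end` with a better-scoring row. The function name, the dictionary-based rows, the example values, and the use of the lowest e-value as the tie-breaker are assumptions for illustration, not the actual code.

```python
def drop_position_duplicates(rows):
    """Keep the lowest e-value row among rows sharing pos_beg or pos_end."""
    seen_beg, seen_end = set(), set()
    kept = []
    for row in sorted(rows, key=lambda r: r["evalue"]):
        if row["pos_beg"] in seen_beg or row["pos_end"] in seen_end:
            continue  # same start or same end as a better-scoring kept row
        seen_beg.add(row["pos_beg"])
        seen_end.add(row["pos_end"])
        kept.append(row)
    return sorted(kept, key=lambda r: r["pos_beg"])


# Hypothetical rows: two of them duplicate the first on one coordinate only.
rows = [
    {"pos_beg": 5500, "pos_end": 7000, "evalue": 1e-07},
    {"pos_beg": 5500, "pos_end": 7010, "evalue": 1e-04},  # same pos_beg
    {"pos_beg": 5490, "pos_end": 7000, "evalue": 1e-03},  # same pos_end
    {"pos_beg": 9000, "pos_end": 9100, "evalue": 1e-05},
]
print(drop_position_duplicates(rows))
```

Deduplicating on `pos_beg` alone would have kept the third row, which ends at the same position as the first.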