gem-pasteur / Integron_Finder

Bioinformatics tool to find integrons in bacterial genomes
GNU General Public License v3.0
64 stars 22 forks

Circularity with multi-fasta input #43

Closed (asetGem closed this issue 6 years ago)

asetGem commented 6 years ago

Behaviour expected:

By default:

Available options:

To sum up, for each replicon, its topology is:

Add replicon topology information in the final tab file.

Steps

bneron commented 6 years ago

commit ce40896 implements

but 8b38da1 (install integron package) is needed for this feature

bneron commented 6 years ago

I have a question about the following feature.

The topology will be set by the --circ, --linear, or --topology-file option, but in the code I read:

    # If sequence is too small, it can be problematic when using circularity
    if len(SEQUENCE) > 4 * DISTANCE_THRESHOLD:
        circular = not args.linear
    else:
        circular = False

So the topology set by the user can be overridden. Which value must appear in the results: the topology set by the user, or the topology effectively used?

jeanrjc commented 6 years ago

Well, I'm not sure. Ideally we shouldn't have such a condition and should let the algo behave normally. We should have a condition to stop once the entire sequence has been parsed, or once an attC site is found for the second time; otherwise the expansion will never stop.
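To make the stop condition concrete, here is a minimal sketch, not the actual Integron_Finder code. The function name expand_cluster, its parameters, and the representation of attC sites as a sorted list of positions are all assumptions made for illustration. It walks forward from one site, wraps around on a circular replicon, and stops as soon as it would visit a site a second time.

```python
def expand_cluster(positions, start_idx, seq_len, dist_threshold, circular=True):
    """Hypothetical helper: return the indices of sites chained from start_idx.

    positions      -- sorted site start coordinates (illustrative stand-in for attC hits)
    seq_len        -- replicon length, used to measure gaps across the origin
    dist_threshold -- largest gap allowed between two consecutive sites
    """
    n = len(positions)
    cluster = [start_idx]
    seen = {start_idx}
    current = start_idx
    while True:
        nxt = current + 1
        if nxt >= n:
            if not circular:
                break  # linear replicon: nothing beyond the last site
            nxt = 0  # circular replicon: wrap around the origin
        if nxt in seen:
            break  # site found a second time: the whole replicon was traversed
        gap = positions[nxt] - positions[current]
        if circular:
            gap %= seq_len  # distance measured across the origin when wrapping
        if gap > dist_threshold:
            break
        cluster.append(nxt)
        seen.add(nxt)
        current = nxt
    return cluster
```

With sites evenly spaced around a small circular replicon, the "seen" check is what ends the loop; without it the walk would keep circling.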

I think we can keep it like this for now, and in the column use the value used by the algo. The true topology is in the topology file. We should name this column "Topology_considered" or something similar to stress the fact that it's not the topology of the sequence but the topology used by IF.
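As a rough sketch of what the column would hold, assuming the size check quoted above stays as it is: the helper name topology_considered and the "circ"/"lin" labels are hypothetical, and the threshold value in the usage lines is only illustrative.

```python
def topology_considered(seq_len, requested, dist_threshold):
    """Hypothetical helper: topology actually used for one replicon.

    requested is the topology asked for by the user ("circ" or "lin").
    A replicon that is not longer than 4 * dist_threshold is treated as
    linear, mirroring the condition quoted earlier in this thread.
    """
    if requested == "circ" and seq_len <= 4 * dist_threshold:
        return "lin"
    return requested


# Illustrative values only: a 10 kb replicon requested as circular,
# with a distance threshold of 4000, falls back to linear.
small = topology_considered(10_000, "circ", 4_000)
large = topology_considered(20_000, "circ", 4_000)
```

Here small would be reported as "lin" in the Topology_considered column even though the user asked for "circ", while large keeps "circ".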

I think it's really an edge case, but we should mention it in the doc if it isn't there already. If it turns out that many people have very small plasmids with an integron spanning the edge, we might implement a proper solution later, or maybe use a lower value than 4 * DISTANCE_THRESHOLD.

TL;DR: Use the value of the parameter in the output file and not the value of the topology file, and rename the column to stress the fact that it's the value of the parameter, not the actual state of the sequence.

Thanks!