BjornFJohansson / pydna

Clone with Python! Data structures for double stranded DNA & simulation of homologous recombination, Gibson assembly, cut & paste cloning.

Unintended behaviour in dseq.__init__? #253

Open manulera opened 3 months ago

manulera commented 3 months ago

Hi @BjornFJohansson, I was wondering whether we want to support this kind of behaviour in Dseq, or whether it is unintended.

from pydna.dseq import Dseq
from pydna.utils import rc

seq1 = "ACGGCAGCCCGT"
seq2 = rc(seq1)

seq1_padded = "aaa" + seq1 + "aaa"
seq2_padded = "ccc" + seq2 + "ccc"

dseq1 = Dseq(seq1_padded, seq2_padded)

print(repr(dseq1))

This gives a Dseq with mismatches at both ends:

Dseq(-18)
aaaACGGCAGCCCGTaaa
cccTGCCGTCGGGCAccc

I wonder if we should restrict the representation to have no mismatches (e.g. use terminal_overlap instead of common_substrings)? Or raise an error when a case like this comes up?
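To make the "no mismatches" condition concrete, here is a minimal sketch of a check that lists the mismatched positions for a given alignment of the two strands. It is plain Python and not part of pydna: `mismatch_positions`, its `offset` parameter and the local `rc` helper are hypothetical names used only for illustration, and `offset` is defined here on its own terms, without claiming to follow the sign convention of pydna's overhang argument.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def rc(seq):
    """Reverse complement of a DNA string (local stand-in for pydna.utils.rc)."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatch_positions(watson, crick, offset=0):
    """Return watson-strand indices whose paired base is not complementary.

    Both strands are given 5'->3'. `offset` (hypothetical parameter) is how
    many positions the crick strand, drawn 3'->5' under the watson strand,
    is shifted to the right; a negative value shifts it to the left.
    """
    bottom = crick[::-1]  # crick strand as it is drawn under watson
    bad = []
    for i, base in enumerate(watson):
        j = i - offset  # index in `bottom` that sits under watson[i]
        if 0 <= j < len(bottom):  # unpaired overhang positions are skipped
            if base.translate(COMPLEMENT).upper() != bottom[j].upper():
                bad.append(i)
    return bad


seq1 = "ACGGCAGCCCGT"
padded = mismatch_positions("aaa" + seq1 + "aaa", "ccc" + rc(seq1) + "ccc")
staggered = mismatch_positions("aaa" + seq1, "ccc" + rc(seq1), offset=3)
print(padded)     # [0, 1, 2, 15, 16, 17]
print(staggered)  # []
```

With the padded strands from this comment the check flags the three bases at each end, while a staggered pair with a fully complementary duplex reports nothing.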

BjornFJohansson commented 2 months ago

This was by design. It is there so that we can make staggered sequences like so:

from pydna.dseq import Dseq
from pydna.utils import rc

seq1 = "ACGGCAGCCCGT"
seq2 = rc(seq1)

seq1_padded = "aaa" + seq1
seq2_padded = "ccc" + seq2

dseq1 = Dseq(seq1_padded, seq2_padded)

print(repr(dseq1))

Dseq(-18)
aaaACGGCAGCCCGT
   TGCCGTCGGGCAccc

Does this create problems in other use cases? Maybe a warning would be appropriate.

manulera commented 2 months ago

Hi @BjornFJohansson, in my example the returned sequence has mismatches at both ends, that's the problematic bit.

If you are manually typing both strands, you may make a mistake when typing one of them, and you may want to get an error in that case.

You can create a sequence with both mismatches and stagger by passing the overhang explicitly, but I am not sure the automatic detection of overhangs should return sequences with mismatches. In general, I would guess that most pydna functions behave unexpectedly when there are mismatches, so I think raising an error would be good.
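As a rough illustration of what an error-raising auto-find could look like, the sketch below tries every stagger and accepts only one in which all paired positions are complementary. It is a standalone assumption, not pydna's implementation: `strict_offset` is a hypothetical name, and its return value is a shift defined locally rather than pydna's overhang value.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def strict_offset(watson, crick):
    """Find a stagger at which every paired position is complementary.

    Both strands are given 5'->3'. Returns how far the crick strand (drawn
    3'->5') is shifted to the right of the watson strand, preferring the
    longest duplex. Raises ValueError if no mismatch-free alignment exists.
    """
    top = watson.upper()
    # Complement and reverse crick so it can be compared to `top` directly.
    bottom = crick.upper().translate(COMPLEMENT)[::-1]
    best = None  # (offset, duplex length)
    for offset in range(-len(bottom) + 1, len(top)):
        start = max(0, offset)
        end = min(len(top), offset + len(bottom))
        if top[start:end] == bottom[start - offset:end - offset]:
            if best is None or end - start > best[1]:
                best = (offset, end - start)
    if best is None:
        raise ValueError("no mismatch-free alignment of the two strands")
    return best[0]


# Staggered pair from the reply above: a clean 12 bp duplex exists.
print(strict_offset("aaaACGGCAGCCCGT", "cccACGGGCTGCCGT"))  # 3

# Padded pair from the original report: every alignment has a mismatch.
try:
    strict_offset("aaaACGGCAGCCCGTaaa", "cccACGGGCTGCCGTccc")
    padded_rejected = False
except ValueError:
    padded_rejected = True
print(padded_rejected)  # True
```

A real check would probably also require a minimum duplex length, since this sketch accepts a single complementary base pair at the very end of the strands.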