Hi @BjornFJohansson here is an alternative implementation of the cutting functionality that follows what was mentioned in #157.
It's a big one, so no problem if you take a while to review it. It decreases significantly the lines of code. The difference in total lines is a net positive, but I have removed only code, and added quite a bit of comments / docstrings. The current implementation uses more of the built-in biopython functionality, which currently supports searching for cutsites in circular molecules.
The function cut in Dseq is now splitted into three functions:
cutsites = self.get_cutsites(*enzymes)
cutsite_pairs = self.get_cutsite_pairs(cutsites)
return tuple(self.apply_cut(*cs) for cs in cutsite_pairs)
get_cutsites
get_cutsites finds cutsites in the sequence, returned as a list of tuple[tuple[int,int], _RestrictionType], sorted by where they cut on the 5' strand.
For a given cutsite, e.g. [(3, 7), EcoRI]:
The first position in the tuple (3) is the position where the enzyme will cut in the watson strand(same meaning as enzyme.search() - 1 in biopython)
The second position in the tuple (7) is the position where the enzyme will cut in the crick strand.
This is a convenient representation, and you can see why in the function apply_cut, where two such cuts are passed as inputs (ignore the twoif xxxx is not None for now).
def apply_cut(self, left_cut, right_cut):
left_watson, left_crick = left_cut[0] if left_cut is not None else ((self.ovhg, 0) if self.ovhg > 0 else (0, -self.ovhg))
ovhg = self.ovhg if left_cut is None else left_cut[1].ovhg
right_watson, right_crick = right_cut[0] if right_cut is not None else (len(self.watson), len(self.crick))
return Dseq(
str(self[left_watson:right_watson]),
# The line below could be easier to understand as _rc(str(self[left_crick:right_crick])), but it does not preserve the case
str(self.reverse_complement()[len(self) - right_crick:len(self) - left_crick]),
ovhg=ovhg,
)
get_cutsite_pairs
This pairs the cutsites 2 by 2 to render the edges of the resulting fragments.
def get_cutsite_pairs(self, cutsites):
""" Pairs the cutsites 2 by 2 to render the edges of the resulting fragments.
Special cases:
- Single cutsite on circular sequence: returns a pair where both cutsites are the same
- Linear sequence:
- creates a new left_cut on the first pair equal to `None` to represent the left edge of the sequence as it is.
- creates a new right_cut on the last pair equal to `None` to represent the right edge of the sequence as it is.
- In both new cuts, the enzyme is set to None to indicate that the cut is not made by an enzyme.
Parameters
----------
cutsites : list[tuple[tuple[int,int], _RestrictionType]]
Returns
-------
list[tuple[tuple[tuple[int,int], _RestrictionType]|None],tuple[tuple[int,int], _RestrictionType]|None]
Examples
--------
>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> seq = Dseq('AAGAATTCAAGAATTC')
>>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
[(None, ((3, 7), EcoRI)), (((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), None)]
>>> seq = Dseq('AAGAATTCAAGAATTC', circular=True)
>>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
[(((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), ((3, 7), EcoRI))]
>>> seq = Dseq('AAGAATTCAA', circular=True)
>>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
[(((3, 7), EcoRI), ((3, 7), EcoRI))]
"""
if len(cutsites) == 0:
return []
if not self.circular:
cutsites = [None, *cutsites, None]
else:
# Add the first cutsite at the end, for circular cuts
cutsites.append(cutsites[0])
return list(zip(cutsites, cutsites[1:]))
apply_cut
Extracts a fragment from a sequence based on a pair of cuts, the code is above, and you can see now the case for when the enzyme is set to None (special case for the edges of a linear molecule).
Extra things / thoughts
The pos property of Dseq could be now removed
CutSite could be made into a class with properties cut_watson, cut_crick, enzyme (would probably give more clarity).
The only problem is that the cuts are returned in the same order regardless of the order of the input enzymes. I think this is a preferable behaviour, but I could make it back-compatible. I have modified some tests so that they test for the new behaviour, see test_module_dseqrecord.py, the line that says @pytest.mark.xfail(reason="issue #78"), and the lines in the test files that start with # TODO:, you can easily find them in the page of the diff of the PR.
Hi @BjornFJohansson here is an alternative implementation of the cutting functionality that follows what was mentioned in #157.
It's a big one, so no problem if you take a while to review it. It decreases significantly the lines of code. The difference in total lines is a net positive, but I have removed only code, and added quite a bit of comments / docstrings. The current implementation uses more of the built-in biopython functionality, which currently supports searching for cutsites in circular molecules.
The function cut in
Dseq
is now splitted into three functions:get_cutsites
get_cutsites
finds cutsites in the sequence, returned as a list oftuple[tuple[int,int], _RestrictionType]
, sorted by where they cut on the 5' strand.For a given cutsite, e.g.
[(3, 7), EcoRI]
:enzyme.search() - 1
in biopython)This is a convenient representation, and you can see why in the function
apply_cut
, where two such cuts are passed as inputs (ignore the twoif xxxx is not None
for now).get_cutsite_pairs
This pairs the cutsites 2 by 2 to render the edges of the resulting fragments.
apply_cut
Extracts a fragment from a sequence based on a pair of cuts, the code is above, and you can see now the case for when the enzyme is set to
None
(special case for the edges of a linear molecule).Extra things / thoughts
The
pos
property of Dseq could be now removedCutSite
could be made into a class with propertiescut_watson
,cut_crick
,enzyme
(would probably give more clarity).The pairs at the edges could simply beI ended up doing this, which I think is a bit clearer. With https://github.com/BjornFJohansson/pydna/pull/163/commits/0edcd200f576941f3534f1136e5daffe31c4de62(None, ((3, 7), EcoRI))
instead of(((0, 0), None), ((3, 7), EcoRI))
, perhaps clearer.The methods can be renamed, perhaps
cut_fragment
would be more clear?I have added the dependency ofNot anymore (https://github.com/BjornFJohansson/pydna/pull/163/commits/bf8aee2c1bf15ea674750cf9d9020e7d94cb04f7)more_itertools
, for the functionpairwise
. This come with normal itertools in python 3.10, but not older versions.Back compatibility
The only problem is that the cuts are returned in the same order regardless of the order of the input enzymes. I think this is a preferable behaviour, but I could make it back-compatible. I have modified some tests so that they test for the new behaviour, see
test_module_dseqrecord.py
, the line that says@pytest.mark.xfail(reason="issue #78")
, and the lines in the test files that start with# TODO:
, you can easily find them in the page of the diff of the PR.