Issue 157 - Githubissues

Hi @BjornFJohansson here is an alternative implementation of the cutting functionality that follows what was mentioned in #157.

It's a big one, so no problem if you take a while to review it. It decreases significantly the lines of code. The difference in total lines is a net positive, but I have removed only code, and added quite a bit of comments / docstrings. The current implementation uses more of the built-in biopython functionality, which currently supports searching for cutsites in circular molecules.

The function cut in Dseq is now splitted into three functions:


cutsites = self.get_cutsites(*enzymes)
cutsite_pairs = self.get_cutsite_pairs(cutsites)
return tuple(self.apply_cut(*cs) for cs in cutsite_pairs)

get_cutsites

get_cutsites finds cutsites in the sequence, returned as a list of tuple[tuple[int,int], _RestrictionType], sorted by where they cut on the 5' strand.

For a given cutsite, e.g. [(3, 7), EcoRI]:

The first position in the tuple (3) is the position where the enzyme will cut in the watson strand(same meaning as enzyme.search() - 1 in biopython)
The second position in the tuple (7) is the position where the enzyme will cut in the crick strand.
EcoRI is the enzyme python object

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> seq = Dseq('AAGAATTCAAGAATTC')
>>> seq.get_cutsites(EcoRI)
[((3, 7), EcoRI), ((11, 15), EcoRI)]

This is a convenient representation, and you can see why in the function apply_cut, where two such cuts are passed as inputs (ignore the twoif xxxx is not None for now).


def apply_cut(self, left_cut, right_cut):

        left_watson, left_crick = left_cut[0] if left_cut is not None else ((self.ovhg, 0) if self.ovhg > 0 else (0, -self.ovhg))
        ovhg = self.ovhg if left_cut is None else left_cut[1].ovhg
        right_watson, right_crick = right_cut[0] if right_cut is not None else (len(self.watson), len(self.crick))
        return Dseq(
                    str(self[left_watson:right_watson]),
                    # The line below could be easier to understand as _rc(str(self[left_crick:right_crick])), but it does not preserve the case
                    str(self.reverse_complement()[len(self) - right_crick:len(self) - left_crick]),
                    ovhg=ovhg,
                )

get_cutsite_pairs

This pairs the cutsites 2 by 2 to render the edges of the resulting fragments.

def get_cutsite_pairs(self, cutsites):
        """ Pairs the cutsites 2 by 2 to render the edges of the resulting fragments.

        Special cases:
        - Single cutsite on circular sequence: returns a pair where both cutsites are the same
        - Linear sequence:
            - creates a new left_cut on the first pair equal to `None` to represent the left edge of the sequence as it is.
            - creates a new right_cut on the last pair equal to `None` to represent the right edge of the sequence as it is.
            - In both new cuts, the enzyme is set to None to indicate that the cut is not made by an enzyme.

        Parameters
        ----------
        cutsites : list[tuple[tuple[int,int], _RestrictionType]]

        Returns
        -------
        list[tuple[tuple[tuple[int,int], _RestrictionType]|None],tuple[tuple[int,int], _RestrictionType]|None]

        Examples
        --------

        >>> from Bio.Restriction import EcoRI
        >>> from pydna.dseq import Dseq
        >>> seq = Dseq('AAGAATTCAAGAATTC')
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(None, ((3, 7), EcoRI)), (((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), None)]
        >>> seq = Dseq('AAGAATTCAAGAATTC', circular=True)
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), ((3, 7), EcoRI))]
        >>> seq = Dseq('AAGAATTCAA', circular=True)
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(((3, 7), EcoRI), ((3, 7), EcoRI))]
        """
        if len(cutsites) == 0:
            return []
        if not self.circular:
            cutsites = [None, *cutsites, None]
        else:
            # Add the first cutsite at the end, for circular cuts
            cutsites.append(cutsites[0])

        return list(zip(cutsites, cutsites[1:]))

apply_cut

Extracts a fragment from a sequence based on a pair of cuts, the code is above, and you can see now the case for when the enzyme is set to None (special case for the edges of a linear molecule).

Extra things / thoughts

The pos property of Dseq could be now removed
CutSite could be made into a class with properties cut_watson, cut_crick, enzyme (would probably give more clarity).
~~The pairs at the edges could simply be (None, ((3, 7), EcoRI)) instead of (((0, 0), None), ((3, 7), EcoRI)), perhaps clearer.~~ I ended up doing this, which I think is a bit clearer. With https://github.com/BjornFJohansson/pydna/pull/163/commits/0edcd200f576941f3534f1136e5daffe31c4de62
The methods can be renamed, perhaps cut_fragment would be more clear?
~~I have added the dependency of more_itertools, for the function pairwise. This come with normal itertools in python 3.10, but not older versions.~~ Not anymore (https://github.com/BjornFJohansson/pydna/pull/163/commits/bf8aee2c1bf15ea674750cf9d9020e7d94cb04f7)

Back compatibility

The only problem is that the cuts are returned in the same order regardless of the order of the input enzymes. I think this is a preferable behaviour, but I could make it back-compatible. I have modified some tests so that they test for the new behaviour, see test_module_dseqrecord.py, the line that says @pytest.mark.xfail(reason="issue #78"), and the lines in the test files that start with # TODO:, you can easily find them in the page of the diff of the PR.

BjornFJohansson / pydna

Issue 157 #163

get_cutsites

get_cutsite_pairs

apply_cut

Extra things / thoughts

Back compatibility

Codecov Report