BjornFJohansson / pydna

Clone with Python! Data structures for double stranded DNA & simulation of homologous recombination, Gibson assembly, cut & paste cloning.

Issue 157 #163

Closed manulera closed 11 months ago

manulera commented 12 months ago

Hi @BjornFJohansson here is an alternative implementation of the cutting functionality that follows what was mentioned in #157.

It's a big one, so no problem if you take a while to review it. It significantly reduces the amount of code: the total line count in the diff is a net increase, but I have only removed code, and what I added is mostly comments and docstrings. The new implementation uses more of the built-in Biopython functionality, which currently supports searching for cutsites in circular molecules.

The function cut in Dseq is now split into three functions:


cutsites = self.get_cutsites(*enzymes)
cutsite_pairs = self.get_cutsite_pairs(cutsites)
return tuple(self.apply_cut(*cs) for cs in cutsite_pairs)

get_cutsites

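get_cutsites finds the cutsites in the sequence and returns them as a list of tuple[tuple[int,int], _RestrictionType], sorted by where they cut on the 5' (watson) strand.

Each cutsite, e.g. ((3, 7), EcoRI), holds the cut positions on the watson and crick strands (3 and 7 here) together with the enzyme that makes the cut: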

>>> from Bio.Restriction import EcoRI
>>> from pydna.dseq import Dseq
>>> seq = Dseq('AAGAATTCAAGAATTC')
>>> seq.get_cutsites(EcoRI)
[((3, 7), EcoRI), ((11, 15), EcoRI)]
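To make the coordinates concrete, here is a minimal plain-Python sketch that does not use pydna or Biopython. It assumes a linear, upper- or lower-case DNA string and hard-codes the EcoRI site (G^AATTC); the function name find_ecori_cuts and the two offset constants are made up for this illustration and are not part of the PR.

```python
# Illustration only: locate EcoRI sites in a plain string and report
# (watson_cut, crick_cut) in top-strand coordinates.
SITE = "GAATTC"     # EcoRI recognition site, cut as G^AATTC on both strands
WATSON_OFFSET = 1   # top strand is cut after the first base of the site
CRICK_OFFSET = 5    # bottom-strand cut, expressed in top-strand coordinates


def find_ecori_cuts(seq: str) -> list[tuple[int, int]]:
    """Return the cut positions of every EcoRI site, sorted by watson cut."""
    seq = seq.upper()
    cuts = []
    start = seq.find(SITE)
    while start != -1:
        cuts.append((start + WATSON_OFFSET, start + CRICK_OFFSET))
        start = seq.find(SITE, start + 1)
    return cuts


print(find_ecori_cuts("AAGAATTCAAGAATTC"))  # [(3, 7), (11, 15)]
```

The positions agree with the get_cutsites output above: the site starts at index 2, so the watson strand is cut at 3 and the crick strand at 7.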

This is a convenient representation, and you can see why in the function apply_cut, where two such cuts are passed as inputs (ignore the two `if xxxx is not None` for now).


def apply_cut(self, left_cut, right_cut):

        left_watson, left_crick = left_cut[0] if left_cut is not None else ((self.ovhg, 0) if self.ovhg > 0 else (0, -self.ovhg))
        ovhg = self.ovhg if left_cut is None else left_cut[1].ovhg
        right_watson, right_crick = right_cut[0] if right_cut is not None else (len(self.watson), len(self.crick))
        return Dseq(
                    str(self[left_watson:right_watson]),
                    # The line below could be easier to understand as _rc(str(self[left_crick:right_crick])), but it does not preserve the case
                    str(self.reverse_complement()[len(self) - right_crick:len(self) - left_crick]),
                    ovhg=ovhg,
                )
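The slicing in apply_cut can be tried out on plain strings. The sketch below is not the pydna implementation: it assumes a blunt linear top strand, takes the cuts as bare (watson, crick) pairs without the enzyme, and computes the stagger from the positions, whereas apply_cut reads it from the enzyme's ovhg. The names extract_fragment and render are hypothetical.

```python
# Illustration only: what a pair of (watson, crick) cut positions means
# for the two strands of the resulting fragment.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_fragment(top, left_cut, right_cut):
    """Return (watson, crick, stagger) for the fragment between two cuts."""
    left_watson, left_crick = left_cut
    right_watson, right_crick = right_cut
    watson = top[left_watson:right_watson]
    # Bottom strand written 5'->3': reverse complement of the region it covers
    crick = top[left_crick:right_crick].translate(COMPLEMENT)[::-1]
    # Assumed sign convention: negative when the watson strand sticks out
    stagger = left_watson - left_crick
    return watson, crick, stagger


def render(watson, crick, stagger):
    """Draw both strands aligned, with the bottom strand written 3'->5'."""
    top_pad = " " * max(stagger, 0)
    bottom_pad = " " * max(-stagger, 0)
    return f"{top_pad}{watson}\n{bottom_pad}{crick[::-1]}"


watson, crick, stagger = extract_fragment("AAGAATTCAAGAATTC", (3, 7), (11, 15))
print(render(watson, crick, stagger))
```

For the two EcoRI cuts of the example sequence this prints the watson strand AATTCAAG over the crick strand indented by four bases, i.e. a fragment with a 5' AATT overhang at each end.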

get_cutsite_pairs

This pairs consecutive cutsites, so that each pair marks the left and right edges of one resulting fragment.

def get_cutsite_pairs(self, cutsites):
        """ Pairs the cutsites 2 by 2 to render the edges of the resulting fragments.

        Special cases:
        - Single cutsite on circular sequence: returns a pair where both cutsites are the same
        - Linear sequence:
            - the first pair gets `None` as its left_cut, to represent the left edge of the sequence as it is.
            - the last pair gets `None` as its right_cut, to represent the right edge of the sequence as it is.
            - In both cases, `None` indicates that the edge is not made by an enzyme.

        Parameters
        ----------
        cutsites : list[tuple[tuple[int,int], _RestrictionType]]

        Returns
        -------
        list[tuple[tuple[tuple[int,int], _RestrictionType]|None, tuple[tuple[int,int], _RestrictionType]|None]]

        Examples
        --------

        >>> from Bio.Restriction import EcoRI
        >>> from pydna.dseq import Dseq
        >>> seq = Dseq('AAGAATTCAAGAATTC')
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(None, ((3, 7), EcoRI)), (((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), None)]
        >>> seq = Dseq('AAGAATTCAAGAATTC', circular=True)
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(((3, 7), EcoRI), ((11, 15), EcoRI)), (((11, 15), EcoRI), ((3, 7), EcoRI))]
        >>> seq = Dseq('AAGAATTCAA', circular=True)
        >>> seq.get_cutsite_pairs(seq.get_cutsites(EcoRI))
        [(((3, 7), EcoRI), ((3, 7), EcoRI))]
        """
        if len(cutsites) == 0:
            return []
        if not self.circular:
            cutsites = [None, *cutsites, None]
        else:
            # Add the first cutsite at the end, for circular cuts
            # (build a new list so that the caller's list is not modified)
            cutsites = [*cutsites, cutsites[0]]

        return list(zip(cutsites, cutsites[1:]))

apply_cut

Extracts a fragment from a sequence based on a pair of cuts. The code is above, and you can now see what the two None checks are for: a cut is None at the edges of a linear molecule, where the end is not made by an enzyme.
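Putting pairing and extraction together, the whole flow can be sketched with plain strings. This is only an illustration under simplifying assumptions: the input is blunt, cuts are bare (watson, crick) pairs without the enzyme, overhang bookkeeping is left out, and the circular case is handled by wrapping a slice around the origin. The function names pair_cuts, circ_slice and digest are invented for the sketch.

```python
# Illustration only: pair the cuts, then slice out each fragment.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(s):
    return s.translate(COMPLEMENT)[::-1]


def pair_cuts(cuts, circular):
    """Pair consecutive cuts; None marks an uncut edge of a linear molecule."""
    if not cuts:
        return []
    padded = [*cuts, cuts[0]] if circular else [None, *cuts, None]
    return list(zip(padded, padded[1:]))


def circ_slice(s, start, end, circular):
    """Slice s[start:end], wrapping around the origin on circular molecules."""
    if circular and start >= end:
        return s[start:] + s[:end]
    return s[start:end]


def digest(top, cuts, circular=False):
    """Return a (watson, crick) pair of 5'->3' strings for each fragment."""
    fragments = []
    for left, right in pair_cuts(cuts, circular):
        left_watson, left_crick = left if left is not None else (0, 0)
        right_watson, right_crick = right if right is not None else (len(top), len(top))
        watson = circ_slice(top, left_watson, right_watson, circular)
        crick = revcomp(circ_slice(top, left_crick, right_crick, circular))
        fragments.append((watson, crick))
    return fragments


for fragment in digest("AAGAATTCAAGAATTC", [(3, 7), (11, 15)]):
    print(fragment)
print(digest("AAGAATTCAA", [(3, 7)], circular=True))
```

The linear digest yields three fragments, with the two outer ones keeping their original blunt ends, while the circular molecule with a single site yields one linear fragment of the same total length.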

Extra things / thoughts

Backward compatibility

The only incompatibility is that the cuts are now returned in the same order regardless of the order of the input enzymes. I think this behaviour is preferable, but I could make it backward-compatible. I have modified some tests so that they test for the new behaviour: see test_module_dseqrecord.py, the line that says @pytest.mark.xfail(reason="issue #78"), and the lines in the test files that start with # TODO:. You can easily find them in the diff page of the PR.
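The order independence follows from sorting the cutsites by position before pairing them. A minimal sketch of that idea, using hypothetical enzyme labels and made-up cut positions rather than real Biopython objects:

```python
# Illustration only: merge the cuts of several enzymes and sort by position,
# so the result does not depend on the order in which enzymes are passed.
def merged_cuts(cuts_by_enzyme, enzymes):
    """Collect ((watson, crick), enzyme_name) entries, sorted by position."""
    cuts = [(pos, name) for name in enzymes for pos in cuts_by_enzyme[name]]
    return sorted(cuts, key=lambda cut: cut[0])


# Hypothetical labels and positions, chosen only for the example
hypothetical_sites = {"EnzA": [(11, 15)], "EnzB": [(3, 7)]}

forward = merged_cuts(hypothetical_sites, ["EnzA", "EnzB"])
backward = merged_cuts(hypothetical_sites, ["EnzB", "EnzA"])
print(forward == backward)  # True
```

A backward-compatible mode would instead have to keep the cuts grouped by enzyme in input order, which is the behaviour the modified tests no longer expect.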

codecov[bot] commented 12 months ago

Codecov Report

Merging #163 (31f36de) into develop (5e7c5dc) will increase coverage by 0.01%. The diff coverage is 95.23%.

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop     #163      +/-   ##
===========================================
+ Coverage    93.94%   93.95%   +0.01%
===========================================
  Files           36       36
  Lines         3617     3575      -42
  Branches       560      543      -17
===========================================
- Hits          3398     3359      -39
  Misses         183      183
+ Partials        36       33       -3
```

| Files | Coverage Δ | |
|---|---|---|
| src/pydna/dseq.py | `96.20% <100.00%> (+0.86%)` | :arrow_up: |
| src/pydna/dseqrecord.py | `95.59% <88.88%> (-0.51%)` | :arrow_down: |