Alternative assembly implementation

manulera commented 11 months ago

Hi @BjornFJohansson I had started working on a similar refactoring to #157 for the assemblies (find which assemblies can be made with one function, execute them with another). To get this to work, I had to change things significantly, so my implementation has become quite different. I still use a graph, but the meaning of edges is different. I have not implemented the Contig functionality, but I could do something similar, if you think it would make sense to include this assembly in pydna. For now, it is in the ShareYourCloning repo.

You can find the implementation in this_file. It passes all the tests that pydna's assembly passes, excluding the order in which fragments are returned, and that it only returns unique assemblies (see test file).

If you want to have a look, I would start from the docstring of Assembly, and then add_edges_from_match to understand how it works. I think it's quite simplified from the original implementation, for example, the construction of the graph is very short (the function add_edges_from_match is quite short as well):

def __init__(self, frags: list[_Dseqrecord], limit=25, algorithm=common_sub_strings, use_fragment_order=True, use_all_fragments=False):
    # TODO: allow for the same fragment to be included more than once?
    G = _nx.MultiDiGraph()
    # Add positive and negative nodes for forward and reverse fragments
    G.add_nodes_from((i + 1, {'seq': f}) for (i, f) in enumerate(frags))
    G.add_nodes_from((-(i + 1), {'seq': f.reverse_complement()}) for (i, f) in enumerate(frags))

    # Iterate over all possible combinations of fragments
    edge_pairs = _itertools.combinations(filter(lambda x : x>0, G.nodes), 2)
    for index_first, index_secnd in edge_pairs:
        first = G.nodes[index_first]['seq']
        secnd = G.nodes[index_secnd]['seq']

        # Overlaps where both fragments are in the forward orientation
        matches_fwd = algorithm(str(first.seq).upper(), str(secnd.seq).upper(), limit)
        for match in matches_fwd:
            add_edges_from_match(match, index_first, index_secnd, first, secnd, G)

        # Overlaps where the first fragment is in the forward orientation and the second in the reverse orientation
        matches_rvs = algorithm(str(first.seq).upper(), reverse_complement(str(secnd.seq).upper()), limit)
        for match in matches_rvs:
            add_edges_from_match(match, index_first, -index_secnd, first, secnd, G)

Let me know what you think.

BjornFJohansson commented 11 months ago

OK, lots to digest. In principle I don't mind putting this alongside and eventually replace the existing code. Ill read it carefully and get back to you.

manulera commented 11 months ago

Yes, actually the code has slightly changed since, although it is still in the same spirit. I am currently using that new assembly module to do PCR, ligations (also partial) and homologous recombination. I will adapt for gibson and golden gate as well. I am just writing slightly different algorithm functions for each case.

If you think it can be useful, I can give you a short overview on zoom before you dive into it.

manulera commented 10 months ago

Hi @BjornFJohansson, when you merge the branches you mentioned in the call into cutsite_pairs, let me know and I can add the new assembly implementation.

BjornFJohansson / pydna

Alternative assembly implementation #165