ENH: add SapI enzyme restriction sites

alexsongrj commented 3 years ago

DNAweaver is a cool library to perform golden gate assembly. We would like to introduce "SapI" restriction enzyme to Golden Gate, change code in below. it is working for our sequence. Might need a more random sequence to test this. welcome to further discuss one of the changes.

jlerman44 commented 3 years ago

Below is a script I wrote where 100 random text fixtures are generated and then tested. It works so long as half_homology = 2. If we let it evaluate to 3 // 2 == 1 per the code in the master branch we cannot get the output to pass design verification by dnacauldron. @Zulko and @veghp, can you see if this might make sense? Or if it would be bad for other designs to hardcode this to 2?

import random
import dnaweaver as dw
from dnaweaver import TmSegmentSelector
import dnacauldron as dc
import pandas as pd
import os

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from geneblocks import  sequences_are_circularly_equal

def DNA(length=int(), letters="CGTA"):
    return''.join(random.choices(letters, k=length))

NUM_TRIALS = 100

for i in range(0, NUM_TRIALS):
    sequences = {
        'part1': DNA(1000),
        'part2': DNA(1000),
        'part3': DNA(1000),
    }

    desired_sequence = sequences['part1'] + sequences['part2'] + sequences['part3']
    # make sure desired has no SapI site.
    desired_sequence = desired_sequence.replace('GCTCTTC', 'GCGGAGC').replace('GAAGAGC', 'GATTTTC')
    desired_sequence = dw.SequenceString(desired_sequence, metadata={"topology": 'circular'})

    oligos_company = dw.CommercialDnaOffer(
        "OligoCompany",
        sequence_constraints=[dw.SequenceLengthConstraint(max_length=200)],
        pricing=dw.PerBasepairPricing(0.1)
    )
    pcr_station = dw.PcrExtractionStation(
        name="PCR station",
        max_overhang_length=50,
        primers_supplier=oligos_company,
        sequences=sequences,
        homology_selector=TmSegmentSelector(min_tm=35, max_tm=70),
        extra_cost=5
    )
    assembly_station = dw.DnaAssemblyStation(
        name="Golden Gate assembly",
        assembly_method = dw.GoldenGateAssemblyMethod(
            enzyme='SapI',
            min_segment_length=100,
            max_segment_length=10000,
            wildcard_basepair='G',
            left_addition='TCCTAG',
            right_addition='CTAGGA',
        ),
        supplier=pcr_station,
        coarse_grain=5,
        fine_grain=None,
        logger='bar'
    )

    # THIS LINE WILL PRE-BLAST THE SEQUENCE TO ACCELERATE COMPUTATIONS.
    assembly_station.prepare_network_on_sequence(desired_sequence)

    # FIND AN ASSEMBLY PLAN AND PRINT IT.
    quote = assembly_station.get_quote(desired_sequence)
    print(quote)
    assembly_report = quote.to_assembly_plan_report()
    try:
        assembly_report.write_full_report(f"Design{i}")
        print("Success! See figure assembly_plans.png.")
    except Exception as e:
        print("Error encountered....")
        print(e)
        raise

    # now verify things are correct
    df = pd.read_csv(f'Design{i}/sequences.csv', sep=';')
    target_sequence = df[df.Source=="Golden Gate assembly"].Sequence.unique()[0]
    df = df[df.Source=='PCR station']

    parts = {}
    ids = []
    for _, row in df.iterrows():
        new_seq_record = SeqRecord(Seq(row.Sequence))
        new_seq_record.id = row.ID
        new_seq_record.description = row.ID
        new_seq_record.name = row.ID
        new_seq_record.annotations['topology'] = 'linear'
        parts[row.ID] = new_seq_record
        ids.append(row.ID)

    repository = dc.SequenceRepository(collections={'my_collection': parts})

    assembly = dc.Type2sRestrictionAssembly(parts=ids, enzyme='SapI')
    simulation = assembly.simulate(sequence_repository=repository)

    # Print the ID and length of the generated construct(s)
    for record in simulation.construct_records:
        print (record.id, len(record))

    # Get a list of dictionnaries with data on each construct
    constructs_data = simulation.compute_all_construct_data_dicts()

    report_writer = dc.AssemblyReportWriter(
        include_mix_graphs=True, include_part_plots=True
    )

    # Write a full report with sequences and figures in a zip.
    simulation.write_report(f"Design{i}", report_writer=report_writer)
    assert len(simulation.construct_records) == 1
    predicted_construct_record = simulation.construct_records[0]

    assert sequences_are_circularly_equal([predicted_construct_record, target_sequence])
    print("VALID!")

Zulko commented 3 years ago

Thanks both! I'm not sure exactly what the underlying problem was but happy that your changes work. The problem with the hard-coded 2 is that it won't work for longer homologies (Gibson Assembly, yeast recombination etc.). This makes test_circular_sequences.py fail. Would it be possible to replace these lines with the following, which should still work for you, and makes the current tests pass?

half_homology = int(numpy.ceil(self.max_homology_size / 2))

Would it be possible to create a minimal test (e.g. based on the script above, but without the trials loop) to add to the test suite?

jlerman44 commented 3 years ago

@Zulko, sounds good. @alexsongrj and I will make this change, add a minimal test, and ensure all other tests pass.

alexsongrj commented 3 years ago

@Zulko we made a minimal test in test_spai_enzyme_restriction_site.py with a random sequence plasmid. Would you mind take a look? thanks:)

jlerman44 commented 3 years ago

@veghp - good to go?

Zulko commented 3 years ago

This looks good to me. @veghp are you ok to merge and release?

veghp commented 3 years ago

Missed the last question, but now merged and released. Also fixed a typo in the test file name. Thanks for the contribution!

Edinburgh-Genome-Foundry / DnaWeaver

ENH: add SapI enzyme restriction sites #10