Linear assembly producing misannotated sequences

JamesBagley commented 1 year ago

I've noticed in certain cases the Assembly.assemble_linear() method can produce misannotated sequences, specifically it seems to shift the annotations along the sequence, so they have the same length, same name, but correspond to different basepairs.

The test case below should reproduce the error, its based on a modified version of the original assemble_linear() test case you have in pydna/tests.py

The difference in test is that a) annotations are added, and b) an extra base is added to Dseqrecord c

If the first G is removed from Dseqrecord c, the test passes. Introducing extra bases to 5' or 3' ends of Dseqrecords a & b does not produce an error.

def test_linear_with_annotations2(monkeypatch):
    from pydna._pretty import pretty_str
    from pydna.assembly import Assembly
    from pydna.dseqrecord import Dseqrecord

    a = Dseqrecord("acgatgctatactgtgCCNCCtgtgctgtgctcta")
    a.add_feature(0,10,label='a_feat')
    a_feat_seq = a.features[0].extract(a)
    # 12345678901234
    b = Dseqrecord("tgtgctgtgctctaTTTTTTTtattctggctgtatcCCCCCC")
    b.add_feature(0,10,label='b_feat')
    b_feat_seq = b.features[0].extract(b)

    # 123456789012345
    c = Dseqrecord("GtattctggctgtatcGGGGGtacgatgctatactgtg")
    c.add_feature(0,10,label='c_feat')
    c_feat_seq = c.features[0].extract(c)

    feature_sequences = {'a_feat':a_feat_seq,
                         'b_feat':b_feat_seq,
                         'c_feat':c_feat_seq}

    a.name = "aaa"  # 1234567890123456
    b.name = "bbb"
    c.name = "ccc"
    asm = Assembly((a, b, c), limit=14)
    x = asm.assemble_linear()[0]
    #print(x.features)
    #print(x)
    answer = 'aaa|14\n    \\/\n    /\\\n    14|bbb|15\n           \\/\n           /\\\n           15|ccc'

    assert x.figure() == answer.strip()
    answer = 'acgatgctatactgtgCCNCCtgtgctgtgctcta\n                     TGTGCTGTGCTCTA\n                     tgtgctgtgctctaTTTTTTTtattctggctgtatc\n                                          TATTCTGGCTGTATC\n                                          tattctggctgtatcGGGGGtacgatgctatactgtg\n'
    assert x.detailed_figure()
    for feat in x.features:

        try:
            assert feat.extract(x).seq == feature_sequences[feat.qualifiers['label']].seq
        except(AssertionError):
            print(feat.qualifiers['label'])
            print(feat.extract(x).seq, 'extracted feat')
            print(feature_sequences[feat.qualifiers['label']].seq, 'original feat')
            assert feat.extract(x).seq == feature_sequences[feat.qualifiers['label']].seq
test_linear_with_annotations2('')

I've also implemented it as a colab notebook with the original test case, and a case where the assembled sequence is annotated correctly https://colab.research.google.com/drive/1akdSdrVGu7w5mD2jd7HJ-hD17J4zApJj?usp=sharing

Thank you for creating such a brilliant (and beautifully written) package

BjornFJohansson commented 1 year ago

Hi, and thanks for the positive review. Thank you for taking the time to report this and making pydna better. I think I have solved this bug and a new alpha version will be available shortly.

manulera commented 2 months ago

Hi @BjornFJohansson I think I had mentioned this somewhere, but I could not find where. This keeps happenning, but I won't fix it here since the new assembly implementation does not have that problem. I was preparing a minimal example for #255 and I came across this error, and a similar one in the PCR module (#262).

In the example below, I amplify a fragment from pombe's genome containing the gene ase1, and clone it into an expression vector using Gibson assembly:

Files to reproduce: reproduce.zip

from pydna.design import assembly_fragments, primer_design
from pydna.parsers import parse
from pydna.dseqrecord import Dseqrecord
from pydna.amplicon import Amplicon
from pydna.amplify import pcr
from pydna.assembly import Assembly
from pydna.common_sub_strings import terminal_overlap

# Read the plasmid
vector: Dseqrecord = parse('pREP42-MCS+.gb')[0]

# Read the gene we want to clone
genomic_dna: Dseqrecord = parse('ase1.gb')[0]

# Select the ase1 CDS from the feature
ase1_feature = next(
    f for f in genomic_dna.features if (f.type == 'CDS' and 'gene' in f.qualifiers and 'ase1' in f.qualifiers['gene'])
)

# Select the entire plasmid, shifted so that the end of the promoter is at the origin
promoter_feature = next(
    f
    for f in vector.features
    if f.type == 'promoter' and 'label' in f.qualifiers and 'nmt1 P41 promoter' in f.qualifiers['label']
)
vector_shifted = vector.shifted(promoter_feature.location.end)

# Design primers (without overhangs)
vector_amplicon_pre = primer_design(vector_shifted)
# here we use indexing instead of feature extract to keep the features
ase1_amplicon_pre = primer_design(genomic_dna[ase1_feature.location.start : ase1_feature.location.end])

# Design assembly primers (same as previous, but with overhangs for Gibson assembly)
# We include the vector twice to create a circular recombination
asm: list[Amplicon] = assembly_fragments([vector_amplicon_pre, ase1_amplicon_pre, vector_amplicon_pre])

# For the vector, we use the forward primer of the last amplicon and the reverse primer of the first amplicon
print('vector fwd:', asm[2].primers()[0].seq)
print('vector rev:', asm[0].primers()[1].seq)
print()
vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector_shifted)

# This should give the same result but it doesn't, probably a bug
# in PCR
# vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector)

# For the ase1, we keep it as is
print('ase1 fwd:', asm[1].primers()[0].seq)
print('ase1 rev:', asm[1].primers()[1].seq)
print()
ase1_amplicon = asm[1]

# Do the Gibson assembly
asm = Assembly([vector_amplicon, ase1_amplicon], algorithm=terminal_overlap)

# The ase1 CDS feature is misanotated here, see issue
product = asm.assemble_circular()[0]

print(product)

Several features become misannotated:

The CDS of ase1 (shifted by 18, the homology region of the Gibson assembly)
The promoter (it ends at position 1, but should end just before the origin)
A primer (it ends at position 1, but should end just before the origin)

You can see it in the image:

Screenshot 2024-09-13 at 13 37 20

And this is how the pnmt1 and primer annotations look:

     promoter        join(7162..8315,1)
                     /label="nmt1 P41 promoter"
                     /note="mutant nmt1 promoter from Schizosaccharomyces pombe,
                     conferring medium strength thiamine-repressible expression"
     primer_bind     complement(join(8294..8315,1))
                     /label="r8315"
                     /PCR_conditions="primer
                     sequence:CATCATTACTGTTTGCATTGATTTAACAAAGCGACTATAAG"
                     /ApEinfo_fwdcolor="#baffa3"
                     /ApEinfo_revcolor="#ffbaba"

JamesBagley commented 2 months ago

As a workaround for my initial problem I used this to get around the issue. I had three fragments assembling in a linear fashion, and I imagine it's exclusively useful for that, but maybe it'll be informative anyway

repaired_fragment = Assembly(fragments).assemble_linear()[0]
switches = list(map(SeqRecord, list(repaired_fragment.graph.edges.items())[1][0]))
      UP_trimmed = UP[:UP.find(switches[0])+len(switches[0])]
      DW_trimmed = DW[DW.find(switches[1]):]
      donor_trimmed = donor[donor.find(switches[0]):donor.find(switches[1])+len(switches[1])]
      fragments = [UP_trimmed] + [donor_trimmed] + [DW_trimmed]
      repaired_fragment = Assembly(fragments).assemble_linear()

BjornFJohansson / pydna

Linear assembly producing misannotated sequences #136