Open JamesBagley opened 1 year ago
Hi, and thanks for the positive review. Thank you for taking the time to report this and making pydna better. I think I have solved this bug and a new alpha version will be available shortly.
Hi @BjornFJohansson I think I had mentioned this somewhere, but I could not find where. This keeps happenning, but I won't fix it here since the new assembly implementation does not have that problem. I was preparing a minimal example for #255 and I came across this error, and a similar one in the PCR module (#262).
In the example below, I amplify a fragment from pombe's genome containing the gene ase1, and clone it into an expression vector using Gibson assembly:
Files to reproduce: reproduce.zip
from pydna.design import assembly_fragments, primer_design
from pydna.parsers import parse
from pydna.dseqrecord import Dseqrecord
from pydna.amplicon import Amplicon
from pydna.amplify import pcr
from pydna.assembly import Assembly
from pydna.common_sub_strings import terminal_overlap
# Read the plasmid
vector: Dseqrecord = parse('pREP42-MCS+.gb')[0]
# Read the gene we want to clone
genomic_dna: Dseqrecord = parse('ase1.gb')[0]
# Select the ase1 CDS from the feature
ase1_feature = next(
f for f in genomic_dna.features if (f.type == 'CDS' and 'gene' in f.qualifiers and 'ase1' in f.qualifiers['gene'])
)
# Select the entire plasmid, shifted so that the end of the promoter is at the origin
promoter_feature = next(
f
for f in vector.features
if f.type == 'promoter' and 'label' in f.qualifiers and 'nmt1 P41 promoter' in f.qualifiers['label']
)
vector_shifted = vector.shifted(promoter_feature.location.end)
# Design primers (without overhangs)
vector_amplicon_pre = primer_design(vector_shifted)
# here we use indexing instead of feature extract to keep the features
ase1_amplicon_pre = primer_design(genomic_dna[ase1_feature.location.start : ase1_feature.location.end])
# Design assembly primers (same as previous, but with overhangs for Gibson assembly)
# We include the vector twice to create a circular recombination
asm: list[Amplicon] = assembly_fragments([vector_amplicon_pre, ase1_amplicon_pre, vector_amplicon_pre])
# For the vector, we use the forward primer of the last amplicon and the reverse primer of the first amplicon
print('vector fwd:', asm[2].primers()[0].seq)
print('vector rev:', asm[0].primers()[1].seq)
print()
vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector_shifted)
# This should give the same result but it doesn't, probably a bug
# in PCR
# vector_amplicon = pcr(asm[2].primers()[0], asm[0].primers()[1], vector)
# For the ase1, we keep it as is
print('ase1 fwd:', asm[1].primers()[0].seq)
print('ase1 rev:', asm[1].primers()[1].seq)
print()
ase1_amplicon = asm[1]
# Do the Gibson assembly
asm = Assembly([vector_amplicon, ase1_amplicon], algorithm=terminal_overlap)
# The ase1 CDS feature is misanotated here, see issue
product = asm.assemble_circular()[0]
print(product)
Several features become misannotated:
You can see it in the image:
And this is how the pnmt1 and primer annotations look:
promoter join(7162..8315,1)
/label="nmt1 P41 promoter"
/note="mutant nmt1 promoter from Schizosaccharomyces pombe,
conferring medium strength thiamine-repressible expression"
primer_bind complement(join(8294..8315,1))
/label="r8315"
/PCR_conditions="primer
sequence:CATCATTACTGTTTGCATTGATTTAACAAAGCGACTATAAG"
/ApEinfo_fwdcolor="#baffa3"
/ApEinfo_revcolor="#ffbaba"
As a workaround for my initial problem I used this to get around the issue. I had three fragments assembling in a linear fashion, and I imagine it's exclusively useful for that, but maybe it'll be informative anyway
repaired_fragment = Assembly(fragments).assemble_linear()[0]
switches = list(map(SeqRecord, list(repaired_fragment.graph.edges.items())[1][0]))
UP_trimmed = UP[:UP.find(switches[0])+len(switches[0])]
DW_trimmed = DW[DW.find(switches[1]):]
donor_trimmed = donor[donor.find(switches[0]):donor.find(switches[1])+len(switches[1])]
fragments = [UP_trimmed] + [donor_trimmed] + [DW_trimmed]
repaired_fragment = Assembly(fragments).assemble_linear()
I've noticed in certain cases the Assembly.assemble_linear() method can produce misannotated sequences, specifically it seems to shift the annotations along the sequence, so they have the same length, same name, but correspond to different basepairs.
The test case below should reproduce the error, its based on a modified version of the original assemble_linear() test case you have in pydna/tests.py
The difference in test is that a) annotations are added, and b) an extra base is added to Dseqrecord c
If the first G is removed from Dseqrecord c, the test passes. Introducing extra bases to 5' or 3' ends of Dseqrecords a & b does not produce an error.
I've also implemented it as a colab notebook with the original test case, and a case where the assembled sequence is annotated correctly https://colab.research.google.com/drive/1akdSdrVGu7w5mD2jd7HJ-hD17J4zApJj?usp=sharing
Thank you for creating such a brilliant (and beautifully written) package