Edinburgh-Genome-Foundry / DnaWeaver

A route planner for DNA assembly
https://dnaweaver.genomefoundry.org
MIT License
29 stars 9 forks source link

dealing with circular DNA #1

Closed aaroncooper closed 4 years ago

aaroncooper commented 4 years ago

Hey-- this is an awesome library, and thanks for developing and sharing it.

I've been playing around with it to think about how I might be able to use it, and I've been testing it by creating a blast db with a bunch of preexisting vectors. I'm testing by trying to build a new one with a small insert, and it looks like the strategies are going to work. However, the origin of those vectors seems to be causing what I consider to be suboptimal solutions. All solutions are creating a junction at the origin with two PCR products rather than making one product going across the origin.

Is this a known issue? I've also tried making a Golden Gate-compatible version of that vector and loading it into a GoldenGatePartsLibrary, but I still just get the suboptimal PCR product solutions. Do you think I might be doing something wrong?

Zulko commented 4 years ago

Hey, thanks for your interest in DNA Weaver! The library is still undertested for circular sequences afaik, but there should be ways to make it work as you intend.

I believe I understood the issue correctly, and I agree with your approach of providing the origin as a "part" (this I almost equivalent to saying "I have the backbone already"). I am not completely sure why this doesn't work in your example, i.e. why the fragment is not recognized. Do you have a minimal script (ideally with short-ish sequences) to reproduce the issue?

Another and perhaps simpler solution uses the cut_location_constraints parameter to forbid cutting inside a sequence of interest:

uncuttable_sequence = 'CGGGTGCTTGGTGAGACAGCGTATTTTGC'

def cut_location_constraint(sequence):
    """Prevent an assembly station from cutting inside the uncuttable_sequence"""
    index = sequence.find(uncuttable_sequence)
    if index == -1:
        return lambda index: True  # allow any cut
    start, end = index, index + len(uncuttable_sequence)
    return lambda index: not (start <= index <= end)  # refuse cuts on that segment

assembly_station = dw.DnaAssemblyStation(
    name="Gibson Assembly Station",
    assembly_method=dw.GibsonAssemblyMethod(
        overhang_selector=dw.TmSegmentSelector(min_tm=55, max_tm=70),
        min_segment_length=500,
        max_segment_length=4000,
        cut_location_constraints=[cut_location_constraint]  # <=========
    ),
    supplier=[...],
    ...
)

On another note, if you want to check the solutions returned by DnaWeaver, you can use DnaCauldron, which will take your fragments, your enzyme, and will check that the restriction-ligation will produce the expected construct.

import dnacauldron as dc

repository = dc.SequenceRepository(parts={"f1": fragment_1, "part_B":...})
assembly = dc.Type2sRestrictionAssembly(parts_names)
simulation = assembly.simulate(sequence_repository=repository)
assert len(simulation.construct_records) == 1
predicted_construct_record = simulation.construct_records[0]

from geneblocks import  sequences_are_circularly_equal
sequences_are_circularly_equal([predicted_construct_record, my_reference_sequence])
aaroncooper commented 4 years ago

Thanks for the incredibly quick response! I think my issue might be caused by GG fragments not being recognized properly-- I made a test case where the result can be made by GG of just two fragments in the parts library, but the script is suggesting I order synthetic DNA for the whole construct. Any idea what I'm doing wrong? testing_GG.zip

veghp commented 4 years ago

Thank you for bringing attention to this. I can confirm that your fasta sequences do not assemble with https://cuba.genomefoundry.org/simulate_gg_assemblies .

If you convert them to genbank using https://cuba.genomefoundry.org/convert_sequence_files then the assembly works (note that the names need to be shortened to 20 char to fit the genbank format): simulated

I'll look into the DNA Weaver part of the problem. I suspect it may be related to the conversion of fasta to BioPython records.

aaroncooper commented 4 years ago

Hey Peter, thanks for checking this out. Are you saying I need to convert the parts files to GenBank? just wondering how I would import them if so-- looks like the GoldenGatePartsLibrary constructor only allows fasta files.

Zulko commented 4 years ago

@veghp what you are observing with DnaCauldron might be an issue with the mix of upper case and lower case nucleotides in the sequences. It is different from the issue with DNA Weaver.

@aaroncooper The main issue in your files is that the part sequences in your fasta should have a N for the "wildcard" nucleotide, like in this example. For instance the first part sequence should read cgtctcNATGG...TAANgagacg. I know, it's not super practical and could certainly be improved in the future.

Once you fix that, you will see that DNA Weaver is not always great at finding the best strategies for the edges of the sequence, and in particular it may not manage to use a DNA part if it is located at the very beginning of the sequence. This can be solved by asking Weaver to find strategies for different "rotated" versions of your record, which is slower but works. So the end of your script becomes:

best_quote = None
# try successive rotations of the record, every 200bp
for rotation in range(0, len(record), 200):
    rotated_record = record[rotation:] + record[:rotation]
    desired_sequence = dw.SequenceString.from_record(rotated_record, topology='circular')
    assembly_station.prepare_network_on_sequence(desired_sequence)
    quote = assembly_station.get_quote(desired_sequence)
    if quote.accepted and ((best_quote is None) or (quote.price < best_quote.price)):
        best_quote = quote
print (best_quote.assembly_step_summary())

(for memo @veghp : I did that recently for the synbiocad project and it worked fine)

aaroncooper commented 4 years ago

Got it, that makes sense. Representing circular DNA is super annoying, eh?

I actually just got done playing around with this more-- I had surmised that the N might be necessary, as the EMMA example was running fine for me and that was the only difference I could perceive. That does seem to fix the issue! I did run into another bug, very minor, which makes sense-- price of 0 is nonsensical:

Ordering plan (Golden Gate Assembly Station)::                                                                              
  0-722: From GG_parts - price 0.00 - lead_time 0.0 - Part: test_EGFP_GG_source
  722-2649: From GG_parts - price 0.00 - lead_time 0.0 - Part: test_pUC57mini_GG_vector
Price:0.00, total lead_time:0.0
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Price'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dnaweaver_GG_test.py", line 89, in <module>
    report.write_full_report("report.zip")
  File "/usr/local/lib/python3.7/site-packages/dnaweaver/AssemblyPlanReport/mixins/FolderReportMixin.py", line 15, in write_full_report
    self.write_pdf_report(root._file("assembly_report.pdf").open("wb"))
  File "/usr/local/lib/python3.7/site-packages/dnaweaver/AssemblyPlanReport/mixins/PdfReportMixin.py", line 177, in write_pdf_report
    html_out = self.make_html_report()
  File "/usr/local/lib/python3.7/site-packages/dnaweaver/AssemblyPlanReport/mixins/PdfReportMixin.py", line 142, in make_html_report
    cost_orders = orders_dataframe["Price"].sum()
  File "/usr/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Price'
root@192ba9ba9de8:~/test/testing_GG# 
aaroncooper commented 4 years ago

Sorry that this has bled into another issue, and I'm happy to open a different one if you think appropriate. I have tried defining a non-zero price_per_part but it is still crashing for this KeyError. Defining a non-zero cost for the GoldenGateAssemblyMethod does not rescue it either.

Zulko commented 4 years ago

Hmm not sure what causes this error I have never seen it before, do you have a minimal example? And ok for a new issue (because closing issues is good for moral :D )

veghp commented 4 years ago

I tried the solution proposed by @Zulko and it works: assembly_report.pdf

One shortcoming is that DNA Weaver will cut the section where the "new" origin is into two orders. This also means it will not recognize parts from your library, for that section.

I did not come across the above KeyError problem. In the new issue can you please confirm which version of DNA Weaver you are using (PyPI or Github)?

Zulko commented 4 years ago

@veghp The assembly report you posted shows a solution that doesn't use the parts library, but with the rotation strategy I suggest above this is solved and the final best_quote uses the parts library.

veghp commented 4 years ago

Yes, sorry. Here is the correct report: assembly_report.pdf

aaroncooper commented 4 years ago

Here's some code and input with the price error. testing_GG.zip

I'm using dnaweaver from PyPI, just installed it a few days ago. I looked back at my docker image build and it seems to be 0.3.4. (I tried looking at the __version__ attribute of dnaweaver in the python interpreter, but it doesn't seem to have that attribute.)

aaroncooper commented 4 years ago

I couldn't help looking into it a bit, and it seems like in PdfReportMixin, orders_dataframe, unlike pcr_dataframe and parts_dataframe, has no explicit column definition at time of dataframe creation. Perhaps adding that would fix this?

Zulko commented 4 years ago

It is strange that it works for both @veghp and I, but not for you. What is your pandas version? Maybe it is a python3.7 thing, or a pandas version problem (do you have pandas >1.0?). In last resort you can also try installing the Github master with pip3 install git+git@github.com:Edinburgh-Genome-Foundry/DnaWeaver.git.

aaroncooper commented 4 years ago

Oh, interesting-- the last example I sent is working for you without the Price key error?

I'm on pandas 1.0.3.

Zulko commented 4 years ago

Sorry, I went too fast. You are right, your example did produce the error you reported, and the fix you suggested also works, and is now merged in master, so if you upgrade your example should work.

pip3 install git+git@github.com:Edinburgh-Genome-Foundry/DnaWeaver.git

Apparently you were the first user to have a case with 0 orders, and wanting a report out of it!

aaroncooper commented 4 years ago

Awesome, thanks! Happy to oblige and break some code :)

I changed my docker build to pull directly from github (unfortunately I have to install git in the first place now, but that's no big deal). It's now working-- thanks for the quick fix.

From earlier in our conversation, sounds like you're thinking of how to work better with circular DNA. I think I can just set the origin on a junction for the time being to work around that, but it'd be fantastic for a future release!

Zulko commented 4 years ago

:+1: Closing this thread. Please report any other bugs/difficulties you find!

For circular DNA, the rotation trick above should work in most cases, but it is still weak in edge cases like "a whole plasmid made only of existing parts", for which a more direct method may be better adapted (e.g. identify the required parts directly using a blast search).