Closed simone-pignotti closed 4 years ago
You are right, there is no support for circularity at the moment in DNA chisel (or more precisely, there is no support for cross-origin constraints and objectives).
It won't be very easy to add support for circularity (I think I see how it could be done, but it may take some time). Does your problem only involve enzymes sites? In this case I could propose a quick fix.
While codon optimization can be hacked by shifting the sequence so that no cross-origin CDS appears, and the impact of the cross-origin windows on GC content should be negligible, enzymes sites really need this feature. So yes, I would be glad to hear the quick fix!
Ok it was definitely not an easy one, but I believe I have a working solution.
It defines a new specification called AvoidPatternInCircularSequence
which you use the same way as AvoidPattern
, except that it applies to the whole sequence and will look for patterns that are cross-origin.
You can just add this definition at the top of your script and use the spec along any other specs (there is also an example at the bottom).
Let me know if it works for you or if something is unclear.
from dnachisel import AvoidPattern, Location, SpecEvaluation
class AvoidPatternInCircularSequence(AvoidPattern):
"""Flavor of AvoidPattern that considers the sequence circular.
It only applies to a full sequence. The behavior when applied to
only a sequence portions is untested.
It works mostly by adding the beginning of the sequence to its end,
so that cross-origin patterns may emerge.
"""
is_localized = False
right_hand = None
left_hand = None
def evaluate(self, problem):
"""Return score=-number_of_occurences. And patterns locations."""
if self.is_localized:
# When applied locally, find occurences of the pattern.
# All that matters in this section is to find if the pattern
# is present. The exact locations are not important.
sequence = self.location.extract_sequence(problem.sequence)
if self.right_hand:
sequence = sequence + self.right_hand
elif self.left_hand:
sequence = self.left_hand + sequence
locations = self.pattern.find_matches(sequence)
else:
# When applied globally, find all occurences of the pattern
# and their location. if a location is cross-origin, split
# it in two (one at the beginning, one at the end).
sequence = problem.sequence
L = len(sequence)
locations = self.pattern.find_matches(
sequence + sequence[:constraint.pattern.size])
locations = [loc for loc in locations if loc.start < L]
if len(locations) and locations[-1].end > L:
first_half= Location(0, locations[-1].end - L)
last_half = Location(locations[-1].start, L)
locations = [first_half] + locations[:-1] + [last_half]
score = -len(locations)
if score == 0:
message = "Passed. Pattern not found !"
else:
message = "Failed. Pattern found at positions %s" % locations
return SpecEvaluation(
self, problem, score, locations=locations, message=message
)
def localized(self, location, problem=None, with_righthand=True):
new_spec = AvoidPattern.localized(self, location, problem=problem,
with_righthand=with_righthand)
new_spec.is_localized = True
if location.end == len(problem.sequence):
new_spec.right_hand = problem.sequence[:constraint.pattern.size]
if location.start == 0:
new_spec.left_hand = problem.sequence[-constraint.pattern.size:]
return new_spec
# EXAMPLE OF USE:
from dnachisel import random_dna_sequence, DnaOptimizationProblem
sequence = "CTC" + random_dna_sequence(5000, seed=123) + "CGT"
problem = DnaOptimizationProblem(
sequence,
constraints=[
AvoidPatternInCircularSequence("BsmBI_site")
]
)
problem.resolve_constraints()
Thank you very much, I will test it asap!
Hi, after replacing the variable constraint
with self
inside AvoidPatternInCircularSequence
, the example works, thank you!
On the other hand, after the optimization problem.n_mutations
is 2 while 3 mutations occurred. I guess the one on the boundary is not counted.
Hello - great that it works. I am not certain it will always work but the good thing is, if it fails to solve, it will raise an exception, so you will be aware of it. In any case I'll start thinking of a generic way of solving circular sequences (where the DnaOptimizationProblem
has a sequence_is_circular=True
). probably in a week or two.
Regarding problem.n_mutations
, I understand that this parameter might be confusing. It does not refer to the maximal number of mutation, but to the number of simultaneous mutations that will be done at each iteration, when the DNA Chisel algorithm uses a random (directed) search. I'll probably change the name in the future for clarity.
I'll start thinking of a generic way of solving circular sequences
That's great, thanks a lot!
Regarding
problem.n_mutations
, I understand that this parameter might be confusing.
I see, sorry I missed that from the docs but I found it now. I agree it's a bit misleading anyway.
I forgot a pro tip: if you want to minimize the number of mutations, you need to introduce a AvoidChanges
objectives and optimize the sequence after you resolve the constraints:
problem = DnaOptimizationProblem(
constraints=[...],
objectives=[AvoidChanges()]
)
problem.resolve_constraints()
problem.optimize()
The last line's optimization with undo every mutation that is not stricly necessary to solve the constraints.
Yes thank you, I was actually hoping for this to work and I am glad you confirmed that.
I have a first support for circular DNA optimization problems now if you want to test it:
There are no warranties at this stage, but the new class CircularDnaOptimizationProblem (while slower) should work for any kind of specification you throw at it.
I have worked a bit more on this and it seems to work decently. It is in the latest PyPI version. I'll close this for now (please open a new issue if necessary), and I'll keep the n_mutations change in mind.
Hi, I am very sorry for the late reply, I still haven't had a chance to test the CircularDnaOptimizationProblem class! I will as soon as I can though, thanks a lot for implementing this extremely useful feature!
No worries, let me know how it goes when you try.
Hi,
I am using the current master branch (5f5c6cec0d51de2683ec643fae79356f0031821d) and I am having some issues with the AvoidChanges
specification combined with the CircularDnaOptimizationProblem
class. As you suggested, I am using AvoidChanges
to minimize the changes to the sequence (opt), but also to avoid changes completely in specific regions (constr).
I have prepared a test script to reproduce the error that you'll find attached. I tested it calling:
./dnc_test.py -c -s e_coli -m 0.4 -x 0.6 -w 10 -z -f 0.1 -r test_reports test_seq.gb restriction_sites.tsv test_opt.gb
Traceback:
Traceback (most recent call last):
File "./dnc_test.py", line 289, in <module>
main()
File "./dnc_test.py", line 258, in main
project_name=f"Optimization of {os.path.basename(in_fn)}"
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/CircularDnaOptimizationProblem.py", line 141, in optimize_with_report
return problem.optimize_with_report(target, project_name=project_name)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 771, in optimize_with_report
self.optimize()
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 637, in optimize
self.optimize_objective(objective=objective)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 621, in optimize_objective
local_problem.optimize_by_exhaustive_search()
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 498, in optimize_by_exhaustive_search
current_best_score = self.objective_scores_sum()
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 229, in objective_scores_sum
return self.objectives_evaluations().scores_sum()
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 226, in objectives_evaluations
return ProblemObjectivesEvaluations.from_problem(self)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/SpecEvaluation.py", line 407, in from_problem
for specification in problem.objectives
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/SpecEvaluation.py", line 407, in <listcomp>
for specification in problem.objectives
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/builtin_specifications/AvoidChanges.py", line 89, in evaluate
sequences_differences_array(sequence, target)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/biotools/sequences_differences.py", line 13, in sequences_differences_array
"Only use on same-size sequences (%d, %d)" % (len(seq1), len(seq2))
ValueError: Only use on same-size sequences (1, 0)
Without -c
(circular sequence), everything works as expected. Let me know if you can reproduce the issue or if you need some more info. Also, feel free to use snippets from the script in examples for the docs. Thanks in advance, and good job with the latest commits!
I just noticed two more issues:
CircularDnaOptimizationProblem
and DnaOptimizationProblem
, the strand of AvoidPattern
occurrences in the constraint breaches reports is wrong;CircularDnaOptimizationProblem.to_record
returns the original sequence unchangedEDIT: it's actually one issue, the CircularDnaOptimizationProblem.to_record
one. The other one is me being silly.
Regarding the original bug: looks fixable. I think my test cases all involved constraints, but I might have overlooked objective optimization for circular sequences. Could you send a minimum python script reproducing the issue, such as:
import dnachisel as dc
problem = dc.DnaOptimizationProblem(
sequence=dc.load_record('dnc_test.gb')
constraints=[???],
objectives=[???]
)
problem.resolve_constraints()
problem.optimize()
I tried the following, but another bug pops up:
import dnachisel as dc
problem = dc.CircularDnaOptimizationProblem(
sequence=dc.load_record('test_seq.gb'),
constraints=[dc.AvoidPattern(pattern=dc.SequencePattern.DnaNotationPattern('GAAABCC'))],
objectives=[dc.AvoidChanges()]
)
problem.resolve_constraints()
problem.optimize()
Traceback (most recent call last):
File "dnc_test.py", line 50, in <module>
problem.optimize()
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/CircularDnaOptimizationProblem.py", line 131, in optimize
self, with_constraints=True, with_objectives=True
TypeError: _circularized_view() got multiple values for argument 'with_objectives'
The same works fine with DnaOptimizationProblem
.
Anyway, replacing problem.resolve_constraints()
and problem.optimize()
with problem.optimize_with_report(target='report')
:
Traceback (most recent call last):
File "dnc_test.py", line 7, in <module>
problem.optimize_with_report(target='report')
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/CircularDnaOptimizationProblem.py", line 141, in optimize_with_report
return problem.optimize_with_report(target, project_name=project_name)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 771, in optimize_with_report
self.optimize()
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 637, in optimize
self.optimize_objective(objective=objective)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 566, in optimize_objective
evaluation = objective.evaluate(self)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/builtin_specifications/AvoidChanges.py", line 89, in evaluate
sequences_differences_array(sequence, target)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/biotools/sequences_differences.py", line 13, in sequences_differences_array
"Only use on same-size sequences (%d, %d)" % (len(seq1), len(seq2))
ValueError: Only use on same-size sequences (285, 95)
Note that the ValueError
is different from my previous case, where it was:
ValueError: Only use on same-size sequences (1, 0)
No warranties, but can you give it a try now (from the github master)
Thanks a lot for the really quick fix! Looks like it's working now (even the CircularDnaOptimizationProblem.to_record
function). Closing for now, I'll let you know if I find other issues.
Ops, running on a bigger instance results in this error:
Solving constraints
Now optimizing the sequence
Success! Generating report.
Traceback (most recent call last):
File "./dnc.py", line 281, in <module>
main()
File "./dnc.py", line 255, in main
problem.optimize_with_report(target=report_dir)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/CircularDnaOptimizationProblem.py", line 141, in optimize_with_report
return problem.optimize_with_report(target, project_name=project_name)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 774, in optimize_with_report
target, self, project_name=project_name
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/reports/optimization_reports.py", line 264, in write_optimization_report
contract_under=contract_under)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/geneblocks/DiffBlocks.py", line 160, in from_sequences
blocks_in_seqs, remarks = get_optimal_common_blocks(common_blocks)
File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/geneblocks/DiffBlocks.py", line 462, in get_optimal_common_blocks
s1_loc = blocks_in_seqs_dicts['s1'][block_name]['location']
KeyError: 'block_03*'
Exact same code running on the test sequence (attached to the previous comment) runs fine.
Sounds like an error in Geneblocks when plotting the figure in the report, so a different issue. Will look into it.
Thank you! If that helps, the sequence is ~12kb and everything works fine with DnaOptimizationProblem
.
I guess you had some kind of homology in the sequence that caused a pretty niche bug. Can you try to upgrade geneblocks and see if it solves the problem?
pip(3) install --upgrade geneblocks
That solves it, thanks! On the other hand, it breaks CircularDnaOptimizationProblem.to_record
for some reason :/ the output sequence is the original one.
Just another idea for the future: it would be really cool to have a dna_features_viewer.CircularGraphicRecord.plot_with_bokeh
option for reports (the PDF is really hard to browse for long sequences, above all for circular ones where it contains 3 times as many bases)
Are you sure the CircularDnaOptimizationProblem.to_record
is a bug, is it not that the sequence didn't need to be optimized so DNA Chisel didn't do any optimization? Can you provide a minimal example to reproduce the issue?
Sorry couldn't find the time to create a minimal example, but I think it is linked to the optimize_with_report
function because running resolve_constraints
followed by optimize
results in to_record
working. Maybe the optimize_with_report
calls to_record
already, which for some reason modifies the sequence object?
You were right, the optimize_with_report
method was very wrong. I have hopefully fixed it now, please upgrade (pip(3) install --upgrade dnachisel
) and let me know if it works.
Thanks a lot, it works now! This will be the last comment on this very long thread, promised :) I'll take the chance to thank you again for this extremely useful library, and all the others you are developing for EGF and kindly sharing with the world. It's a really interesting ecosystem, and I don't think there is a valid alternative that is so open and hackable at the moment! Congrats
Thanks, much appreciated! Useful and hackable is what I aim at with these projects, so I'm happy you noticed.
Let me know if you run into more trouble with any of these repos. And btw I am spending some time cleaning and improving the code, so there might be some minor api changes in the near future (I'll keep track of changes there).
Hi, Does DnaChisel support circular sequences? I may be missing something, but judging from this line and the behavior on a quick test, I doubt it: https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/5a8cce768f975f22d3e7e83cfcf28d2f803a222c/dnachisel/Location.py#L72-L74 Would it be possible to add support for this? I think it is a very useful feature, above all when removing restriction sites, since they could span over the end/start.
Thank you Simone