Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License
213 stars 38 forks source link

Support for circular sequences #11

Closed simone-pignotti closed 4 years ago

simone-pignotti commented 4 years ago

Hi, Does DnaChisel support circular sequences? I may be missing something, but judging from this line and the behavior on a quick test, I doubt it: https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/5a8cce768f975f22d3e7e83cfcf28d2f803a222c/dnachisel/Location.py#L72-L74 Would it be possible to add support for this? I think it is a very useful feature, above all when removing restriction sites, since they could span over the end/start.

Thank you Simone

Zulko commented 4 years ago

You are right, there is no support for circularity at the moment in DNA chisel (or more precisely, there is no support for cross-origin constraints and objectives).

It won't be very easy to add support for circularity (I think I see how it could be done, but it may take some time). Does your problem only involve enzymes sites? In this case I could propose a quick fix.

simone-pignotti commented 4 years ago

While codon optimization can be hacked by shifting the sequence so that no cross-origin CDS appears, and the impact of the cross-origin windows on GC content should be negligible, enzymes sites really need this feature. So yes, I would be glad to hear the quick fix!

Zulko commented 4 years ago

Ok it was definitely not an easy one, but I believe I have a working solution.

It defines a new specification called AvoidPatternInCircularSequence which you use the same way as AvoidPattern, except that it applies to the whole sequence and will look for patterns that are cross-origin.

You can just add this definition at the top of your script and use the spec along any other specs (there is also an example at the bottom).

Let me know if it works for you or if something is unclear.

from dnachisel import AvoidPattern, Location, SpecEvaluation

class AvoidPatternInCircularSequence(AvoidPattern):
    """Flavor of AvoidPattern that considers the sequence circular.

    It only applies to a full sequence. The behavior when applied to
    only a sequence portions is untested.

    It works mostly by adding the beginning of the sequence to its end,
    so that cross-origin patterns may emerge.
    """

    is_localized = False
    right_hand = None
    left_hand = None

    def evaluate(self, problem):
        """Return score=-number_of_occurences. And patterns locations."""

        if self.is_localized:
            # When applied locally, find occurences of the pattern.
            # All that matters in this section is to find if the pattern
            # is present. The exact locations are not important.
            sequence = self.location.extract_sequence(problem.sequence)
            if self.right_hand:
                sequence = sequence + self.right_hand
            elif self.left_hand:
                sequence = self.left_hand + sequence
            locations = self.pattern.find_matches(sequence)
        else:
            # When applied globally, find all occurences of the pattern
            # and their location. if a location is cross-origin, split
            # it in two (one at the beginning, one at the end).
            sequence = problem.sequence
            L = len(sequence)
            locations = self.pattern.find_matches(
                sequence + sequence[:constraint.pattern.size])
            locations = [loc for loc in locations if loc.start < L]
            if len(locations) and locations[-1].end > L:
                first_half=  Location(0, locations[-1].end - L)
                last_half = Location(locations[-1].start, L)
                locations = [first_half] + locations[:-1] + [last_half]

        score = -len(locations)
        if score == 0:
            message = "Passed. Pattern not found !"
        else:
            message = "Failed. Pattern found at positions %s" % locations
        return SpecEvaluation(
            self, problem, score, locations=locations, message=message
        )

    def localized(self, location, problem=None, with_righthand=True):
        new_spec = AvoidPattern.localized(self, location, problem=problem,
                                          with_righthand=with_righthand)
        new_spec.is_localized = True
        if location.end == len(problem.sequence):
            new_spec.right_hand = problem.sequence[:constraint.pattern.size]
        if location.start == 0:
            new_spec.left_hand = problem.sequence[-constraint.pattern.size:]
        return new_spec

# EXAMPLE OF USE:

from dnachisel import random_dna_sequence, DnaOptimizationProblem

sequence = "CTC" + random_dna_sequence(5000, seed=123) + "CGT"
problem = DnaOptimizationProblem(
    sequence,
    constraints=[
        AvoidPatternInCircularSequence("BsmBI_site")
    ]
)
problem.resolve_constraints()
simone-pignotti commented 4 years ago

Thank you very much, I will test it asap!

simone-pignotti commented 4 years ago

Hi, after replacing the variable constraint with self inside AvoidPatternInCircularSequence, the example works, thank you! On the other hand, after the optimization problem.n_mutations is 2 while 3 mutations occurred. I guess the one on the boundary is not counted.

Zulko commented 4 years ago

Hello - great that it works. I am not certain it will always work but the good thing is, if it fails to solve, it will raise an exception, so you will be aware of it. In any case I'll start thinking of a generic way of solving circular sequences (where the DnaOptimizationProblem has a sequence_is_circular=True). probably in a week or two.

Regarding problem.n_mutations, I understand that this parameter might be confusing. It does not refer to the maximal number of mutation, but to the number of simultaneous mutations that will be done at each iteration, when the DNA Chisel algorithm uses a random (directed) search. I'll probably change the name in the future for clarity.

simone-pignotti commented 4 years ago

I'll start thinking of a generic way of solving circular sequences

That's great, thanks a lot!

Regarding problem.n_mutations, I understand that this parameter might be confusing.

I see, sorry I missed that from the docs but I found it now. I agree it's a bit misleading anyway.

Zulko commented 4 years ago

I forgot a pro tip: if you want to minimize the number of mutations, you need to introduce a AvoidChanges objectives and optimize the sequence after you resolve the constraints:

problem = DnaOptimizationProblem(
    constraints=[...],
    objectives=[AvoidChanges()]
)
problem.resolve_constraints()
problem.optimize()

The last line's optimization with undo every mutation that is not stricly necessary to solve the constraints.

simone-pignotti commented 4 years ago

Yes thank you, I was actually hoping for this to work and I am glad you confirmed that.

Zulko commented 4 years ago

I have a first support for circular DNA optimization problems now if you want to test it:

https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/master/examples/example_with_circular_sequence.py

There are no warranties at this stage, but the new class CircularDnaOptimizationProblem (while slower) should work for any kind of specification you throw at it.

Zulko commented 4 years ago

I have worked a bit more on this and it seems to work decently. It is in the latest PyPI version. I'll close this for now (please open a new issue if necessary), and I'll keep the n_mutations change in mind.

simone-pignotti commented 4 years ago

Hi, I am very sorry for the late reply, I still haven't had a chance to test the CircularDnaOptimizationProblem class! I will as soon as I can though, thanks a lot for implementing this extremely useful feature!

Zulko commented 4 years ago

No worries, let me know how it goes when you try.

simone-pignotti commented 4 years ago

Hi,

I am using the current master branch (5f5c6cec0d51de2683ec643fae79356f0031821d) and I am having some issues with the AvoidChanges specification combined with the CircularDnaOptimizationProblem class. As you suggested, I am using AvoidChanges to minimize the changes to the sequence (opt), but also to avoid changes completely in specific regions (constr). I have prepared a test script to reproduce the error that you'll find attached. I tested it calling:

./dnc_test.py -c -s e_coli -m 0.4 -x 0.6 -w 10 -z -f 0.1 -r test_reports test_seq.gb restriction_sites.tsv test_opt.gb

Traceback:

Traceback (most recent call last):
  File "./dnc_test.py", line 289, in <module>
    main()
  File "./dnc_test.py", line 258, in main
    project_name=f"Optimization of {os.path.basename(in_fn)}"
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/CircularDnaOptimizationProblem.py", line 141, in optimize_with_report
    return problem.optimize_with_report(target, project_name=project_name)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 771, in optimize_with_report
    self.optimize()
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 637, in optimize
    self.optimize_objective(objective=objective)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 621, in optimize_objective
    local_problem.optimize_by_exhaustive_search()
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 498, in optimize_by_exhaustive_search
    current_best_score = self.objective_scores_sum()
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 229, in objective_scores_sum
    return self.objectives_evaluations().scores_sum()
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 226, in objectives_evaluations
    return ProblemObjectivesEvaluations.from_problem(self)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/SpecEvaluation.py", line 407, in from_problem
    for specification in problem.objectives
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/SpecEvaluation.py", line 407, in <listcomp>
    for specification in problem.objectives
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/builtin_specifications/AvoidChanges.py", line 89, in evaluate
    sequences_differences_array(sequence, target)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/biotools/sequences_differences.py", line 13, in sequences_differences_array
    "Only use on same-size sequences (%d, %d)" % (len(seq1), len(seq2))
ValueError: Only use on same-size sequences (1, 0)

Without -c (circular sequence), everything works as expected. Let me know if you can reproduce the issue or if you need some more info. Also, feel free to use snippets from the script in examples for the docs. Thanks in advance, and good job with the latest commits!

test_dnachisel_ac.tar.gz

simone-pignotti commented 4 years ago

I just noticed two more issues:

EDIT: it's actually one issue, the CircularDnaOptimizationProblem.to_record one. The other one is me being silly.

Zulko commented 4 years ago

Regarding the original bug: looks fixable. I think my test cases all involved constraints, but I might have overlooked objective optimization for circular sequences. Could you send a minimum python script reproducing the issue, such as:

import dnachisel as dc
problem = dc.DnaOptimizationProblem(
    sequence=dc.load_record('dnc_test.gb')
    constraints=[???],
    objectives=[???]
)
problem.resolve_constraints()
problem.optimize()
simone-pignotti commented 4 years ago

I tried the following, but another bug pops up:

import dnachisel as dc
problem = dc.CircularDnaOptimizationProblem(
    sequence=dc.load_record('test_seq.gb'),
    constraints=[dc.AvoidPattern(pattern=dc.SequencePattern.DnaNotationPattern('GAAABCC'))],
    objectives=[dc.AvoidChanges()]
)
problem.resolve_constraints()
problem.optimize()
Traceback (most recent call last):
  File "dnc_test.py", line 50, in <module>
    problem.optimize()
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/CircularDnaOptimizationProblem.py", line 131, in optimize
    self, with_constraints=True, with_objectives=True
TypeError: _circularized_view() got multiple values for argument 'with_objectives'

The same works fine with DnaOptimizationProblem.

Anyway, replacing problem.resolve_constraints() and problem.optimize() with problem.optimize_with_report(target='report') :

Traceback (most recent call last):
  File "dnc_test.py", line 7, in <module>
    problem.optimize_with_report(target='report')
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/CircularDnaOptimizationProblem.py", line 141, in optimize_with_report
    return problem.optimize_with_report(target, project_name=project_name)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 771, in optimize_with_report
    self.optimize()
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 637, in optimize
    self.optimize_objective(objective=objective)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 566, in optimize_objective
    evaluation = objective.evaluate(self)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/builtin_specifications/AvoidChanges.py", line 89, in evaluate
    sequences_differences_array(sequence, target)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/biotools/sequences_differences.py", line 13, in sequences_differences_array
    "Only use on same-size sequences (%d, %d)" % (len(seq1), len(seq2))
ValueError: Only use on same-size sequences (285, 95)

Note that the ValueError is different from my previous case, where it was:

ValueError: Only use on same-size sequences (1, 0)
Zulko commented 4 years ago

No warranties, but can you give it a try now (from the github master)

simone-pignotti commented 4 years ago

Thanks a lot for the really quick fix! Looks like it's working now (even the CircularDnaOptimizationProblem.to_record function). Closing for now, I'll let you know if I find other issues.

simone-pignotti commented 4 years ago

Ops, running on a bigger instance results in this error:

Solving constraints
Now optimizing the sequence
Success! Generating report.
Traceback (most recent call last):
  File "./dnc.py", line 281, in <module>
    main()
  File "./dnc.py", line 255, in main
    problem.optimize_with_report(target=report_dir)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/CircularDnaOptimizationProblem.py", line 141, in optimize_with_report
    return problem.optimize_with_report(target, project_name=project_name)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/DnaOptimizationProblem.py", line 774, in optimize_with_report
    target, self, project_name=project_name
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/dnachisel/reports/optimization_reports.py", line 264, in write_optimization_report
    contract_under=contract_under)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/geneblocks/DiffBlocks.py", line 160, in from_sequences
    blocks_in_seqs, remarks = get_optimal_common_blocks(common_blocks)
  File "/Users/simonepignotti/.conda/envs/dnachisel/lib/python3.7/site-packages/geneblocks/DiffBlocks.py", line 462, in get_optimal_common_blocks
    s1_loc = blocks_in_seqs_dicts['s1'][block_name]['location']
KeyError: 'block_03*'

Exact same code running on the test sequence (attached to the previous comment) runs fine.

Zulko commented 4 years ago

Sounds like an error in Geneblocks when plotting the figure in the report, so a different issue. Will look into it.

simone-pignotti commented 4 years ago

Thank you! If that helps, the sequence is ~12kb and everything works fine with DnaOptimizationProblem.

Zulko commented 4 years ago

I guess you had some kind of homology in the sequence that caused a pretty niche bug. Can you try to upgrade geneblocks and see if it solves the problem?

pip(3) install --upgrade geneblocks
simone-pignotti commented 4 years ago

That solves it, thanks! On the other hand, it breaks CircularDnaOptimizationProblem.to_record for some reason :/ the output sequence is the original one.

simone-pignotti commented 4 years ago

Just another idea for the future: it would be really cool to have a dna_features_viewer.CircularGraphicRecord.plot_with_bokeh option for reports (the PDF is really hard to browse for long sequences, above all for circular ones where it contains 3 times as many bases)

Zulko commented 4 years ago

Are you sure the CircularDnaOptimizationProblem.to_record is a bug, is it not that the sequence didn't need to be optimized so DNA Chisel didn't do any optimization? Can you provide a minimal example to reproduce the issue?

simone-pignotti commented 4 years ago

Sorry couldn't find the time to create a minimal example, but I think it is linked to the optimize_with_report function because running resolve_constraints followed by optimize results in to_record working. Maybe the optimize_with_report calls to_record already, which for some reason modifies the sequence object?

Zulko commented 4 years ago

You were right, the optimize_with_report method was very wrong. I have hopefully fixed it now, please upgrade (pip(3) install --upgrade dnachisel) and let me know if it works.

simone-pignotti commented 4 years ago

Thanks a lot, it works now! This will be the last comment on this very long thread, promised :) I'll take the chance to thank you again for this extremely useful library, and all the others you are developing for EGF and kindly sharing with the world. It's a really interesting ecosystem, and I don't think there is a valid alternative that is so open and hackable at the moment! Congrats

Zulko commented 4 years ago

Thanks, much appreciated! Useful and hackable is what I aim at with these projects, so I'm happy you noticed.

Let me know if you run into more trouble with any of these repos. And btw I am spending some time cleaning and improving the code, so there might be some minor api changes in the near future (I'll keep track of changes there).