Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License
213 stars 38 forks source link

Issue with AvoidPattern (and potentially other builtin_specifications) #60

Closed deto closed 2 years ago

deto commented 2 years ago

It looks as though AvoidPattern doesn't preserve its strand information when being localized if the original strand information is 'from_location'

https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/db0b606ec993a509de9f3bb4d8acde9baadeef99/dnachisel/builtin_specifications/AvoidPattern.py#L85

Example illustrating this:

ipdb> cst
    # AvoidPattern[4-1624(-)](pattern:AACAAAT)

ipdb> location
    # 9-10

ipdb> new_cst = cst.localized(location, problem=self)
ipdb> new_cst
    # AvoidPattern[4-16](pattern:AACAAAT)

ipdb> new_cst.location
    # 4-16

ipdb> new_cst.strand
    #  'from_location'

The from_location is strand argument is preserved, but the original strand in the location is lost, so the resulting constraint loses the strandedness. I ran into this today when my optimization kept failing and the violated constraint was being reported on the positive strand even though that constraint should only be on the negative.

This was in version 3.2.6 (though the current code looks unchanged)

To fix this in the short term, it looks like I can specify the strand directly with its own argument when creating the constraint (let me know if this is not good for some reason).

This tool is very useful, thank you for the continued development!

deto commented 2 years ago

Actually, specifying strand when creating the AvoidPattern spec doesn't fix this either. The localized version doesn't propagate this to its 'location' property.

>> new_cst.strand
    # -1
>> new_cst.location.strand
   # 0
veghp commented 2 years ago

Thank you, will try and look into this issue during the week.

(may be related to or to be reviewed together with https://github.com/Edinburgh-Genome-Foundry/DnaChisel/issues/53 )

Zulko commented 2 years ago

Yikes, never got bitten by this one. Having a -1 strand replaced by 0 is better than the other way round (it goes from less to ore conservative) but I understand how it can be blocking in some scenarios. For that particular spec, it looks like it could just be a matter or adding one line after l95 to say new_location.strand = location.strand.

Zulko commented 2 years ago

@veghp do you want help with that one? Happy to look into it.

veghp commented 2 years ago

That would be great help, yes. There were some delays in other projects, so didn't have time to look into this.

Zulko commented 2 years ago

So I wasn't able to reproduce the bug, using the repo's master which should correspond to release 3.2.7 on PyPI. The strand of the location does get transferred to its localized version, suggesting that this may have been fixed in this release :tada: . Here is what I ran (note that even the coordinates of the localized location have changed in this new release, from 4-16 to 4-15, but as long as the battery of tests is passing I am confident there is no problem here).

Could you try again with 3.2.7 and let us know if that solves your problems?

from dnachisel import AvoidPattern, Location
cst =AvoidPattern("AACAAT", Location(4, 1624, -1))
location = Location(9, 10)
new_cst = cst.localized(location)

print ("Original CST:", cst)
print ("Original Strand:", cst.strand)
print ("Localized to:", location)
print ("New CST:", new_cst)
print ("New CST strand:", new_cst.strand)

Result:

Original CST: AvoidPattern[4-1624(-)](pattern:AACAAT)
Original Strand: from_location
Localized to: 9-10
New CST: AvoidPattern[4-15(-)](pattern:AACAAT)
New CST strand: from_location
veghp commented 2 years ago

Thanks, @Zulko . The only location-specific change in the new release was this, but I can confirm I get the same result as you with v3.2.6, so it's not related to the release.

@deto could you please clarify the problem=self parameter in your example, or provide a reproducible example?

@Zulko what is the (planned) purpose of problem= in https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/db0b606ec993a509de9f3bb4d8acde9baadeef99/dnachisel/builtin_specifications/AvoidPattern.py#L85 as it doesn't seem to be used in the function body?

Zulko commented 2 years ago

Not on my computer right now but from memory, localized() has the same interface in all Specifications, so it has this problem parameter because some other specs need a problem parameter

deto commented 2 years ago

I was scratching my head at this as, at first, when I tried to reproduce I also wasn't seeing it.

However, it seems to depend on the length of the pattern, strangely.

This code shows the problem:

from dnachisel import AvoidPattern
from dnachisel import Location

cst = AvoidPattern("AACAAAT", location=(4, 1624, -1))

location = Location(9, 10)

new_cst = cst.localized(location)

new_cst.location.strand  # shows 0, should be -1

Running this produces the erroneous unstranded localized AvoidPattern. However, if you remove the final 'T' from the pattern, then it works as expected (new_cst.location.strand = -1)

I did some investigating to figure out why as this is very strange.

The reason is that the extended_location that's created in AvoidPattern.localized will start at 3 if the pattern is of length 7, but will stay at 4 if the pattern is at length 6. This is a problem because inside of Location.overlap_region:

https://github.com/Edinburgh-Genome-Foundry/DnaChisel/blob/db0b606ec993a509de9f3bb4d8acde9baadeef99/dnachisel/Location.py#L41

def overlap_region(self, other_location):
        """Return the overlap span between two locations (None if None)."""
        strand = self.strand
        if other_location.start < self.start:
            self, other_location = other_location, self

        if other_location.start < self.end:
            start = other_location.start
            end = min(self.end, other_location.end)
            strand = self.strand
            return Location(start, end, strand)
        else:
            return None

The first if statement swaps the two locations based on their relative start position. However, regardless of whether the locations are swapped, the new strand is taken from whatever location ends up in the self variable. The correct behavior probably should be to take the strand definition from the original self regardless of whether the locations are swapped. Or, potentially, to evaluate the true overlap (in which case, a positive strand location overlapped with a negative strand location would produce an empty result)

Zulko commented 2 years ago

:facepalm: sorry I didn't realize I was using a different pattern. I can reproduce it now, and thanks for the rootcausing, looking into it.

Zulko commented 2 years ago

I guess the real root cause here was that I wrote overlap_region in a very confusing way and that bit us. This is (hopefully) solved via #61. Leaving it to you @veghp for review and release :+1:

veghp commented 2 years ago

Thank you very much, I'll take over from here and have a look tomorrow, Friday latest.

As for workflow, I just develop in dev then bump version when all issues are addressed, and merge with master, then release it. (The reason is to have only 1 codebase for a given version number in master.) I'll write this up in a 'contribute' document or something as it may be useful to potential contributors.

veghp commented 2 years ago

Thanks @Zulko , it looks good to me (I added a small comment https://github.com/Edinburgh-Genome-Foundry/DnaChisel/pull/61). I'll try and address https://github.com/Edinburgh-Genome-Foundry/DnaChisel/issues/53 too and release on Monday.

veghp commented 2 years ago

Fixed in latest release: https://github.com/Edinburgh-Genome-Foundry/DnaChisel/releases/tag/v3.2.8