UC-Davis-molecular-computing / scadnano-python-package

Python scripting library for generating designs readable by scadnano.
https://scadnano.org
MIT License
13 stars 7 forks source link

Closes #2; support Extensions on the end of a Strand #230

Closed UnHumbleBen closed 2 years ago

UnHumbleBen commented 2 years ago

WIP: #2

@dave-doty I added some unit tests for chain methods for DNA extensions. Can you give feedback on the unit tests written so far. Is the tested behavior desirable? Are there additional tests/behavior you would like me to add?

UnHumbleBen commented 2 years ago

@dave-doty For creating a 5' extension, should the extension chain method only take effect when a domain is added, similar to how cross works? Or should it just add an extension immediately?

https://github.com/UC-Davis-molecular-computing/scadnano-python-package/blob/6a2c10110823f7450df044d13a9746449e1a0907/scadnano/scadnano.py#L2265-L2266

dave-doty commented 2 years ago

For creating a 5' extension, should the extension chain method only take effect when a domain is added, similar to how cross works? Or should it just add an extension immediately?

It should be like move or to and make the extension immediately; in particular, if it is the last substrand, it would be created by having the extension chained method be the last method called, so it would have to create it immediately. (But there's no reason not to also create it immediately if it is the first chained method call.) This should be legit:

design.draw_strand(0,0).move(10).cross(1).move(-5).extension(6)

On the other hand, if it is the first substrand, it's not clear what's the best approach. draw_strand takes a helix and offset as input. It would be awkward if those arguments refer to something other than the first substrand on the strand. Any ideas for how to do this in a clean, readable way that won't be confusing? (And that preserves the behavior we currently have if there are no extensions?)

dave-doty commented 2 years ago

Any ideas for how to do this in a clean, readable way that won't be confusing? (And that preserves the behavior we currently have if there are no extensions?)

After thinking about this for a second, I think the best is to have an optional argument extension_5p_length in draw_strand. To create a 5' extension of length 6, followed by a domain on helix 0 from offset 0 to 9, you'd do this:

design.draw_strand(0, 0, extension_5p_length=6).move(10)

Then maybe the name of the extension chained method should be changed to extension_3p to emphasize that it can only be used to create a 3' extension.

UnHumbleBen commented 2 years ago

Any ideas for how to do this in a clean, readable way that won't be confusing? (And that preserves the behavior we currently have if there are no extensions?)

After thinking about this for a second, I think the best is to have an optional argument extension_5p_length in draw_strand. To create a 5' extension of length 6, followed by a domain on helix 0 from offset 0 to 9, you'd do this:

design.draw_strand(0, 0, extension_5p_length=6).move(10)

Then maybe the name of the extension chained method should be changed to extension_3p to emphasize that it can only be used to create a 3' extension.

I think this is a good idea and I'll proceed with implementing this way.

The only thing I'm unsure about is how to set a default value for the 5' extension relative_offset. Suppose the following snippet happens:

design.draw_strand(0, 0, extension_5p_length=6)

Sure, we can create an Extension object, but what relative_offset should it have? If it were a 3' extension, this problem would be simple, either (1, -1) for forward strand and (-1, 1) for reverse strand. But for the 5' extension, we don't know whether the domain will be created in the forward or reverse direction, so we can't assign a default relative_offset. I suppose we could make the relative_offset field optional and have the empty value denote that the default placement is desired. What do you think @dave-doty?

Edit: Disregard the question above, I just realize that we have always been using the absence of a value to indicate default so there's no need to explicitly figure out what the offset is and just assign None to it and make relative_offset an Optional field

dave-doty commented 2 years ago

Edit: Disregard the question above, I just realize that we have always been using the absence of a value to indicate default so there's no need to explicitly figure out what the offset is and just assign None to it and make relative_offset an Optional field

I'm not sure about this, if you are talking about making the field Extension.relative_offset be of type Optional[Tuple[float, float]].

I don't like allowing a value to be None (i.e., making its type Optional[Something]) unless there's really some sense in which in can lack a value. (For instance, a Strand can lack a name, in which case there's no string it makes sense to assign to Strand.name; it should be assigned None.)

In this case, every Extension should have a relative_offset. The default value is a concrete pair of floats, and every other relative_offset value one could imagine is also a pair of floats, so it doesn't make sense to allow the value to be None.

For internal classes that only we use, I'm happy to break this rule and just be very careful with our own code. But since Extension will be part of the public API, I prefer to give the guarantee that every relative_offset field has a non-None value of type Tuple[float, float], which will help avoid None-reference errors.

The problem you are describing is that in this situation, at the time draw_strand is called, we don't yet know which of the default values to use for relative_offset, since it will depend on the domain yet to be created. I think a way to handle this is to delay creating the extension until the first Domain is created. The StrandBuilder object could store a extension_5p_length field (which can be None to indicate that it was not specified in draw_strand), and if non-None, the first time move or to is called, this field is checked and an extension is created if need be. (And then the field is reset to None so that subsequent calls to move/to do not create extensions.) Also update_to will have to be handled; it could point the domain in the opposite direction as the first to call.

An alternative is to assume the first domain will be a forward domain and set Extension.relative_offset accordingly, and then to check if to or move (or update_to) is called to create a reverse domain, and then if there is an extension, to fix its relative_offset value.

Does that seem like a good approach? A little more complex to implement StrandBuilder, but then it simplifies the public API for Extension.

dave-doty commented 2 years ago

BTW I'm starting to think from this discussion that it would be simpler to do what you originally suggesteed, and have relative_offset be relative to the "rotation frame" of the adjacent domain. I still think it would be confusing for it to be an offset, i.e., a (z,y) coordinate. But perhaps if it were in polar coordinates, i.e., an angle in degrees and a length in nanometers, that would be simpler for a user to think about.

In this case, 0 degrees would mean the extension lies next to the domain on the same helix. (if next to a forward domain for example, to the left of the domain if a 5' extension and to the right of the domain if a 3' extension) 90 degrees would point away from the helix, and 180 degrees would make the extension overlap the adjacent domain.

In this case I think it would be easier to read if it were two fields of type float, one called length and the other called angle. Then we could change the current length field to be called num_bases to differentiate it.

What do you think about that? The default values could be angle = 45 degrees and length = 1 nm.

This has the bonus that we would not need to worry about doing those transformations when moving/copying strands in the web interface; it would be automatic. But we would have to be careful rendering the extension, though hopefully appropriate SVG transform's would make this easy. (i.e., reflect horizontally if it's a 3' extension, and then rotate 180 degrees if the adjacent domain is reverse).

It would also require some clever calculating when drag-moving the end of the extension, since the (z,y) relative coordinates would need to be translated to the "rotated" polar coordinates.

UnHumbleBen commented 2 years ago

I now agree that the type of relative_offset should be not Optional. I confused the JSON representation with the actual Python object. I mistakenly thought that omitting a field in the JSON translates to a None field in the Python object. I now know that is not the case.

I like the idea of using polar coordinates (length, angle) compared to screen coordinates (z,y). As for whether the coordinate should be relative to a fixed frame or rotation frame of the adjacent domain, I'm not sure. From users I suspect it would be easier to think in terms of a fixed coordinate. At the same time, perhaps most users won't need to think about this field and stick with the default or do modifications on the web application. In this case, I think using the "rotation frame" of the adjacent domain is acceptable because web app users and most Python users won't have to worry about this field, and only a few Python users will need to learn the coordinate system. Whether we choose a fixed frame or a relative frame, I think I would prefer to make the field it's own class, like ExtensionRelativeOffset with fields like ExtensionRelativeOffset.length and ExtensionRelativeOffset.angle to improve readability and we can include details about the coordinate system in the class docstring.

dave-doty commented 2 years ago

At the same time, perhaps most users won't need to think about this field and stick with the default or do modifications on the web application.

Actually, the more I think about this, the more I think that Python users who want to set this position themselves would have an easier time if it's relative to the rotation frame, given how more figures such as this tend to look in papers. (The defaults I'm imagining would be close to what you see in many papers that display a single-stranded extension.)

Whether we choose a fixed frame or a relative frame, I think I would prefer to make the field it's own class, like ExtensionRelativeOffset with fields like ExtensionRelativeOffset.length and ExtensionRelativeOffset.angle to improve readability and we can include details about the coordinate system in the class docstring.

This breaks a bit with convention in the Python library. For instance, an insertion is just a Tuple[int, int]. We could have made it its own class, but then it would be more verbose to modify. Definitely the Dart code benefits from making these classes, because it does not have native support for Tuples (there's some tuple package but it's awkward and can't use the simple Python syntax for accessing elements).

However, I think it should be fine to make it its own class, especially if the chained methods can set these properties without the user needing to create an instance of ExtensionRelativeOffset, e.g.:

design.draw_strand(0, 0, extension_angle=60, extension_length=2)\
      .move(10)\
      .extension_3p(angle=30, length=4)

which would make something like this:

image

UnHumbleBen commented 2 years ago

Actually, the more I think about this, the more I think that Python users who want to set this position themselves would have an easier time if it's relative to the rotation frame, given how more figures such as this tend to look in papers. (The defaults I'm imagining would be close to what you see in many papers that display a single-stranded extension.)

Sure sounds good, let's use the rotation frame of the adjacent domain.

However, I think it should be fine to make it its own class, especially if the chained methods can set these properties without the user needing to create an instance of ExtensionRelativeOffset...

Certainly we want to keep the ability to for the chained methods to be concise without forcing the user to create additional classes. The only issue is coming up with good names for the arguments. For draw_strand, the extension_angle argument is clear, but I'm not sure about what is a good name for the length of the offset as extension_length or extension_5p_length is already used to refer to the number of dna bases on the extension. Similarly, for extension_3p, the angle argument is clear, but again I am not sure what is a good name for the length of the offset because length is already used for the same reason above. Actually, in your example code above, I wasn't sure whether length refers to the polar coordinate length or the number of dna bases.

Another thing, given that we are using polar coordinates, should we rename the name of the field as well as the class for the field? Is "offset" still the right word when speaking in terms of polar coordinates? Would something like Extension.endpoint and ExtensionOrientation be more fitting? Or perhaps we can make it more explicitly that the field has to do with how to display the extension on the web app. So something like Extension.display_config and ExtensionDisplayConfig.

Considering the above, my suggestion is to call the field Extension.display_config and it is of type ExtensionDisplayConfig, which has fields ExtensionDisplayConfig.length and ExtensionDisplayConfig.angle.

For the chain methods, we can have a new method with_extension_display_config which takes as arguments length and angle and must be called either after draw_strand with an extension length specified or after extension_3p. E.g.

design.draw_strand(0, 0, extension_5p_length=2)
      .with_extension_display_config(length=0.68, angle=60)
      .move(10)\
      .extension_3p(length=4)
      .with_extension_display_config(length=1.36, angle=30)

I think this is a good balance of conciseness and clarity. Additionally, the dna base "length" and visual display "length" are separated explicitly.

Another thought, since in polar coordinate, the length field is equivalent to the length of the extension, we could make the default polar coordinate length 0.34 * extension length (number of bases), since each base pair is 0.34 nanometers long.

What are your thoughts @dave-doty?

dave-doty commented 2 years ago

Certainly we want to keep the ability to for the chained methods to be concise without forcing the user to create additional classes... what is a good name for the length of the offset as extension_length or extension_5p_length is already used to refer to the number of dna bases on the extension.

Oh yeah, I forgot about that field too.

Now that I've thought about it some more, maybe we should just make it a chained method instead of modifying draw_strand, e.g.,

design.draw_strand(0, 0)
      .with_extension_5p(num_bases=7, length=2.1, angle=60.0)
      .move(10)

Obviously the last two fields have default values of 1.0 and 45.0, so you could do

design.draw_strand(0, 0)
      .with_extension_5p(num_bases=7)
      .move(10)

After this discussion, this last one is now my favorite, particularly since we don't need to modify length and angle after seeing what type of domain is the first domain after the call to move (as we did with relative_offset). I also like that it will look similar to a 3' extension:

design.draw_strand(0, 0)
      .with_extension_5p(num_bases=7, length=2.1, angle=60.0)
      .move(10)
      .cross(1)
      .move(-5)
      .with_extension_3p(num_bases=5, length=2.1, angle=30.0)

What do you think?

UnHumbleBen commented 2 years ago

After this discussion, this last one is now my favorite, particularly since we don't need to modify length and angle after seeing what type of domain is the first domain after the call to move (as we did with relative_offset). I also like that it will look similar to a 3' extension:

design.draw_strand(0, 0)
      .with_extension_5p(num_bases=7, length=2.1, angle=60.0)
      .move(10)
      .cross(1)
      .move(-5)
      .with_extension_3p(num_bases=5, length=2.1, angle=30.0)

What do you think?

I think think this is a great idea. I just realized that the only reason why we even bother modifying draw_strand was because we were using fixed frame rather than relative frame. Now that we are using relative frame, there's no need to modify draw_strand.

In your code, you have with_extension_5p and with_extension_3p, but perhaps the functionality of these two are similar enough that we can just have one with_extension? I do like have _5p and _3p because it's more verbose and explicit, which might simplify the error handling. What are your thoughts on this?

The other thing to decide on is how to store the length and angle in Extension. We were originally thinking of Extension.relative_offset, but what are your thoughts on Extension.endpoint or Extension.display_config, which I proposed in my previous comment. Another possibility is storing the fields separately as Extension.length and Extension.angle. Out of these options, which one do you think is most consistent with the rest of the scadnano data model? Personally, I like Extension.display_config with a type of ExtensionDisplayConfig because it isolates the display configuration into a single field and makes it explicitly that the "length" refers to a display configuration length rather than number of dna bases.

Another suggestion I have in my previous comment is having the default length be 0.34 * num_bases. What are your thoughts on this?

dave-doty commented 2 years ago

In your code, you have with_extension_5p and with_extension_3p, but perhaps the functionality of these two are similar enough that we can just have one with_extension? I do like have _5p and _3p because it's more verbose and explicit, which might simplify the error handling. What are your thoughts on this?

I agree, I like that it's explicit, and it's only 3 extra characters so not cumbersome to type.

I wonder whether we should make two classes Extension5p and Extension3p? What I'm imagining is that you have an generic Extension object and you want to know whether it's on the 5' or 3' end, you have to go digging through the Strand list of domains to figure it out, so it might be nice if the object itself indicated which type it is.

This could also just be implemented as a Boolean field in the single Extension class. This breaks convention with the three types of Modifications, but there I think it makes more sense to have separate types, since ModificationInternal actually has different fields than the other two. In the Extension case, both types of Extension have identical fields, so making two separate types might be overkill.

Thoughts?

Another possibility is storing the fields separately as Extension.length and Extension.angle. Out of these options, which one do you think is most consistent with the rest of the scadnano data model? Personally, I like Extension.display_config with a type of ExtensionDisplayConfig because it isolates the display configuration into a single field and makes it explicitly that the "length" refers to a display configuration length rather than number of dna bases.

This is a good point, it could get confusing since length means something else in Loopout, so I could imagine someone chasing down a bug where they think length means "number of bases".

Still, that's an awful lot of indirection and hierarchy to introduce. I think it's better just to have a flat structure in Extension (with three fields). So perhaps this is better achieved by simply changing the name from length to display_length? That's consistent with, e.g., other display properties such as the connector length on Modification.

(Whew, you're seeing why I've delayed so long on implementing this feature. It seems like there's lots of room to do it poorly.)

dave-doty commented 2 years ago

Another suggestion I have in my previous comment is having the default length be 0.34 * num_bases. What are your thoughts on this?

I can see an argument for this. The main reason I didn't want to do this is that with a fixed length, we can ensure that there's no awkward overlap in the display. With a long enough extension, the extension would display over the top of the adjacent helix. Of course the user could fix this, but it would be nice if the default behavior avoided this.

The other advantage is that it's nice when default values don't depend on surrounding context (another reason I like that we've switched to the convention of (angle, display_length) relative to the adjacent domain, where the default values are absolute and don't require knowing about the adjacent domain). For example, it's awkward to load helices from the JSON format, because many of their default properties (e.g., idx) require all of them to be loaded first before default properties can be set. It's nicer when you can re-hydrate an object from JSON based only on the object's JSON and not on surrounding context.

Of course, if the user has different length extensions and want them to be proportional to length, then they'll have to customize. But many designs have the property that all extensions are the same length, so I see this as being less of a problem.

So my intuition is that the first two advantages outweigh the last.

UnHumbleBen commented 2 years ago

I wonder whether we should make two classes Extension5p and Extension3p? What I'm imagining is that you have an generic Extension object and you want to know whether it's on the 5' or 3' end, you have to go digging through the Strand list of domains to figure it out, so it might be nice if the object itself indicated which type it is.

This could also just be implemented as a Boolean field in the single Extension class. This breaks convention with the three types of Modifications, but there I think it makes more sense to have separate types, since ModificationInternal actually has different fields than the other two. In the Extension case, both types of Extension have identical fields, so making two separate types might be overkill.

Thoughts?

I think the difference between Extension5p and Extension3p would be too small to warrant making a separate class. Having a Boolean field in the Extension class increases the number of invalid states (i.e. having an Extension at the end of the strand.domains with Boolean field set to 5' end) and I'm not sure how beneficial the field would be. I don't envision distinguishing between 3' and 5' extension to be a common scenario so I think it's best to just have a single Extension class with no boolean field to distinguish 3' vs 5'.

I think it's better just to have a flat structure in Extension (with three fields). So perhaps this is better achieved by simply changing the name from length to display_length?

I'm on board with this idea. Perhaps we can change angle to display_angle as well?

Another suggestion I have in my previous comment is having the default length be 0.34 * num_bases. What are your thoughts on this?

I can see an argument for this. The main reason I didn't want to do this is that with a fixed length, we can ensure that there's no awkward overlap in the display. With a long enough extension, the extension would display over the top of the adjacent helix. Of course the user could fix this, but it would be nice if the default behavior avoided this.

The other advantage is that it's nice when default values don't depend on surrounding context (another reason I like that we've switched to the convention of (angle, display_length) relative to the adjacent domain, where the default values are absolute and don't require knowing about the adjacent domain). For example, it's awkward to load helices from the JSON format, because many of their default properties (e.g., idx) require all of them to be loaded first before default properties can be set. It's nicer when you can re-hydrate an object from JSON based only on the object's JSON and not on surrounding context.

Of course, if the user has different length extensions and want them to be proportional to length, then they'll have to customize. But many designs have the property that all extensions are the same length, so I see this as being less of a problem.

So my intuition is that the first two advantages outweigh the last.

Sounds good, let's have a fixed default value of 1.0 for display_length .

dave-doty commented 2 years ago

Perhaps we can change angle to display_angle as well?

Sure, why not.

UnHumbleBen commented 2 years ago

I think the difference between Extension5p and Extension3p would be too small to warrant making a separate class. Having a Boolean field in the Extension class increases the number of invalid states (i.e. having an Extension at the end of the strand.domains with Boolean field set to 5' end) and I'm not sure how beneficial the field would be. I don't envision distinguishing between 3' and 5' extension to be a common scenario so I think it's best to just have a single Extension class with no boolean field to distinguish 3' vs 5'.

@dave-doty Just to confirm, having one Extension class with no boolean field to distinguish between 5' vs 3' sounds good to you? I could start implementing this way and then add boolean field later if it turns out 5' vs 3' shows up a lot.

dave-doty commented 2 years ago

Sounds good.

On Tue, May 10, 2022, 5:58 PM Benjamin Lee @.***> wrote:

I think the difference between Extension5p and Extension3p would be too small to warrant making a separate class. Having a Boolean field in the Extension class increases the number of invalid states (i.e. having an Extension at the end of the strand.domains with Boolean field set to 5' end) and I'm not sure how beneficial the field would be. I don't envision distinguishing between 3' and 5' extension to be a common scenario so I think it's best to just have a single Extension class with no boolean field to distinguish 3' vs 5'.

@dave-doty https://github.com/dave-doty Just to confirm, having one Extension class with no boolean field to distinguish between 5' vs 3' sounds good to you? I could start implementing this way and then add boolean field later if it turns out 5' vs 3' shows up a lot.

— Reply to this email directly, view it on GitHub https://github.com/UC-Davis-molecular-computing/scadnano-python-package/pull/230#issuecomment-1122646865, or unsubscribe https://github.com/notifications/unsubscribe-auth/AETBU7PEOFYKLYMAUESRTF3VJKISJANCNFSM5TPPLAWA . You are receiving this because you were mentioned.Message ID: <UC-Davis-molecular-computing/scadnano-python-package/pull/230/c1122646865 @github.com>

UnHumbleBen commented 2 years ago

@dave-doty Almost finished implementation in the Python library. Just wanted to double check a few things with you. I have 3 test cases for actions that I think should be disallowed and wanted to see if you agree.

https://github.com/UC-Davis-molecular-computing/scadnano-python-package/blob/06fe22272f217f3f15b2c0c51b61f6fcf57adc14/tests/scadnano_tests.py#L3271-L3280

https://github.com/UC-Davis-molecular-computing/scadnano-python-package/blob/06fe22272f217f3f15b2c0c51b61f6fcf57adc14/tests/scadnano_tests.py#L3361-L3380

https://github.com/UC-Davis-molecular-computing/scadnano-python-package/blob/06fe22272f217f3f15b2c0c51b61f6fcf57adc14/tests/scadnano_tests.py#L3429-L3446

Essentially, should ligate, crossover, and half-crossover be disallowed if applied to the domain on the side where the extension is located.

I also have test cases for normal cases for ligate, crossover, and nick. All the error cases and normal cases begin here, if you want to take a look:

https://github.com/UC-Davis-molecular-computing/scadnano-python-package/blob/06fe22272f217f3f15b2c0c51b61f6fcf57adc14/tests/scadnano_tests.py#L3271

dave-doty commented 2 years ago

Yes, I agree those cases should all be errors.

UnHumbleBen commented 2 years ago

Discovered something strange about ligate while implementing the ligate error checking. There's no check on Python that the domains to ligate are at the end of the strand. For example something like this: image Clicking on helix 0, offset 15/16 should not be allowed. This check is done in the web app (nothing happens when I click there on ligate mode). However, this check is not implemented in the Python library:

Test case:

    def test_ligate_middle(self) -> None:
        """
        [-----+[----->
              |
        <-----+
        """
        design: sc.Design = sc.Design(helices=[sc.Helix(max_offset=100), sc.Helix(max_offset=100)])
        design.draw_strand(0, 0).to(10).cross(1).to(0)
        design.draw_strand(0, 10).to(20)

        design.ligate(0, 10, True)

Debugging design, I see the resulting strand object contains two overlapping domains, which is nonsensical:

 "strands": [
    {
      "color": "#f74308",
      "domains": [
        {"helix": 0, "forward": true, "start": 0, "end": 10},
        {"helix": 0, "forward": true, "start": 0, "end": 20}
      ]
    }
  ]

Is there some work that needs to be done with the existing ligate function to prevent this?

UnHumbleBen commented 2 years ago

I notice a similar problem with crossovers:

In the example here, adding a crossover between 1 and 2 at offset 0 should not be allowed. The web app prevents this (because of the modify crossover feature) but there are no checks in the python code:


    def test_add_half_crossover_middle(self) -> None:
        """
        0  +------]
           |
        1  +------>

        2  <------]
        """
        # Setup
        design: sc.Design = sc.Design(
            helices=[sc.Helix(max_offset=100), sc.Helix(max_offset=100), sc.Helix(max_offset=100)]
        )
        design.draw_strand(0, 10).to(0).cross(1).to(10)
        design.draw_strand(2, 10).to(0)

        design.add_half_crossover(1, 2, 0, True)

The code should throw an error but instead, we get a new design with strands:

  "strands": [
    {
      "color": "#57bb00",
      "domains": [
        {"helix": 2, "forward": false, "start": 0, "end": 10},
        {"helix": 0, "forward": false, "start": 0, "end": 10},
        {"helix": 1, "forward": true, "start": 0, "end": 10}
      ]
    }
  ]

which visually looks like:

image

which isn't right either. Am I using these functions incorrectly or are there actual bugs?

dave-doty commented 2 years ago

Is there some work that needs to be done with the existing ligate function to prevent this?

Definitely looks like an error to me. Do you mind adding unit tests to ensure it raises an exception?

dave-doty commented 2 years ago

Am I using these functions incorrectly or are there actual bugs?

This also looks like a bug.

UnHumbleBen commented 2 years ago

@dave-doty Changes are ready to merge to dev now

dave-doty commented 2 years ago

Can you write up some simple release notes and put them as a comment in the related issue #2? That way we can copy/paste them into the release notes when this is merged to main.