Add pbc conditions to in mapping coordinates to grid.

MDAnalysis / membrane-curvature

MDAnalysis tool to calculate membrane curvature.

https://membrane-curvature.readthedocs.io/

GNU General Public License v3.0

29 stars 6 forks source link

Add pbc conditions to in mapping coordinates to grid. #36

Closed ojeda-e closed 3 years ago

ojeda-e commented 3 years ago

Periodic Boundary Conditions (PBC) are not applied when mapping coordinates to grid (grid_map).

To fix this issue:

[ ] Add PBC conditions to map coordinates.
[ ] Add test using dummy coordinates to map values that exceed by defect or excess the boundaries of the grid.

For example:

System with 9 beads with coordinates of x -1, 0, 1 and y -1, 0 ,1
System with 16 beads with coordinates of x -2, -1, 0, 1, and y -2, -1, 0 ,1, 2.

richardjgowers commented 3 years ago

Try and avoid implementing this yourself, check out lib.distances.apply_PBC

ojeda-e commented 3 years ago

I was considering to apply PBC is in AnalysisBase, in __init__:

        # Apply PBC conditions
        if self.pbc == True:
            self.ag.wrap()
        else:
            warnings.warn(" `PBC == False` may result in inaccurate calculation "
                          "of membrane curvature. Surfaces will be derived from "
                          "a reduced number of atoms.")

I think this is a better approach than applying it in _single_frame or in the get_z_surface function (but I may be wrong). If you are ok with this approach, I can add it with respective tests to the current PR #41 and this issue may get fixed.

What do you think @IAlibay, @lilyminium, @orbeckst, @fiona-naughton?

IAlibay commented 3 years ago

I was considering to apply PBC is in AnalysisBase, in __init__:
        # Apply PBC conditions
        if self.pbc == True:
            self.ag.wrap()
        else:
            warnings.warn(" `PBC == False` may result in inaccurate calculation "
                          "of membrane curvature. Surfaces will be derived from "
                          "a reduced number of atoms.")
I think this is a better approach than applying it in _single_frame or in the get_z_surface function (but I may be wrong). If you are ok with this approach, I can add it with respective tests to the current PR #41 and this issue may get fixed.

What do you think @IAlibay, @lilyminium, @orbeckst, @fiona-naughton?

I might be misunderstanding what you're trying to do here, but isn't ag.wrap() only applied to that timestep as it is being accessed? I.e. when you traverse through the trajectory the wrap() stops being applied right?

orbeckst commented 3 years ago

Following up on https://github.com/MDAnalysis/membrane-curvature/pull/48#issuecomment-881905085 : Before I dive into the code in PR #48 I'd like you to describe your algorithm for solving the PBC problem in words and equations and perhaps pseudo code. State your assumptions about your data (e.g., is the input trajectory raw, centered, rotated, ...; are molecules whole or broken across PBCs, are atoms inside the primary unit cell, ...).

This might not be a trivial problem so it's worthwhile to step back and be clear about what needs to be done. It helps you and your reviewers.

The description in words makes your intent clear and reveals potential logic gaps. Adding equations forces you to be precise about what you really need to do.

ojeda-e commented 3 years ago

Thanks for the comments @orbeckst. Before I go into the pseudo code, and to clarify PR #48, I realized too late that I didn't mention two fundamental points.

Broadly speaking, there are two main types of systems to consider in MembraneCurvature. 1.1 Membrane only. 1.2 Membrane-protein.

And in 1.2, broadly speaking, we can have two possible scenarios, which depend on how the trajectory was performed. 1.2.1 Protein with positions restraints. The protein does not move, does not rotate. 1.2.2 Protein with no position restraints. The protein translates, rotates, etc.

In both cases, 1.2.1 and 1.2.2, although it depends on the choice of n_bins, undefined values will appear, as described in #45

Since MembraneCurvature derives surfaces from AtomGroup, the atoms there included shouldn't be broken. We can fix it with a preprocessing. For example, in gromacs one would process the trajectory with trjconv -pbc whole.

Now, depending on the type of system, and the purpose of the calculation. If the interest is to asses what is the curvature induced by the protein, the most difficult scenario to deal with is 1.2.2, because it would require additional preprocessing. We need to rotate and center the protein. For example, in gromacs, the trajectory would be processed with trjconv -pbc whole + trjconv fit -rot+trans. As a consequence, we will have a system that rotates or "accomodates" around the membrane. I may have not enough expertise, but in 100% of the systems I have worked on in the 1.2.2 category, the resulting system is a rhomboid. And I may be mistaking here, but then in this situation, wrapping atoms wouldn't make sense in the calculation because it must be taken into account in the preprocessing part.

I didn't mean #48 was the complete solution, it was intended to cover the type of PBC conditions for systems of the type 1.1, and 2.2.1.

I have been thinking about the best way to address this limitation. One option is, maybe by adding to __init__ a boolean protein_fit attribute that can takes the input of preprocessed trajectories of 2.2.1 and treats the surface accordingly.

For example,

def __init__(self, universe, select='all',
                 n_x_bins=100, n_y_bins=100,
                 x_range=None,
                 y_range=None,
                 protein_fit=False 
                 pbc=True, **kwargs):

and then do the respective calculation accordingly. Then the treatment wouldn't have wrapped coordinates. I will post the pseudo code in a different comment

orbeckst commented 3 years ago

From the above it wasn't clear to me if you want to have

all atoms inside the primary unitcell (ag.wrap()), OR
all lipid molecules (fragments) whole (ag.unwrap(compound="fragment") if I remember correctly) ?

Either way relies on the PBC code, which cannot deal with rotation-fitted systems. (The problem is not the triclinic shape of the cell — our PBC routines can handle it (it was A LOT of work for @richardjgowers and others to get that right) but rather that the unitcell representation is not rotated with the rotational fit.)

It wasn't clear to me that PR #48 would only deal with a subset of scenarios, namely 1.1 and 1.2.1 (you might want to make an unwrap step optional because it is expensive and if you already have a properly processed trajectory; I think wrap is fairly cheap). A fully processed 1.2.2 (membrane+protein unwrapped (or wrapped?) and rotational fitted is also fine) --- you just use the trajectory as is. You'll get regions that are not covered and your membrane around the region is best cut out as a circle (as you showed us in your presentation) but that's ok. This is how we typically deal with volumetric densities.

I am throwing the following out for discussion (and I might have missed something so please correct me/explain): I know that we stressed the need for dealing with PBC. But looking at your options here perhaps all you can really do is ensure that your input trajectory is already correctly processed: Either using external tools like gmx trjconv -pbc whole -ur compact -c; gmx trjconv -fit rot+transxy or using MDAnalysis on-the-fly transformations (although on-the-fly transformations cannot (yet) make -ur compact boxes). You could leave it to the user to provide properly preprocessed trajectories (at least in the initial version) and then later add code for doing the preprocessing.

lilyminium commented 3 years ago

I know this was not in the original scope, but Irfan and I were chatting about it. What about not preprocessing the original trajectory for rotation and translation but instead defining an atom group (typically the protein) to be the center? Then for each frame, you could obtain the rotation and translation matrix needed to fit the surface to the reference (typically the first frame). Because you could define the protein as the center and take the PBC-tiledmembrane around it, this would eliminate the need for nans.

I don’t think that the molecules must be made whole, as firstly you are averaging the z coordinates, and secondly the typical atom selection I’ve seen in your examples and tests has been one bead anyway.

Cheers, Lily

On Jul 17, 2021, at 3:14 PM, Oliver Beckstein @.***> wrote:

From the above it wasn't clear to me if you want to have

all atoms inside the primary unitcell (ag.wrap()), OR all lipid molecules (fragments) whole (ag.unwrap(compound="fragment") if I remember correctly) ? Either way relies on the PBC code, which cannot deal with rotation-fitted systems. (The problem is not the triclinic shape of the cell — our PBC routines can handle it (it was A LOT of work for @richardjgowers and others to get that right) but rather that the unitcell representation is not rotated with the rotational fit.)

It wasn't clear to me that PR #48 would only deal with a subset of scenarios, namely 1.1 and 1.2.1 (you might want to make an unwrap step optional because it is expensive and if you already have a properly processed trajectory; I think wrap is fairly cheap). A fully processed 1.2.2 (membrane+protein unwrapped (or wrapped?) and rotational fitted is also fine) --- you just use the trajectory as is. You'll get regions that are not covered and your membrane around the region is best cut out as a circle (as you showed us in your presentation) but that's ok. This is how we typically deal with volumetric densities.

I am throwing the following out for discussion (and I might have missed something so please correct me/explain): I know that we stressed the need for dealing with PBC. But looking at your options here perhaps all you can really do is ensure that your input trajectory is already correctly processed: Either using external tools like gmx trjconv -pbc whole -ur compact -c; gmx trjconv -fit rot+transxy or using MDAnalysis on-the-fly transformations (although on-the-fly transformations cannot (yet) make -ur compact boxes). You could leave it to the user to provide properly preprocessed trajectories (at least in the initial version) and then later add code for doing the preprocessing.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

ojeda-e commented 3 years ago

A quick answer to @orbeckst, what I was thinking was to have all the atoms in the primary unit cell, so using ag.wrap(). Now, let me give more details about the PBC I was considering. I'll go by parts.

Recap from the previous conversation: We can have three possible scenarios when calculating curvature: 1.1 Membrane only. 1.2 Membrane-protein, which can be 1.2.1 Protein with positions restraints. The protein does not move, does not rotate. 1.2.2 Protein with no position restraints. The protein translates, rotates, etc.

For the three possible outcomes, I will start with the two "trivial" ones. 1.1 and 1.2.1. When passing a Universe with a trajectory that has only lipids or lipids with a fixed protein, what we need to guarantee is PBC. Meaning, wrapping the coordinates in order to get those "extra" atoms in the same primary unit cell, as it is shown in the figure:

In this way I guarantee two things:

[x] I have more points to provide as reference to derive the surface. This will strongly improve sampling.
[x] Empty places in the raw trajectory will be populated, and therefore we will avoid np.nans. This means our gradients won't have lots of nans.
[x] Additionally, by adding this PBC, we can address one of the points that were highlighted earlier in the project about how do we do when the membrane is too close to the edge of the simulation box, that we would eventually have the atoms of references split in lower/upper side of the simulation box. (Raised by @lilyminium, suggesting to have negative coordinates in each dimension to run tests in the now called get_z_surface #22 )

In summary, what we need is to wrap coordinate to run MembraneCurvature in the 1.1 and 1.2.1 cases.

Then, how to apply PBC? We use ag.wrap() in base.py. When we initialize, to check if PBC was called "True", and then in _single_frame. Which was as the code submitted in PR #48.

In the scenario in which I am a very distracted user, and I preprocess the trajectory with PBC instead of the raw trajectory, applying coordinate wrapping won't hurt. Unless I am missing a point here, applying ag.wrap() won't hurt.

Now, let's consider the general landscape and skip the details: what we need is either apply PBC / ag.wrap, OR further processing/calculations. However, and in the way I think and conceptualize the problem is: How do we know when to use either way? How do we know that we either apply PBC and that suffices to run MembraneCurvature, or either do we do something different? That first limitation, and again, as I see the problems, goes like this: "hey user, let me know what type of system I am going to help calculating curvature".

So we can initialize knowing what we can use as Universe. I haven't come to a key name for that attribute, but let's say something like protein_fit=False. Then we know that under the condition of having a system like 1.1 or 1.2.1, we don't perform more than coordinate wrapping to run MembraneCurvature().

__init__(self, universe, select='all',
                 n_x_bins=100, n_y_bins=100,
                 x_range=None,
                 y_range=None,
                 protein_fit=False  # alternatively, center_protein=False, or similar.
                 pbc=True, **kwargs):

So far so good? @orbeckst @richardjgowers @IAlibay @lilyminium @fiona-naughton I would like to know we are on the same page, at least for these two "trivial" cases.

I will post a different comment for the treatment of case 1.2.1.

orbeckst commented 3 years ago

Your treatment of the “trivial” cases makes sense to me.

Applying wrapping will only be a problem if it’s a rotationally-fit trajectory so as you say, the problem is how to know about that case. For a start it’s perfectly acceptable to put the onus onto the user to provide correct input and correctly declare it.

Am 7/18/21 um 23:35 schrieb Estefania Barreto-Ojeda @.***>:

A quick answer to @orbeckst, what I was thinking was to have all the atoms in the primary unit cell, so using ag.wrap(). Now, let me give more details about the PBC I was considering. I'll go by parts.

Recap from the previous conversation: We can have three possible scenarios when calculating curvature: 1.1 Membrane only. 1.2 Membrane-protein, which can be 1.2.1 Protein with positions restraints. The protein does not move, does not rotate. 1.2.2 Protein with no position restraints. The protein translates, rotates, etc.

For the three possible outcomes, I will start with the two "trivial" ones. 1.1 and 1.2.1. When passing a Universe with a trajectory that has only lipids or lipids with a fixed protein, what we need to guarantee is PBC. Meaning, wrapping the coordinates in order to get those "extra" atoms in the same primary unit cell, as it is shown in the figure:

In this way I guarantee two things:

I have more points to provide as reference to derive the surface. This will strongly improve sampling. Empty places in the raw trajectory will be populated, and therefore we will avoid np.nans. This means our gradients won't have lots of nans. Additionally, by adding this PBC, we can address one of the points that were highlighted earlier in the project about how do we do when the membrane is too close to the edge of the simulation box, that we would eventually have the atoms of references split in lower/upper side of the simulation box. (Raised by @lilyminium, suggesting to have negative coordinates in each dimension to run tests in the now called get_z_surface #22 ) In summary, what we need is to wrap coordinate to run MembraneCurvature in the 1.1 and 1.2.1 cases.

Then, how to apply PBC? We use ag.wrap() in base.py. When we initialize, to check if PBC was called "True", and then in _single_frame. Which was as the code submitted in PR #48. In the scenario in which I am a very distracted user, and I preprocess the trajectory with PBC instead of the raw trajectory, applying coordinate wrapping won't hurt. Unless I am missing a point here, applying ag.wrap() won't hurt.

Now, let's consider the general landscape and skip the details: what we need is either apply PBC / ag.wrap, OR further processing/calculations. However, and in the way I think and conceptualize the problem is: How do we know when to use either way? How do we know that we either apply PBC and that suffices to run MembraneCurvature, or either do we do something different? That first limitation, and again, as I see the problems, goes like this: "hey user, let me know what type of system I am going to help calculating curvature".

So we can initialize knowing what we can use as Universe. I haven't come to a key name for that attribute, but let's say something like protein_fit=False. Then we know that under the condition of having a system like 1.1 or 1.2.1, we don't perform more than coordinate wrapping to run MembraneCurvature().

init(self, universe, select='all', n_x_bins=100, n_y_bins=100, x_range=None, y_range=None, protein_fit=False # alternatively, center_protein=False, or similar. pbc=True, **kwargs): So far so good? @orbeckst @richardjgowers @IAlibay @lilyminium @fiona-naughton I would like to know we are on the same page, at least for these two "trivial" cases.

I will post a different comment for the treatment of case 1.2.1.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

ojeda-e commented 3 years ago

Now, in more detail, let's see the case of systems 1.2.2 Two main points to consider:

In this type of system, the scientific question behind this type of analysis is "What is the curvature induced by the protein?"
While the protein is diffusing, the principal axes of the protein change in every frame, as the sequence below shows.

If we pass the trajectory by applying wrapping coordinates only, then the question stated above can't be answered because the protein diffuses and therefore the output obtained from MembraneCurvature would be meaningless. Hence, we need to treat this system in a particular fashion. Again, in the way I see the problem, there are two options.

I will briefly describe each one, and highlight pros and cons:

- Solution No. 1

Trajectory needs preprocessing. Example, using gromacs, gmx trjconv -pbc whole -ur compact -c, to center the protein gmx trjconv -fit rot+transxy, to move the rest of the system to get a trajectory with the protein as a reference. Important to note, since PBC are applied in Step 1, pbc=False should be passed in __init__. -The resulting trajectory would be passed to calculate MembraneCurvature.
A final clipping would be also needed (step 3 in the figure below). To clarify, clipping is needed only in the very last step for plotting. I just have no handy figure to illustrate.

I thought of additional improvements to this approach in order to avoid clipping and, instead, adding periodic images around the original box, but realized things can get very complicated because it would be box-dependent etc. So I won't elaborate on this unless I am asked to.

- Solution No. 2

Raw trajectory is passed to init.
We have to pass PBC true in order to get all the atoms in the same primary unit cell.
Apply on-the-fly transformations to rotate and translate the atoms of reference with respect to the protein.
We can use the principal axes of the protein (see the first figure) to identify how to rotate the system.
Finally, we rotate and translate the lipids in every frame of the trajectory, with respect to the protein.

So it would look something like this

def _on_the_fly(self):
        self.protein_reference =  #  we would need a reference provided by the user too
        self.ag.wrap() # wrap coordinates  
        self.center_in_box(self.protein_reference) # center protein
        self.translate(self.ag)  # translate 
        self.rotate.rotateby(angle_from_protein_reference, direction = principal_axes_of_protein_reference, ag=self.ag)

Pros: Solution 1: I would think that the user knows what they are doing when preprocessing on gromacs. It would force the user to understand the type of trajectory that is passing (or I would expect so). Solution 2: Completely independent of third parties for preprocessing. Full MDA dependent.

Cons: Solution 1: Clipping when plotting is mandatory. Further improvement is not straightforward (I didn't elaborate on this). Solution 2: I suspect the beginner user will be tempted to pass the raw trajectory and just kind of "boom! let's see what does this gives me as an output" instead of caring about the calculation behind it. (But again, that my perception and also questionable.).

Edit: This is my view of the problem. The first option supports my laziness and makes the user accountable for what is giving as input. The second demands a little bit more work and would fully rely on MDAnalysis.

Any objections here to any of the solutions suggested? @orbeckst @lilyminium @IAlibay @fiona-naughton @richardjgowers Probably I am missing something, mostly due to the fact I've never worked with on_the_fly transformations on MDAnalysis, so suggestions are welcome.

orbeckst commented 3 years ago

Thanks for the detailed explanation — I'd say you have your next blog post already written, including nice illustrations.

I don't see your two solutions as exclusive. I'd initially work with solution 1 (make the user input the preprocessed trajectory) and work on the core parts of your problem. Once that is done, go back and make life easier for users (add solution 2).

orbeckst commented 3 years ago

More detailed comments

We can discuss your ideas to avoid clipping on the next call. My hunch is that this is ultimately related to the problem that we have been mentioning about tracking rotations to the unit cell.
I don't understand why you'd need the principal axes for anything? Why would you not use the rotation/translation fit? Then the protein will be rotated to the frame of reference provided by the reference structure, which is by definition fixed. (Also, in my experience at least ( in particular in some of @fiona-naughton 's work), the principal components are pretty floppy, especially the ones in the X-Y plane, and can easily change so they are not particularly robust for orienting molecules.)
I'd also try OTF-transformations. In principle, they are a very flexible, powerful and unique way to work with MDAnalysis. The User Guide: On the fly transformations with the link to the blog post is a good starting point to learn more.
(EDIT) Even with the normal rotation and PBC wrapping by the OTF transformation, you will still want to clip the circular region in the figure: that's not any different from pre-processing.
(EDIT) The only way that I see for avoiding the clipping is keeping track of the rot/fit transformation and then repacking the system into a new unitcell in the fixed frame of the reference structure. That was, more or less, my suggestion for another GSOC project...

lilyminium commented 3 years ago

Just some feedback on your solutions.

Meaning, wrapping the coordinates in order to get those "extra" atoms in the same primary unit cell, as it is shown in the figure:

If the trajectory is not unwrapped (as is assumed in 1.1 and 1.2.1), I don't think this will do anything. I think you need to consider the edge rows to be continuous with each other, so you don't end up with np.nans in the curvature arrays as is currently implemented. That's what I would consider treating PBC.

Additionally, by adding this PBC, we can address one of the points that were highlighted earlier in the project about how do we do when the membrane is too close to the edge of the simulation box, that we would eventually have the atoms of references split in lower/upper side of the simulation box.

Simply wrapping coordinates on the z-axis won't fix this in the case that the membrane is split between the bottom and the top of the box. In fact, unwrapping would be a better way to deal with it. Can we unwrap only along certain axes @richardjgowers @orbeckst?

Then, how to apply PBC? We use ag.wrap() in base.py.

I don't really think using wrap() will treat PBC. I'm interpreting the PBC problem on the x-y plane to be that you have np.nans on your edges because you only consider the minimum image, when the system is actually continuous. On the z-axis, that the membrane might split over the edge and end up with really weird curvature values. Are we talking about the same thing?

Completely independent of third parties for preprocessing. Full MDA dependent.

I really like this perk of the second solution -- users do often just treat analysis packages as magic boxes. However, I would think that clipping is also necessary for the second solution, as you're just doing the same thing as solution 1 but within MDA.

The only way that I see for avoiding the clipping is keeping track of the rot/fit transformation and then repacking the system into a new unitcell in the fixed frame of the reference structure. That was, more or less, my suggestion for another GSOC project...

In https://github.com/MDAnalysis/membrane-curvature/issues/36#issuecomment-881985238 I somewhat briefly proposed something similar that doesn't need the new implementation of the box -- get the rot+trans matrices of every frame, tile it around the defined center (probably "protein") and shift the surface generated in derive_surface.

lilyminium commented 3 years ago

PS I love your pictures, and flow charts, very clear!

orbeckst commented 3 years ago

Thank you for reminding me about the issues with the gradient, @lilyminium .

@ojeda-e do the "trivial" cases include NPT simulations where the box size fluctuates or are they only NVT?

I'd say that NVT is the only case where the "trivial" case is actually trivial and wrapping atoms into the primary unitcell will be sufficient. Then you get a surface that is well defined on all grid points (except protein) and you can easily calculate a gradient under PBC by wrapping the finite difference around the edges.

In NPT it's already a question how the fluctuating box is handled at the edges (even if you don't have to keep track of rotations) so you'll already get NaNs. If you were to compute gradients for each frame then that would work, except that these gradients would be awfully noisy. If you compute them from the average surface then the surface is not guaranteed to be PBC-continuous (I think) and hence the gradients are not well defined at the edges.

It seems that there are whole lot of different considerations to be made. What is the smallest "unit of problem" here? Can you identify it, work on it, and simply note any other points as a current limitation? Then work on reducing the known limitations.

orbeckst commented 3 years ago

I think @lilyminium 's https://github.com/MDAnalysis/membrane-curvature/issues/36#issuecomment-881985238 idea to manually do the rot+trans fit to obtain the rot/trans matrix M and then apply it to the surface could work, at least for NVT where the histogramming grid stays the same. You also might need an interpolating function when you rotate the discrete grid or figure out an algorithm to rebin correctly (something like the algorithms used for rotating pixel images). In general (NPT), I think, you might have to rotate the system and replicate the coordinates of the lipids and then histogram the PBC-extended coordinates on the fixed grid. That would work in all cases.

lilyminium commented 3 years ago

Sorry, yes, that sentence was very vague and "it" was doing a lot of work. A naive step breakdown of what I was thinking could look something like:

Define a grid around the central protein
- this could be user provided
- or default to half the box-length of the first frame
- or default to half the minimum box-length of the entire trajectory
for each frame, calculate the rotation + translation matrices needed to fit around the protein rotationally + translationally (which MDAnalysis already has tools for)
tile the coordinates of the lipid atoms into a 3x3 grid as below

3b. optionally perform singular value decomposition to get the plane of best fit for each frame, so any protein tilt doesn't make the whole membrane slant across the z-axis (edit: ~this should come before the tiling, it scales horribly~ edit: ~in fact, performing the SVD on the surface of averaged z-coordinates would be better -- this could be the last step.~ edit: or a much simpler way could be to make sure that the edge borders of the averaged z-surface grid are the same.)

apply the transformation matrices to these coordinates
"clip" by binning into the original grid in step 1 around the protein