ojeda-e opened 3 years ago
pandas is slow, use numpy/scipy!
In GridDataFormats we have the option to interpolate on a grid https://github.com/MDAnalysis/GridDataFormats/blob/be6132ac13041390a880061e4e873044b6c29573/gridData/core.py#L311 ; this is not the cleanest code but perhaps useful to initially look at.
Commenting here instead of the PR so that discussion is in one place: I'm not sure that this is a good idea. There is no membrane where the protein is, so there shouldn't be any membrane curvature there either.
Thanks @lilyminium .
I was considering making the interpolation optional and adding a warning message for when the `np.nan`s are simply too many. I didn't add anything in `base.py` yet; this is the function only. Here are three reasons to consider.
1) If there are `np.nan`s in the region of the protein, the adjacent values will also be `np.nan`s, which means we may lose information around the protein. Yes, in the region of the protein we already lost data, but keeping the nans means the resulting array will add `np.nan`s to all the adjacent cells. To make this statement visual, here is an example of what happens when that region is not interpolated, and how the calculation of the gradients spreads nans. See, for instance, how a single undefined value ends up in an extended region of `np.nan`s in the bottom left corner. The same happens in all the cells that delineate the protein.
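The spreading effect can be reproduced with a minimal `numpy` sketch (a hypothetical 5×5 grid, not the actual package code): a single `np.nan` contaminates its neighbours after one call to `np.gradient`.

```python
import numpy as np

# Hypothetical 5x5 height grid with a single undefined bin.
z = np.arange(25, dtype=float).reshape(5, 5)
z[2, 2] = np.nan

# One gradient pass: central differences pull the NaN into the
# neighbouring bins along each axis. Perversely, the central
# difference at (2, 2) itself, (z[2, 3] - z[2, 1]) / 2, comes out
# finite but is meaningless.
dz_dy, dz_dx = np.gradient(z)

print(np.isnan(z).sum())      # 1 undefined value to start
print(np.isnan(dz_dx).sum())  # 2: spread to the x-neighbours
print(np.isnan(dz_dy).sum())  # 2: spread to the y-neighbours
```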
2) With no interpolation, contours won't happen. `plt.contour` doesn't work if the data has `np.nan`s, which limits the user to `plt.imshow`.
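As a workaround on the plotting side, the `np.nan`s can be wrapped in a masked array before plotting; matplotlib's contour functions skip masked cells rather than failing (a sketch of the masking step only, with a toy grid):

```python
import numpy as np

# Toy curvature grid with undefined bins where the protein sits.
curvature = np.ones((4, 4))
curvature[1:3, 1:3] = np.nan

# Mask the invalid entries instead of dropping or interpolating them.
masked = np.ma.masked_invalid(curvature)

print(int(masked.mask.sum()))   # 4 masked (undefined) bins
print(float(masked.mean()))     # 1.0: statistics ignore the masked region
```

Passing `masked` to `plt.contourf` should then draw contours only over the defined bins, leaving the protein region blank.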
3) Even in the case when we say, ok, fine, let's go with `plt.imshow` instead, the `np.nan`s remain problematic unless you opt for no interpolation (`interpolation='none'`). Interpolations are not only more pleasant to read, they also make results clearer. Almost all the interpolation methods are based on gradients, and again, the gradient of `np.nan` is more `np.nan` (annoying `np.nan`s). To visualize this limitation, let's see what the `plt.imshow` plot of Gaussian curvature looks like with different types of interpolation:

Even with the best-behaved interpolations (marked with cyan rectangles), there is a potential loss of information, more or less significant depending on the interpolation method implemented by `plt.imshow`.
In summary, untreated undefined values may result in extended regions of `np.nan`s, which ultimately results in a considerable loss of information.
The user will have the option to interpolate the `np.nan`s. A warning of the "You have too many nans and your results will be driven by undefined values, check your input" kind can pop up when:
1) the user overestimates the number of `n_x_bins`, `n_y_bins` in the grid, or
2) the protein (or any object inserted) occupies too many `n_x_bins`, `n_y_bins` in the cell (either because the section of the protein in the membrane is huge, or because the user underestimated the number of `n_x_bins`, `n_y_bins` in the grid).
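A check like the one described could be sketched as follows (the helper name and the threshold are hypothetical, not part of the actual codebase):

```python
import warnings

import numpy as np

def warn_if_undersampled(surface, max_nan_fraction=0.2):
    """Warn when too many grid bins are undefined (hypothetical helper)."""
    nan_fraction = float(np.isnan(surface).mean())
    if nan_fraction > max_nan_fraction:
        warnings.warn(
            "You have too many nans and your results will be driven by "
            "undefined values; check your input (n_x_bins/n_y_bins).",
            UserWarning,
        )
    return nan_fraction

# A grid where a quarter of the bins never got populated.
grid = np.zeros((4, 4))
grid[:2, :2] = np.nan
print(warn_if_undersampled(grid))  # 0.25, and a UserWarning is emitted
```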
But in any case, and in the way I see the solution to calculate curvature, the option to interpolate has to be provided. I was considering an option to mask the protein, but I won't elaborate on it (unless I am asked to) and will leave it as an enhancement.
Thank you for your detailed and clear explanation, @ojeda-e. However, I think I must be missing something, because I don't understand how the protein causes that nan value on the edge. How big is the membrane relative to the grid you've illustrated?
I would think that the more flexible solution is to return all the data and let users make their own choices about how they want to interpolate and visualize the grid.
> In summary, untreated undefined values may result in extended regions of np.nans. Which ultimately results in a considerable loss of information.
Is this still true with your current code that uses `nanmean` to calculate the mean curvatures?
Sorry @lilyminium, my bad for not clarifying.
My intention with the illustration was to show the same effect (spread of undefined values) due to separate effects:
1) The effect of an `np.nan` when it is at the edge (the ones in the squares), and
2) the effect of a group of `np.nan`s (coming from the embedded protein).
In the first case, it is easier to identify how a single `np.nan` spreads over the calculation of curvature. But the `np.nan`s in case 1 are not in any fashion correlated with those caused by the insertion in the membrane (the protein). :)
> Is this still true with your current code that uses nanmean to calculate the mean curvatures?
You mean, whether even by using `nanmean` we may still have a considerable loss of information?
My answer is: I think yes, it can happen, but it depends on several factors. For example, in systems with extremely short simulation times (low `n_frames` == bad idea); and/or if the user overestimates the bins and during the `n_frames` not even one lipid populates them. It can even depend on the lipid composition. For example, if I have a membrane with (let's be extra) 90:10 CHOL:PC, we would have only 1/10 of the total number of lipids in the system to use as a reference to derive the surface. Then it will be hard to fill all the cells in the grid (but again, it also depends on how big the unit cells are).
In these cases, there may be room for a case in which a full array is nans. It's unlikely, but I think it can happen.
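`np.nanmean` indeed only helps while at least one frame populates a bin; a bin that stays empty across all frames still comes out as `np.nan`. A minimal sketch of that edge case:

```python
import numpy as np

# Per-frame surfaces: bin (0, 0) is never populated in any frame.
frames = np.ones((3, 2, 2))   # 3 frames of a 2x2 grid
frames[:, 0, 0] = np.nan

# nanmean ignores NaNs where it can; an all-NaN slice stays NaN
# (numpy also emits a "Mean of empty slice" RuntimeWarning here).
mean_surface = np.nanmean(frames, axis=0)

print(mean_surface[1, 1])   # 1.0 -- populated bins average normally
print(mean_surface[0, 0])   # nan -- the never-populated bin stays undefined
```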
Edit: Maybe we can have some opinions from @MDAnalysis/gsoc-mentors ?
I'm very impressed by your detailed analysis @ojeda-e! In general my approach is that a `np.nan` is a `np.nan` and so be it. You can give the user the option of using some kind of method to get rid of them, with something like a `new_data = old_data.interpolate(kernel='gaussian')`. I (try to) view `np.nan` as a friend rather than an enemy, as it stops you getting absolute rubbish and not realising it, as you might under some other circumstances.
Thanks, @hmacdope.
I agree with the general approach of `np.nan` is `np.nan`. However, in the calculation of curvature there are two serious limitations that, when combined, make the problem more complicated.
One is sampling. We know that. We know that we need i) a high number of points to derive a surface, and ii) a high number of frames.
The second is the definition of curvature itself. Curvature is defined by second partial derivatives, as it is in `curvature.py`.
The equations are
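For reference, assuming the surface is expressed as a Monge patch $z = h(x, y)$ (the standard formulation for membrane surfaces; the original equation images are not reproduced here), the mean curvature $H$ and Gaussian curvature $K$ are:

```latex
H = \frac{(1 + h_x^2)\, h_{yy} - 2\, h_x h_y h_{xy} + (1 + h_y^2)\, h_{xx}}
         {2\, (1 + h_x^2 + h_y^2)^{3/2}},
\qquad
K = \frac{h_{xx}\, h_{yy} - h_{xy}^2}{(1 + h_x^2 + h_y^2)^{2}}
```

Both expressions require the first partial derivatives ($h_x$, $h_y$) and all second partial derivatives ($h_{xx}$, $h_{yy}$, $h_{xy}$) of the surface.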
Now, the fact that we calculate second partial derivatives means that we calculate the gradient twice. In this example, we can see the effect of only one `np.nan` on the calculation of the first and second gradients.
These are some of the reasons why calculating the curvature of biological membranes can get extremely hard. The formalism asks for second partial derivatives. Gradients are calculated as differences between adjacent values. Now, add to this rigid formalism that we depend on the lipid composition of the membrane. So, if we have a bilayer with 1:1:1 phospholipid_1:phospholipid_2:other_lipid, where other_lipid is any type of lipid that has a flip-flop rate high enough that we have several flip-flop events in the interval we are averaging over, well, we lose 1/3 of the elements used to derive the surface. Now, imagine that, on top of all the above, you had a simulation setup small enough that your protein covers lots of unit cells.
You'll get `np.nan`s. If you don't treat them, the calculated curvature, according to the equations shown, will be meaningless.
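Applying `np.gradient` twice, as second partial derivatives require, makes the spread explicit (a minimal sketch, not the package's `curvature.py`): each pass widens the undefined region.

```python
import numpy as np

# Hypothetical 7x7 height grid with one undefined bin.
z = np.arange(49, dtype=float).reshape(7, 7)
z[3, 3] = np.nan

dz_dy, dz_dx = np.gradient(z)          # first partial derivatives
d2z_dx2 = np.gradient(dz_dx, axis=1)   # one second derivative, along x

# Each differentiation pass widens the NaN region along the row:
print(np.isnan(z).sum())        # 1 undefined bin in the surface
print(np.isnan(dz_dx).sum())    # 2 undefined bins after the first pass
print(np.isnan(d2z_dx2).sum())  # 3 undefined bins after the second pass
```

Combining all the second derivatives needed for curvature (`h_xx`, `h_yy`, `h_xy`) takes the union of the undefined regions, so the contaminated area grows further still.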
In this context, and as I see the problem, having the option to interpolate is necessary. The possibility of abuse of interpolation methods, that's another story. That the user may be calculating curvature in a simulation setup that is not suitable for calculating curvature, that's another story too. The user will be more than welcome to turn interpolation off. But it should be offered.
When calculating `get_z_surface`, undefined values in bins inside the embedded element (i.e. the protein) may arise, since no z coordinates populate those bins in the grid. Such undefined values spread in the array during the calculation of curvature. One possible solution is to interpolate the bins occupied by the protein. I would avoid using `pandas`, although that approach seems the most straightforward. Check `np.interp` (1-D only) or `scipy.interpolate` (for 2-D grids).
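A sketch of filling the undefined bins with `scipy.interpolate.griddata` (illustrative only; the helper name and the `method='linear'` choice are assumptions, not part of the package):

```python
import numpy as np
from scipy.interpolate import griddata

def fill_nan_bins(surface, method="linear"):
    """Interpolate undefined grid bins from the defined ones (sketch)."""
    yy, xx = np.indices(surface.shape)
    defined = ~np.isnan(surface)
    # Interpolate the missing bins from the coordinates/values of
    # the defined bins, evaluated back on the full grid.
    return griddata(
        (xx[defined], yy[defined]),   # known bin coordinates
        surface[defined],             # known bin values
        (xx, yy),                     # evaluate on the full grid
        method=method,
    )

# A flat z-surface with a hole where the protein sits.
z = np.ones((5, 5))
z[2, 2] = np.nan
filled = fill_nan_bins(z)
print(np.isnan(filled).sum())   # 0 -- the hole has been filled
```

Note that `griddata` with `method='linear'` only fills bins inside the convex hull of the defined points, so holes in the interior (the protein) are recovered while truly empty grid borders would remain `np.nan`.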