jbarnoud commented 8 years ago

On-the-fly coordinate transformations

So far, the only transformation MDAnalysis does to coordinates read from a trajectory is a unit conversion. Any other transformation must be done by th user on a frame per frame basis. Yet, in some use case, the user does not directly access the frame, and therefore cannot apply the transformations. This is especially the case with analyses and visualizations. Hereby, I propose a general mechanism to declare coordinate transformations that will be applied by the reader.

Use cases

The main use cases for the proposal are analyses and visualizations that require structure alignements or periodic boundary corrections.

Analyses such as RMSD calculations require the structures to be aligned. So far, the RMSD class control the fit with a keyword, and only implement a single way of doing that fit. This means that if the user wants to do a different fit, he must save an intermediate trajectory. It also means an other analysis that require a fit needs to implement the way to handle it; such implementation being redundant and potentially inconsistent with the other analyses.

@dotsdl posted recently a blog post on nglview. This library allows to visualize a trajectory from MDAnalysis in a jupyter notebook. Like most (if not all) visualization software, nglview does not fix the periodic boundary conditions, which can lead to ugly artifact with bond crossing the box. MDAnalysis could fix the periodic box in a way transparent for the visualization library.

Also, some transformations are painful to do with the tools embedded in the simulation packages, and can require multiple call to tools like gmx trjconv with several intermediate trajectory. With the proposed mechanism it would be possible to declare a workflow of transformations in MDAnalysis and apply them frame by frame, without intermediate files.

New APIs

User facing API

Only few changes are needed in the user facing API: a minima a method should be added to Universe to register a transformation, few methods could also be added to to inspect and modify the transformation workflow.

u = mda.Universe(this, that)
# Register a callback as a transformation
u.add_transformation(mda.transformations.awsome_transformation)
u.add_transformation(mda.transformations.other_transformation)
for ts in u.trajectory:
    # Do something with the frame;
    # the coordinates are transformed by
    # `awsome_transformation` and
    # `other_transformation`, in order.

A transformation is implemented as a callback that takes a TimeStep as argument and modify in on place. Using a callback means that a transformation can be implemented in the simple way possible for that transformation. The simplest transformations can be implemented as a function, so as transformations that require a constant argument:

import numpy
import functools

def constant_translation(ts, vector):
    """
    Translate all the atoms by a given vector
    """
    ts.positions += vector

# Calling my_translation(ts) will be the same as
# calling constant_translation(ts, vector) with
# vector = numpy.array([2, 3, 4])
my_translation = functools.partial(constant_translation,
                            vector=numpy.array([2, 3, 4]))

u = mda.Universe(this, that)
u.add_transformation(my_translation)

The more complicated transformations can be implemented as classes with a __call__ method.

Ideally, transformations that require access to the topology (e.g. making molecule whole) can access it via the universe attribute of the TimeStep for a minimum burden of the user.

It would be nice to be able to add several transformation in one go as a workflow:

workflow = [awsome_transform,
            other_transform,
            super_transform]
u = mda.Universe(this, that)
u.add_transformation_workflow(workflow)

Internal API

Internally, the Universe class must allow to register transformations and must pass them to the coordinate.base.ProtoReader class that apply them, in order, when reading a new frame.

This proposal share problems with proposal #785 by @richardjgowers. Both proposal need a way to declare method, and to execute them when reading a new frame. Auxiliary data should be read before executing the coordinate transformations so the transformation can use the auxiliary data.

Optionally, it could be useful to allow a coordinate transformation to add auxiliary (e.g. rotation matrix used for the fitting). This should be trivial if auxiliary data are attached the TimeStep.

Collection of transformations

For this proposal to be really useful, MDAnalysis must have a collection of transformations ready to use. The more obvious ones are the already implemented structure fitting, and PBC correction methods.

Limitations

Some exiting transformation require to change the topology. It is the case to any transformation that adds virtual particle, or any transformation that coarse grain or back map a trajectory.

For exemple, I recently wrote an script that adds virtual particle to each frame of a trajectory to test the possible bias of an analysis. Also, several of us work with coarse-grained models and need to analyse atomistic simulation as if they were coarse-grained.

These transformations require to adapt the topology of the system, and I do not see a clean way of doing it. Also, it is not clear if such transformations are in the scope of this proposal (even though I would love to be able to use them).

richardjgowers commented 8 years ago

https://github.com/richardjgowers/MDAnalysis-coarsegraining

I've already hacked together something like this for coarse-graining. The difference to other transforms mentioned is this changes the number of particles in the system as you're condensing many positions into one.

Adding a pbc/undo pbc would be a cool transform. I think some subroutines along these lines exist (lib.mdamath.make_whole and lib.distances.apply_PBC), but it would be good to make these more accessible.

@jbarnoud I see we're analysing the same system lately! :P

jbarnoud commented 8 years ago

@richardjgowers I've seen your CGUniverse class. I forgot to link it in the proposal. I am still not sure if I want this proposal to be able to change the topology, perhaps in as a second step.

When I added the virtual particle I mention above, I really would have like to visualize them with nglview.

Adding a pbc/undo pbc would be a cool transform. I think some subroutines along these lines exist (lib.mdamath.make_whole and lib.distances.apply_PBC), but it would be good to make these more accessible.

The tranforms exist already. But they have to be called on a per-frame basis. They will not be called by nglview for instance.

I see we're analysing the same system lately! :P

I hope you have fun with the system. I sure do :D

orbeckst commented 8 years ago

I'd love to have fit+pbc and remap in such a way that the unit cell is not rotating around the fitted group. I don't know of any tool that can do this even though it is clear that it's possible: you just have to carve the same Wigner-Seitz unitcell out of the inifinitely repeated (and rotated) system.

jbarnoud commented 8 years ago

@orbeckst Since the transformations get to alter the TimeStep, and since the unit cell is stored there, it should be possible.

richardjgowers commented 8 years ago

So I like this, I'd just change it so that the object which accepts transformations is the Reader, so all your u. calls become u.trajectory..

richardjgowers commented 8 years ago

We could store these just as a list in the Reader, then inside the Reader we'd have code like..

def _read_next_timestep(self):
    ts = self._read_next()
    for transform in self.transformations:
        transform(self)
    return ts

So each transform could be a partial function, or maybe a class that defines __call__. The class is useful for the case of making molecules whole, which needs bond information.. So you could do something like

from MDAnalysis.transformations import MakeWhole

u.trajectory.transforms.append(MakeWhole(u.bonds))

jbarnoud commented 8 years ago

@richardjgowers This is exactly what I had in mind.

I will try to come up with a working prototype in the next few days (with one or two transformations to play with). But I will be rather busy so do not hold your breath.

jbarnoud commented 8 years ago

_read_next_timestep seems to be the best place to implement the transformations. Yet, it is an abstract method of ProtoReader that get overload by each reader. I am thinking of wrapping the method so the readers overload a method that get called by _read_next_timestep. Maybe I am missing an obvious solution to that?

richardjgowers commented 8 years ago

@jbarnoud so essentially adding another layer between next and _read_next_timestep?

I might have a play with where's best for this... part of me thinks that you could use a decorator like

@with_transforms
def next(self):
    return self._read_next_timestep()

jbarnoud commented 8 years ago

so essentially adding another layer between next and _read_next_timestep?

Basically yes. But the decorator may be overkill. I was thinking about something like:

def _read_next_timestep(self):
    ts = self._read_next()  # Overloaded by the reader
    for transformation in self._transformations:
        ts = transformation(ts)
    return ts

kain88-de commented 8 years ago

Yes @jbarnoud solution looks good. Though I would put it directly into the next function. That way you won't have to change any reader code.

jbarnoud commented 8 years ago

@kain88-de Except that _read_next_timestep is used in various other places. It is used directly in __iter__, and in _read_frame at least.

The change in the readers would be small, just a matter of changing the name of a method where it is defined. Not even where it is called.

richardjgowers commented 8 years ago

So with how #868 has worked out, I don't really see why this can't be implemented using the aux mechanisms. Maybe we subclass aux's to get them to be called transforms everywhere, but I think the mechanics look sound.

jbarnoud commented 8 years ago

@richardjgowers I have a clear idea on how it can get implemented alongside aux, and a rough idea on how it can interact with aux. Could you elaborate on how it could get implemented uning aux?

richardjgowers commented 8 years ago

    def _read_frame_with_aux(self, frame):
        """Move to *frame*, updating ts with trajectory and auxiliary data."""
        ts = self._read_frame(frame)
        for aux in self.aux_list:
            ts = self._auxs[aux].read_ts(ts) 
        return ts

With the snippet above, there's no reason with aux couldn't modify the positions in ts. A lot of the nomenclature is a little confusing (read_ts should be modify_ts) but all the hooks are in the right place.

richardjgowers commented 6 years ago

@davidercruz this is all relevant to you

davidercruz commented 6 years ago

Hey there. After talking with @jbarnoud, he proposed that I look at the reader codes and make a scheme of what happens when we iterate over a trajectory. Here's what I found:

When the user loads a trajectory through the MDAnalysis Universe class eg. u = mda.Universe(PSF, DCD)

If the trajectory exists the load_new() method of the Universe class is called;
inside load_new() the get_reader_for() method of the _get_readers module, which guesses the format and returns the appropriate reader (DCDReader for the example above);
still during the load_new() call, an instance of the "Chosen"Reader class is created. The 1st frame is read and converted to a Timestep. Finally native units are converted to MDAnalysis base, and a Timestep instance is returned by the reader;
finally, the trajectory property from the Universe class becomes a reference to the trajectory reader object, and the user can access the methods from the ProtoReader and all the subclasses.

When the user requests the another frame of the trajectory eg. u.trajectory.next()

the method next() from the ProtoReader creates an instance of Timestep, ts, by calling the _read_next_timestep() method, which is overloaded by the _read_next_timestep() of the reader class for that format;
inside the _read_next_timestep() of the reader class, the next frame is read from the file and converted to Timestep as before. Units are also converted as before.
next() returns ts

@richardjgowers, I saw #1215 . From what I understand, the transformations would be applied every time a frame is loaded from the file. But what happens when the trajectory is loaded into the memory? Is the frame actually modified permanently? @jbarnoud was telling me that the MemoryReader would be problematic due to this, as we would need to keep track of whether the frame was already modified before (eg. after reaching the EOF and rewinding). Otherwise the auxiliaries look like a good way to apply on-the-fly transformations.

A solution would be to apply the transformations when the trajectory is transferred to the memory. What do you think?

richardjgowers commented 6 years ago

@davidercruz yep memoryreader will be a headache. I agree doing it all up front is the best solution, so each time a transformation is added, the (in-memory) trajectory is iterated through and the transformation applied. (You could maybe come up with some solution which lazily applies a transformation the first time a frame was requested, but that just sounds like asking for trouble.)

Your schematic of what happens when iterating looks solid, there's also __getitem__ and sliced iteration but they work in pretty much the same way. So really here is ideally the only place you're "allowed" to edit Readers to make your transformations happen on any frame access.

You could redefine MemoryReader._read_frame_with_aux() to avoid the double transformation problem, but it's a little ugly and introduces an extra burden as you then have to maintain two implementations of the Reader API. An interesting idea would be for MemoryReader.add_transformation() (which would override BaseReader.add_transformation()) to apply then neuter any transformations a user adds, so that when they're later called in BaseReader._read_frame_with_aux() they're inert.

davidercruz commented 6 years ago

@richardjgowers yes, that place avoids changes to the format readers. My only issue is that it's not what the function was initially designed to due. But code evolves :D

What about changing things at the reader level? For example, the _read_next_timestep method of the format reader classes seems to be always called (except when the trajectory is first read). So before returning ts the transformations could be applied. The advantage is that I think it would give us more freedom with our implementation, and could make it easier to read the code. The disadvantage is the same lines of code on multiple modules. I'll try it with simple function later.

jbarnoud commented 4 years ago

This proposal is now implemented.

MDAnalysis / mdanalysis