Open anabiman opened 4 years ago
Should velocities and forces even be part of the MM Molecule definition? Maybe another way of thinking about this is to retain mmelemental.Molecule
as a subclass of qcelemental.Molecule
and use the geometry
Field to store the coordinates, while velocities/forces are shifted to a Trajectory
class. In other words, mmelemental.Molecule
would never store any time-dependent data. The Trajectory
class would look something like this:
class Trajectory:
topology: Union[Molecule, Array[Molecule]]
coordinates: Union[Array[float], Array[Array[float]]]
velocities: Union[Array[float], Array[Array[float]]]
forces: Union[Array[float], Array[Array[float]]]
This would be an efficient implementation for systems with static topologies. However, if someone wants to instantiate mmelemental.Molecule
from a .gro file which can store velocities and/or forces, the conversion would be either (1) lossy or (2) require the use of Trajectory
class instead.
In the context of writing MMIC translators for MMSchema-MDAnalysis, this problem would then become converting MDAnalysis.Universe
to/from mmelemental.Trajectory
. Is this awkward? I think not, because MDAnalysis.Universe
is implicitly a trajectory class. This means we need 2 types of translators, one for topology & another for trajectory.
So if I'm understanding this correctly, you would be proposing two separate classes, one that handles purely single frame inputs, and one that handles trajectories?
I would suggest instead having Trajectory
be a generator method that Molecule
can be hooked up to, which feeds time-dependent (or in the case of RDKIT just different conformations) from the "input object". This way you can update coordinates over time and avoid the issue of storing everything in memory (otherwise I think you'll very quickly get into OOM errors).
I realise this is very much exactly how MDAnalysis handles things (so my viewpoint is probably biased), where you have a "static" Universe object that contains all the system information and then a timstep object that updates the system coordinates as you traverse through the trajectory (slight simplification of things, but close enough).
Edit; this way, you can have everything be a class that derives from Molecule
and on implementation of a converter you can define if this is going to be a singleframe object (which doesn't need the trajectory generator) or a multiframe object.
You've raised many good points so let me tackle them one by one:
I'm proposing one class that defines a molecule exactly the way qcelemental does now i.e. a static molecule without the velocities/forces, and then another class Trajectory
that stores dynamical information, could be a single or multiple Molecule(s) (multiple conformations, frames, etc.) with the atomic velocities and forces. In other words, one class is Statics
, the other is Dynamics
. This makes sense to me because a molecule should not be physically defined by its velocities or forces.
The issue you raise about OOM errors is very important. I think libraries like MDTraj load a single frame in memory while others like MDAnalysis (as you pointed out) load everything at once. There are advantages and drawbacks to each, but I would like to leave this implementation detail to the application developer. With the Trajectory
class definition above, we could load a single frame at any given time i.e. topology
is a single Molecule
, coordinates
/velocities
/forces
are 1D arrays, but topology
could also be an array of Molecules
, and similarly coordinates
, etc. are 2D arrays that store all frames in memory the way MDAnalysis does. This way, the developer writing a component would choose their own implementation, and the schema is still the same no matter which approach is adopted. In practice, we'll attach a helper method that dictates how self.__iter__
handles/loads trajectory frames.
class Trajectory:
topology: Union[Molecule, Array[Molecule]]
coordinates: Union[Array[float], Array[Array[float]]]
velocities: Union[Array[float], Array[Array[float]]]
forces: Union[Array[float], Array[Array[float]]]
def from_file(self, file: FileInput, load_all: bool):
...
Keep in mind there is no guarantee the trajectory (or any) file persists when MMIC components are deployed in a pipeline. If the developer writes a component that deletes an output e.g. trajectory file at the end of the computation, and the component returns a Trajectory object, then all frames must be stored by this object. If another developer is writing a component that does analysis of a trajectory file and returns something else, then it makes sense to load every frame at a time. So, overall, I think it makes sense to leave this to the developers writing the components in MMIC, which only enforces the input/output models but not the implementation details.
We can make Trajectory
a subclass of Molecule
as you're suggesting but that would mean only a single frame can be loaded at a time (if we assume QCSchema is not going to change, and I highly doubt it will given how qcelemental handles trajectories in QM).
ATM the Molecule model stores atomic information for a single frame. Generalizing this to a
Trajectory
class that stores an array ofMolecule
would be inefficient especially for classical MD simulation in which the topology remains static. Therefore, I'm leaning towards refactoring theMolecule
model into 2 by separating the topology from the atomic attributes (coordinates, velocities, etc.) which would then support time series. This would require either dropping qcelemental's Molecule model as a base, or asking @bennybp to change QCSchema.The 3 core models would then become:
coordinates
,velocities
,forces
,Topology
-> Efficient implementation for static topologies / closed systems typically used in the bio-MD world. Naturally supports trajectories.Molecule
connectivity-related info,symbols
,residues
, etc. -> Generalizable to support time-dependent topologies (Everything is a multi-dimensional array) for use with reactive forcefields, open systems for the materials community, etc.Pinging @IAlibay @lilyminium @fiona-naughton