Support for Trajectories

ATM the Molecule model stores atomic information for a single frame. Generalizing this to a Trajectory class that stores an array of Molecule would be inefficient especially for classical MD simulation in which the topology remains static. Therefore, I'm leaning towards refactoring the Molecule model into 2 by separating the topology from the atomic attributes (coordinates, velocities, etc.) which would then support time series. This would require either dropping qcelemental's Molecule model as a base, or asking @bennybp to change QCSchema.

The 3 core models would then become:

Molecule: stores coordinates, velocities, forces, Topology -> Efficient implementation for static topologies / closed systems typically used in the bio-MD world. Naturally supports trajectories.
Topology: stores Molecule connectivity-related info, symbols, residues, etc. -> Generalizable to support time-dependent topologies (Everything is a multi-dimensional array) for use with reactive forcefields, open systems for the materials community, etc.
ForceField: stores the inter-atomic potential params mapped to the structure. Important to keep this separate from the molecular topology.

Pinging @IAlibay @lilyminium @fiona-naughton

Should velocities and forces even be part of the MM Molecule definition? Maybe another way of thinking about this is to retain mmelemental.Molecule as a subclass of qcelemental.Molecule and use the geometry Field to store the coordinates, while velocities/forces are shifted to a Trajectory class. In other words, mmelemental.Molecule would never store any time-dependent data. The Trajectory class would look something like this:

class Trajectory:
     topology: Union[Molecule, Array[Molecule]]
     coordinates: Union[Array[float], Array[Array[float]]]
     velocities: Union[Array[float], Array[Array[float]]]
     forces: Union[Array[float], Array[Array[float]]]

This would be an efficient implementation for systems with static topologies. However, if someone wants to instantiate mmelemental.Molecule from a .gro file which can store velocities and/or forces, the conversion would be either (1) lossy or (2) require the use of Trajectory class instead.

In the context of writing MMIC translators for MMSchema-MDAnalysis, this problem would then become converting MDAnalysis.Universe to/from mmelemental.Trajectory. Is this awkward? I think not, because MDAnalysis.Universe is implicitly a trajectory class. This means we need 2 types of translators, one for topology & another for trajectory.

So if I'm understanding this correctly, you would be proposing two separate classes, one that handles purely single frame inputs, and one that handles trajectories?

I would suggest instead having Trajectory be a generator method that Molecule can be hooked up to, which feeds time-dependent (or in the case of RDKIT just different conformations) from the "input object". This way you can update coordinates over time and avoid the issue of storing everything in memory (otherwise I think you'll very quickly get into OOM errors).

I realise this is very much exactly how MDAnalysis handles things (so my viewpoint is probably biased), where you have a "static" Universe object that contains all the system information and then a timstep object that updates the system coordinates as you traverse through the trajectory (slight simplification of things, but close enough).

Edit; this way, you can have everything be a class that derives from Molecule and on implementation of a converter you can define if this is going to be a singleframe object (which doesn't need the trajectory generator) or a multiframe object.

You've raised many good points so let me tackle them one by one:

I'm proposing one class that defines a molecule exactly the way qcelemental does now i.e. a static molecule without the velocities/forces, and then another class Trajectory that stores dynamical information, could be a single or multiple Molecule(s) (multiple conformations, frames, etc.) with the atomic velocities and forces. In other words, one class is Statics, the other is Dynamics. This makes sense to me because a molecule should not be physically defined by its velocities or forces.
The issue you raise about OOM errors is very important. I think libraries like MDTraj load a single frame in memory while others like MDAnalysis (as you pointed out) load everything at once. There are advantages and drawbacks to each, but I would like to leave this implementation detail to the application developer. With the Trajectory class definition above, we could load a single frame at any given time i.e. topology is a single Molecule, coordinates/velocities/forces are 1D arrays, but topology could also be an array of Molecules, and similarly coordinates, etc. are 2D arrays that store all frames in memory the way MDAnalysis does. This way, the developer writing a component would choose their own implementation, and the schema is still the same no matter which approach is adopted. In practice, we'll attach a helper method that dictates how self.__iter__ handles/loads trajectory frames.

class Trajectory:
     topology: Union[Molecule, Array[Molecule]]
     coordinates: Union[Array[float], Array[Array[float]]]
     velocities: Union[Array[float], Array[Array[float]]]
     forces: Union[Array[float], Array[Array[float]]]

     def from_file(self, file: FileInput, load_all: bool):
            ...

Keep in mind there is no guarantee the trajectory (or any) file persists when MMIC components are deployed in a pipeline. If the developer writes a component that deletes an output e.g. trajectory file at the end of the computation, and the component returns a Trajectory object, then all frames must be stored by this object. If another developer is writing a component that does analysis of a trajectory file and returns something else, then it makes sense to load every frame at a time. So, overall, I think it makes sense to leave this to the developers writing the components in MMIC, which only enforces the input/output models but not the implementation details.
We can make Trajectory a subclass of Molecule as you're suggesting but that would mean only a single frame can be loaded at a time (if we assume QCSchema is not going to change, and I highly doubt it will given how qcelemental handles trajectories in QM).

MolSSI / mmelemental

Support for Trajectories #2