On-the-fly transformation for principal component projection

orbeckst commented 4 years ago

Is your feature request related to a problem?

The analysis.pca module calculates the principal components (PC) but does not have a built-in way to generate a trajectory that is projected on a PC (or "filtered by a PC").

Describe the solution you'd like

I would like to be able to generate and analyze a trajectory X_i(t) that was projected on the i-th PC p_i:

X_i(t) = [(X(t) - <X>) . p_i] p_i + <X>

where X(t) is the 3N coordinate vector (the original trajectory), <X> is the mean (that was also used for PCA) and . is the inner product of the deviation from the mean with the PC.

This projection can be implemented as a on-the-fly transformation, which will make it easy to write out a filtered trajectory or just analyze it.

Describe alternatives you've considered

Filtering could be built into the PCA class similar to the PCA.transform() method but then one might be limited to just writing out a trajectory.

One possibility is to add a PCA.project_single_frame(ts) method that does the projection for a single time step. This method could then be used directly for the otf-transformation.

Unsolved issues

One potential issue is that often the PCA is calculated only for a subset of atoms such as the CA. How does one meaningfully extend it to, say, all protein atoms? Should one just apply the same projection to all atoms in a residue, at least if the PCA was residue based? What should one do (or how should one fail) if the PCA was performed just on an arbitrary subset of atoms?

Does it make sense to apply the projection to only a protein and let solvent just move as in the original trajectory?

rishu235 commented 2 years ago

Hi MDAnalysis team,

I have created the pull request #3596 referencing this issue to be eligible for GSOC. I have implemented an inverse PCA transform that filters the trajectory along selected principal component(s). I have a few questions/suggestions on the unsolved issues listed by @orbeckst

One potential issue is that often the PCA is calculated only for a subset of atoms such as the CA. How does one meaningfully extend it to, say, all protein atoms?

Transforming the atoms involved in PCA was straightforward as per the solution suggested in the issue. Additionally, I can add an option to reconstruct the atoms not part of the PCA transformation, say, atoms not in the backbone, in two ways: 1) residue-wise alignment of the backbone is used to transform the side chain coordinates in each residue, 2) the side chain is translated according to the movement of the reference group (e.g., CA or COM of the residue backbone) caused by the PCA filter.

Should one just apply the same projection to all atoms in a residue, at least if the PCA was residue based?

What is meant by residue-based PCA? If the principal components are calculated with all the atoms in a residue, what is the problem in projecting all the atoms in the residue along a principal component?

If PCA is performed with backbone atoms or CA atoms, PCA works in a 3*n dimensional space, n being the number of atoms used for PCA. Will it be meaningful to use the learned projection for a different set of atoms which likely exists in a higher dimensional space?

What should one do (or how should one fail) if the PCA was performed just on an arbitrary subset of atoms?

Are we looking for when and how to warn the user? Is it regarding only PCA? or the filtered trajectory too? One could look for a structural evidence in the filtered trajectory. For example, an outrageous bond/angle value, a mismatch in the secondary structure of the raw trajectory and the filtered trajectory etc. However, this kind of detection would have a high resource demand and possibly become a barrier for an on-the-fly transformation. So, what kinds of tests do we devise here?

Does it make sense to apply the projection to only a protein and let solvent just move as in the original trajectory?

I believe it would be okay for a high-level analysis (e.g., SASA) but prove to be meaningless regarding a knowledge of the solvation structure.

orbeckst commented 2 years ago

Thanks for the detailed questions and comments and PR #3596.

General comments

Given that the PR should help you to fulfill our GSOC requirement, we probably need to limit the scope a bit. From the discussion below it seems that there's quite a lot that needs to be figured out for a comprehensive solution.

I'd also love to hear from other developers such as @lilyminium @jbarnoud and users @rsexton2 what they think.

Discussion

One potential issue is that often the PCA is calculated only for a subset of atoms such as the CA. How does one meaningfully extend it to, say, all protein atoms?

Transforming the atoms involved in PCA was straightforward as per the solution suggested in the issue. Additionally, I can add an option to reconstruct the atoms not part of the PCA transformation,

Yes, that would be needed. Do you know if there's literature on this problem?

How does, e.g., gmx anaeig deals with generating trajectories from PCA?

say, atoms not in the backbone, in two ways: 1) residue-wise alignment of the backbone is used to transform the side chain coordinates in each residue,

That requires PCA on the backbone. Then the structural superposition of the original residue can be performed as you suggest. Perhaps this could be generalized to "if we have at least 3 atoms per residue in the PCA then we do a structural superposition on a per-residue basis and then reset the 3 anchor atoms to the PCA coordinates". (Probably very expensive because doing this "per residue" is not going to be fast.)

What do you do if you only have a CA PCA? How do you get the backbone coordinates?

2) the side chain is translated according to the movement of the reference group (e.g., CA or COM of the residue backbone) caused by the PCA filter.

That would apply also to backbone atoms for a CA PCA, right? In essence, (2) means "any particle in a residue that wasn't part of the PCA will be transformed rigidly in the same way".

This sounds like a decent default approach even though it will surely generate stereochemically incorrect structures.

Should one just apply the same projection to all atoms in a residue, at least if the PCA was residue based?

What is meant by residue-based PCA?

If atoms are picked for the PCA that characterize one residue such as CA or backbone. Just what was discussed in your approach (2).

If the principal components are calculated with all the atoms in a residue, what is the problem in projecting all the atoms in the residue along a principal component?

As long as the PCA atoms sensibly characterize the residue, doing the per-residue transformation (approach (2)) could give a usable result, at least for visualization. Stereochemistry will likely be wrong and we may have clashes.

If PCA is performed with backbone atoms or CA atoms, PCA works in a 3*n dimensional space, n being the number of atoms used for PCA. Will it be meaningful to use the learned projection for a different set of atoms which likely exists in a higher dimensional space?

Probably not in general, which is where my next question was aimed at:

What should one do (or how should one fail) if the PCA was performed just on an arbitrary subset of atoms?
Are we looking for when and how to warn the user? Is it regarding only PCA? or the filtered trajectory too?

I am referring to the filtering and trajectory reconstruction. The user can choose to do the PCA any way they want. But this PCA may not be appropriate for the desired filtering so we would want to have some check there. The safest (and most correct approach) is to fail early and loudly if we are not sure that we can produce a correct result.

Additionally, it's not a given that the input structures are proteins. MDAnalysis is used for all kinds of analysis, including generic polymers and small molecules. Coarse grained models are also often analyzed. We shouldn't make too many assumptions about what data we're working with.

One could look for a structural evidence in the filtered trajectory. For example, an outrageous bond/angle value, a mismatch in the secondary structure of the raw trajectory and the filtered trajectory etc. However, this kind of detection would have a high resource demand and possibly become a barrier for an on-the-fly transformation.

Yes, and as I said, it might be coarse grained, not a protein, etc, so you will likely only be able to make minimal assumptions. Looking at topology attributes sounds reasonable but that might be better implemented as a separate validation check in a trajectory analysis instead of the on-the-fly code.

So, what kinds of tests do we devise here?

That's an excellent question!! I was mostly thinking about things that we can check right at the start when we know the PCA and the trajectory (e.g., do we have at least one PCA "anchor point" in each residue if we do the "per residue translation" or do we have the same number of PCA atoms as in the trajectory if we only do strict application of the PCA.)

Does it make sense to apply the projection to only a protein and let solvent just move as in the original trajectory?

I believe it would be okay for a high-level analysis (e.g., SASA) but prove to be meaningless regarding a knowledge of the solvation structure.

I am inclined to agree. It might be interesting for ligand binding although then one might want to include the ligand in the PCA, too. This looks like another of the situations where ultimately the user needs to decide if it makes sense.

orbeckst commented 2 years ago

@rishu235 's https://github.com/MDAnalysis/mdanalysis/pull/3596#issuecomment-1093660423 from the PR contains a link to movies of AdK filtered on PC1. I updated the timings with numbers from https://github.com/MDAnalysis/mdanalysis/pull/3596#issuecomment-1094119169

Looping through 97 frames in a protein trajectory, testfiles (PSF, DCD) with 214 residues and 3341 atoms, takes

with no transformation: 13.3 ms ± 99.3 µs : raw trajectory 1_raw.mov
backbone PCA, project on backbone and do not transform other atoms: 21.1 ms ± 749 µs with project(0), PCA atoms (backbone) only and non-PCA atoms untouched; one sees how the sidechains do not connect to the backbone: 2_inverse_no_anchor.mov
backbone PCA, extend transformations to residues by applying the PC of the CA ("anchor") to the whole residue: 36.4 ms ± 1.04 ms, with project(0, group=all, anchor='name CA'), backbone-PCA extended to non-PCA atoms by anchoring to a PCA atom (CA) in each residue: 3_inverse_anchor.mov
NO PR YET: backbone PCA, project on backbone, structurally superposition residue on backbone to move the sidechain: ???, 4_inverse_align.mov

When watching the movies on @rishu235 's OneDrive, increase the resolution setting for the movies to 1080p to see all the details. The Option 4 only exists as a proof-of-concept.

MDAnalysis / mdanalysis