tylerjereddy commented 8 years ago

Just thought it would be a good idea to 'bookmark' this idea from one of the hackathon teams I was coaching a bit at the recent CECAM 2015 workshop in Germany. I hope they will follow through with a pull request at some point.

In short, there are almost certainly better ways for MDAnalysis to deal with coarse-grained (CG) simulation data (i.e., MARTINI). Of course, MDAnalysis atoms objects aren't really atoms in a CG context, but rather represent the basic fundamental units in the coordinate file (i.e., a single set of Cartesian coordinates isn't necessarily an actual atom, but can also be a representative particle). While I've gotten rather used to doing this 'mental translation' of nomenclature, and there are other things like dihedral angles that also mean something completely different if you are dealing with a CG context, there are at least some system properties that could probably be handled in a clearly 'better' way without too much effort. This is probably most clearly demonstrated with a simple example of the two main properties focused on by the team -- charge and mass:

import MDAnalysis
u = MDAnalysis.Universe('popc-1500-CG-phosphates.gro') #only contains CG PO4 particles
sel = u.select_atoms('bynum 1:10') #select first 10 phosphate particles in system 
sel.atoms.names
>>> array(['PO4', 'PO4', 'PO4', 'PO4', 'PO4', 'PO4', 'PO4', 'PO4', 'PO4', 'PO4'], 
      dtype='|S3')
sel.atoms.masses
>>> array([ 30.974,  30.974,  30.974,  30.974,  30.974,  30.974,  30.974,
        30.974,  30.974,  30.974])
sel.atoms.charges
>>> array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

MDAnalysis is inferring the masses based on the element corresponding to the first character in the atom name string--which is phosphorus here. Most CG MARTINI particles have masses of 72.0 (4 * water, based on the built-in general 4:1 mapping used in the forcefield), although some (i.e., ring particles in some amino acids) have smaller masses. The charges are not being handled at all really (should all be -1 here).

While I do agree it would be a good idea to return the correct values, I do have to concede that in this case I've never actually wanted to probe the masses or charges of CG particles, and the masses in particular don't strike me as having any immediately obvious analytical value.

The team was also considering automatic parsing of the forcefield files to reduce manual intervention (and maintenance, moving forward), and automatically inferring the type of coarse grained forcefield (as well as allowing a keyword argument specification to Universe). Those are probably quite ambitious ideas! I think @orbeckst also mentioned that some of the underlying infrastructure may be subject to reorganization at some point soon.

mnmelo commented 8 years ago

I agree that having a cleverer MDAnalysis in this respect might be helpful to the uninitiated, but it seems these cases can be solved by passing in the .tpr instead of the .gro for the topology. If you're coming from GROMACS, you'll have faced the same issues there.

But I don't mean to say we should stick to the same limitations as GROMACS. If we can find a way to have a (tunable) smart behavior, then I'm all for it. ('Tunable' because we need to be careful about the assumptions we make of a system from atom/residue names alone...)

richardjgowers commented 8 years ago

I think all the guessers for mass/type are currently hard coded to expect proteins. It would be cool to make this more flexible so you could load a universe and say u = mda.Universe(..., type='martini'). This would then choose a different set of lookup tables in the back.

With Martini having a finite well defined set of particles, it should be a good place to start.

mnmelo commented 8 years ago

This sounds like a good way to me.

tylerjereddy commented 8 years ago

Yeah, that's what they had started doing.

orbeckst commented 8 years ago

315 and #104 are related to a more flexible auto-detection of atom types.

Generally, it would be nice if you could easily say "select protein" or "select lipid" or "select nucleic" without having to care about your forcefield.

(Note that there are already some corner cases that are arguably bugs: if you select a protein with an amino acid such as glutamate or leucine bound as a substrate, the protein selection will add the substrate to the "protein"... perhaps we should do a fragment check for "protein" and introduce another selection such as "aminoacid" for the old behavior).)

jbarnoud commented 8 years ago

I like what @richardjgowers proposes. By giving more context to the Universe we allow it to be smarter with atom and residue names. Yet, how would the Universe manage conflicts between informations read from the topology and from the given context? For instance, if I use Martini but I changed the mass of a Martini bead (for any weird reason), then the mass read from the TPR will differ from the expected Martini mass. I guess, MDAnalysis should use the one from the TPR, right?

Also, instead of something like u = mda.Universe(..., type='martini'), I would suggest something like

universe_type = mda.UniverseTypeMartini()
u = mda.Universe(..., type=universe_type)

This would allow more flexibility for the enduser. For instance, if I use a custom lipid I can add it to the list of known lipid residue by overwrite or inherit from mda.UniverseTypeMartini. Also we can have something like mda.UniverseType.from_file() to store force field context in files.

315 and #104 suggest the use of user wide RC files. I really dislike this idea as it is extremely error prone. Indeed, the expected behaviour of MDAnalysis will change from user to user meaning a script that works for me may not work for a colleague. Worst, my RC file may do assumptions on what system I work with (like assuming I only work with Martini simulations) that are easily forgot when working on a different system. Having the context passed as an object would allow to create the instance from a file; this would fix #315 and #104, without having the issues with RC file as the files are explicitly loaded.

richardjgowers commented 8 years ago

I think using the guess_* functions will always be a last option when parsing.

I think your idea of UniverseType will boil down to being a set of dicts for different attributes (ie mapping type to mass etc), so should be very possible. Rather than external .rcs I'd rather have a set of built in types, which are easy to hack into.. something like

mytype = mda.UniverseType.MARTINI

mytype.mass['this guy'] = 100.0

u = mda.Universe(... type=mytype)

But then these aren't really types, they're more like forcefields. If we actually included forcefields in mda it would open up a lot of cool possibilities (along with less cool headaches :D)

jbarnoud commented 8 years ago

On 26/10/15 16:46, Richard Gowers wrote:

I think using the guess_* functions will always be a last option when parsing.

I think your idea of |UniverseType| will boil down to being a set of dicts for different attributes (ie mapping type to mass etc), so should be very possible. Rather than external .rcs I'd rather have a set of built in types, which are easy to hack into.. something like

|mytype = mda.UniverseType.MARTINI mytype.mass['this guy'] = 100.0 u = mda.Universe(... type=mytype) |

But then these aren't really types, they're more like forcefields. If we actually included forcefields in mda it would open up a lot of cool possibilities (along with less cool headaches :D)

— Reply to this email directly or view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/507#issuecomment-151179640.

Indeed, having force fields in mda requires a lot of effort to stay up to date. But if not for force field, what do you call types?

orbeckst commented 8 years ago

On 26 Oct, 2015, at 02:15, Jonathan Barnoud wrote:

315 and #104 suggest the use of user wide RC files. I really dislike this idea as it is extremely error prone. Indeed, the expected behaviour of MDAnalysis will change from user to user meaning a script that works for me may not work for a colleague. Worst, my RC file may do assumptions on what system I work with (like assuming I only work with Martini simulations) that are easily forgot when working on a different system. Having the context passed as an object would allow to create the instance from a file; this would fix #315 and #104, without having the issues with RC file as the files are explicitly loaded.

You're making a good point regarding RC files. Perhaps we need a more nuanced discussion ta #315 � I think there are some settings (the stuff in core.flags) that can definitely be put into RC files (although my impression is that no-one actually changes flags and then we might just be doing a lot of work for nothing).

I still think that a user should be able to have a say in how detection works. For the H-bond analysis we are following the example that you laid out: if it's a big change: subclass [1]. It seems worthwhile to discuss, which pattern works best for the user without falling into the trap of unintentionally broken installations.

[1] https://pythonhosted.org/MDAnalysis/documentation_pages/analysis/hbond_analysis.html#detection-of-hydrogen-bonds

MDAnalysis / mdanalysis

MDAnalysis CG / MARTINI awareness #507

315 and #104 are related to a more flexible auto-detection of atom types.