MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.31k stars 648 forks source link

Add contexts to guess attributes better (especially elements and masses) #2630

Open lilyminium opened 4 years ago

lilyminium commented 4 years ago

I started writing this as a comment in #2553 but it got, wow, really long. Let me know if I should move it there to carry on the discussion, or to #598.

Is your feature request related to a problem?

I think any effort to work with element-dependent stuff like finding hydrogens from H-bond donors (#2521) will be hampered by the element/type guesser, which is not good (...noting that I wrote the current version). Workarounds like looking at mass only work if the mass is not derived from an incorrectly guessed element. Few file formats directly provide element information, so it needs to be guessed from other information. In practice elements are only guessed from atom names, leading to all sorts of fun results.

Here I round up a bit of what has been discussed in the past and (re-)propose how it could be improved. This is more of an initial idea to gather suggestions than a real solution. Please let me know what you think!

Summary

The problem of guessing elements (and topology attributes in general) has come up many times (#598, #942, #1808, #2331, #2348, #2364, #2553). Some non-comprehensive history of discussion and current state of affairs:

In #2553 there has been interesting discussion of how to guess elements appropriately, and what we can infer from them (e.g. element <-> mass is no longer so straightforward, given HMR systems).

Describe the solution you'd like

We need a better element guesser (which then results in better guessing of related properties). With that solution, (imo) we should consider these:

And these would be nice to have:

Proposal

This is just @jbarnoud and @mnmelo's proposal but with more spitballing --

These could either be part of the Context class, be their own class, or (my favourite) be class methods of TopologyAttr. I don't think bundling them with Context is so helpful because most of the CG/HMR issues can be solved by just changing the values in the table. This also means we don't have to validate categorical values like elements ourselves, we just look them up in the table.

API

An API that matches add_TopologyAttr would be convenient for the user:

    >>> u = mda.Universe(PSF, DCD)  # <-- just a protein
    >>> u.guess_TopologyAttr('elements', from_attr=['names'], context='no-alpha-carbons')
    >>> u.atoms[u.atoms.names=='CA'].elements
    array(['Ca', 'Ca', ..., 'Ca'])
    >>> u.guess_TopologyAttr('elements', from_attr=['masses'], context='no-alpha-carbons')
    >>> u.atoms[u.atoms.names=='CA'].elements
    array(['C', 'C', ..., 'C'])

and developers when working with Parsers, which all contribute different information:

    el = Elements.guess_from(names=[...], resnames=[...])  # <-- no need to write guessed=True

Additional context

Below is an example implementation that could work well with that desired API. It is far from the real thing, just to show what I'm thinking.

Context class that contains the data and looks up close matches and stuff

class Context(metaclass=ContextMeta):

    df = pd.DataFrame(my_data)
    name = 'base'

    def lookup(self, attrname, strict=False, **from_attrs):
        """Return exact matches for variable number of attributes"""

        # get overlapping info
        info_attrs = [a for a in from_attrs if a in self.df.columns]
        if not info_attrs:
            raise ValueError('No valid guessing information given')

        # set up results
        values = np.array([from_attrs[a] for a in info_attrs]).T
        found = np.array([None]*values.shape[0], dtype=object)

        # group dataframe by info, get the first match for each combination
        columns = info_attrs+[attrname]
        first_df = self.df.groupby(*info_attrs, sort=False).first()
        first_arr = first_df[columns].values

        if strict:
            valid = np.ones_like(values, dtype=bool)  # match all exactly
        else:
            valid = (values!=None) & (first_arr!=None)  # or something like this
                                                        # that caters for charge==0 etc
        masked = values[valid]
        for i, row in enumerate(first_arr[valid]):
            matches = np.all(masked == row[:-1], axis=1)
            found[matches] = row[-1]
        return found

    def find_closest_match(self, attrname, value, return_attr=None, threshold=None):
        """Return close matches or None"""

        if return_attr is None:
            return_attr = attrname

        options = self.df[attrname].values
        if np.issubdtype(options.dtype, np.number):
            i = np.argmin(np.abs(options-value))
            # probably die if difference is greater than a % threshold

        elif np.issubdtype(options.dtype, np.character):
            found_idx = [[]]
            # try different capitalisation
            try_values = [value, value.capitalize(), value.upper(), value.lower()]

            # look for element names in the given value
            while len(found_idx[0]) == 0 and try_values:
                v = try_values.pop(0)
                start = np.char.find(options, value)
                found_idx = np.nonzero(start!=-1)
            if len(found_idx[0]) == 0:
                # not found, return None
                return

            # pick closest match (lowest number of other letters), then earliest
            remaining = np.str_len(options[found_idx]) - len(value)
            r, i_r, n_r = np.unique(remaining, return_counts=True, return_inverse=True)
            if n_r[0] > 1:
                lowest_indices = i_r[0]
                start_i = np.argmin(start[found_idx[lowest_indices]])
                i = found_idx[lowest_indices[start_i]]
            else:
                i = found_idx[i_r[0]]
        else:
            warnings.warn('What other dtypes does MDAnalysis have?')
            return

        return self.df[return_attr].values[i]

Example guessing method for Elements

This is actually pretty general and I guess there could be some base method where you pass in match_exact and match_similar. It takes both instantiated TopologyAttr objects and keyword-named arrays, and returns an Element instance.

@classmethod
def guess_from(cls, *args, context='base', **kwargs):
    info = dict((a.attrname, a.values) for a in args)
    for attrname, values in kwargs.items():
        info[attrname] = np.asarray(values)

    if not info:
        raise ValueError('Nothing to guess from')

    my_context = _CONTEXTS[context]

    n_atoms = len(next(iter(info.values())))
    values = np.array([None]*n_atoms, dtype=object)
    missing = np.ones_like(values, dtype=bool)
    info_used = set()

    # exact types > identifying masses/radii > names
    match_exact = [('types',),
                   ('masses', 'radii'),
                   ('names', 'resnames')]

    while np.any(missing) and match_exact:
        names = match_exact.pop(0)
        attr_info = dict((k, info[k][missing]) for k in names if k in info)
        if attr_info:
            found = my_context.lookup('elements', **attr_info)
            valid = found!=None
            if np.any(valid):
                values[missing][valid] = found[valid]
                info_used |= set(names)
                missing = values == None

    match_similar = ['names', 'types']

    while np.any(missing) and match_similar:
        name = match_similar.pop(0)
        if name in info:
            to_find, idx = np.unique(info[name][missing], return_inverse=True)
            matches = [my_context.find_closest_match('elements', n) for n in to_find]
            matches = np.array(matches)[idx]
            valid = matches!=None
            if np.any(valid):
                values[missing][valid] = matches[valid]
                info_used.add(name)
                missing = values == None

    warnings.warn(f'Guessed elements from {info_used}')  # one warning only
    if np.any(missing):
        warnings.warn(f'Some elements could not be found and are set to ""')
    return cls(values, guessed=True)
jbarnoud commented 4 years ago

I did not look at the code, but I find the proposal sound. Guessers have always been what frustrates me the most in mdanalysis.

One thing to be aware of is that some contexts will require maintenance and careful versioning. I think especially about a context for the martini coarse grained force field : to guess masses from a gro file, you need your context to know the mapping for all the residues in your universe. Yet, those get updated from time to time. Similar issues will come with united atom force fields.

The context should be as easy as possible to create, extend, and modify without editing the code of mdanalysis.

lilyminium commented 4 years ago

@jbarnoud Thanks for having a look!

One thing to be aware of is that some contexts will require maintenance and careful versioning.

Yes, this is why I think being able to read from a file (preferably straight from a force field file) would be important. In that scenario the user would ideally be to be able to override whatever masses are generated from the parser with:

>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro')
>>> u.guess_TopologyAttr('masses', from_attrs=['names', 'resnames'], 
                         context='my_martini')

perhaps prompted by a helpful warning message.

I don't think it would be practical to try to keep up with force field updates.

richardjgowers commented 4 years ago

I like this idea. To play devil's advocate, I'm not keen on adding more lines to load a Universe - part of what the package does is try and hide complexity, ie masses are magically guessed often. But maybe it's cleaner and more correct to not guess anything by default to force users to understand what is from their file and what is derived information.

lilyminium commented 4 years ago

@richardjgowers as masses is a mandatory attribute, parsers would guess it themselves. guess_TopologyAttr just overwrites whatever values previously existed. For magical guessing, if we write the mass guesser to also consider residue names when present:

>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro', context='my_martini')

Under the hood, GROParser dumps all the available topology attributes into Masses.guess_from, and Masses.guess_from then chooses which ones to use and how. Probably in order:

-> match elements if valid, else
---> match types if valid, else
-----> match atom names [and resnames if present], else
-------> 0.0

If the user chooses not to load a new context, the result would wind up being the status quo with masses of 0.0 (assuming MARTINI particles aren't in base.)

>>> u = mda.Universe('martini.gro', context='base')

Edit: I agree that masses are important and should be guessed, and this may also apply to elements -- but I think a way to easily guess using different information to the tables in guessers, and to easily overwrite the attribute with a guess that the user manually makes, would be really convenient.

richardjgowers commented 4 years ago

So would it make sense for the context to do the guessing?

ie

context = Context.from_files(...)
masses = context.guess_masses(topology)  # modifies topology in place?

Where the default .guess_masses is just our current function

xiki-tempula commented 4 years ago

MARTINI all have either 72 (which is the case for non-polarizable version) or 36 and 72 for (polarizable version). The particle which has a mass of 36 can be easily picked up by the presence of the name.

for residue in protein.select_atoms('name D').residues:
    if residue.resname == 'LYS':
       residue.atoms.select_atoms('name Qd D').masses = 36
    elif residue.resname == 'THR':
       residue.atoms.select_atoms('name N0 D').masses = 36
jbarnoud commented 4 years ago
>>> context = Context.from_files('martini.itp', 'martini_aa.itp',
                                 'martini_ions.itp', name='my_martini')
>>> u = mda.Universe('martini.gro')
>>> u.guess_TopologyAttr('masses', from_attrs=['names', 'resnames'], 
                         context='my_martini')

I really like that. I would just advocate to ditch the name here, and to refer directly to the object. Just for the sake of reducing complexity.

lilyminium commented 4 years ago

So would it make sense for the context to do the guessing?

It could! I left it as a classmethod because I like x.from(y), but it’s probably easier to subclass and modify as a Context method. However my method allows users to match values that are not in the topology by passing it in themselves, which I thought might matter in some way.

I would just advocate to ditch the name here, and to refer directly to the object. Just for the sake of reducing complexity.

Sure. I would leave the name as an option just because it follows the topology format and because MDAnalysis might like to offer some default contexts.

jbarnoud commented 4 years ago

MARTINI all have either 72 (which is the case for non-polarizable version) or 36 and 72 for (polarizable version). The particle which has a mass of 36 can be easily picked up by the presence of the name.

for residue in protein.select_atoms('name D').residues:
    if residue.resname == 'LYS':
       residue.atoms.select_atoms('name Qd D').masses = 36
    elif residue.resname == 'THR':
       residue.atoms.select_atoms('name N0 D').masses = 36

This is not completely accurate. The "small" beads used (mostly) in rings have a mass of 45 amu. You need to know the topology of the residue to know which beads are small. Added to that, the upcoming version of the force field has 3 bead sizes with different masses and many more molecules. But it is just an example, other cg force fields do have different masses for each particles.

zemanj commented 4 years ago

tl;dr: We should not guess anything by default.

I'd like to take up what @richardjgowers said:

But maybe it's cleaner and more correct to not guess anything by default to force users to understand what is from their file and what is derived information.

In my opinion, this is not a "maybe". It may appear like a nice convenience feature if properties can be guessed automatically. However, there is no scientifically sound way to get around actually providing the correct numbers to obtain reliable data. Results obtained with the help of MDAnalysis are regularly published in scientific papers. Any such result must not rely on educated guesses but on facts. And guessing masses is not "deriving". It is guessing. After all, we cannot know for sure which masses/charges/whatever were used in the simulation unless the user provides this information. Assuring that people do their research properly is certainly not our responsibility as MDAnalysis developers. However, it should be our goal to enable people to do things correctly. In that respect, guessers are a neat feature, but from a scientific standpoint they're rather dangerous. Don't get me wrong, I think we should definitely have smart guessers. I just think they really should be disabled by default, and we should take great care letting the user know that guessed properties must be carefully checked.